New information has been added to this article since publication.
Refer to the Editor's Update below.

C++ At Work

Form Validation with Regular Expressions in MFC

Paul DiLascia

Code download available at: C.exe (248 KB)

I thought I'd use this month's column to describe an interesting app I built using the RegexWrap library described in my article "Wrappers: Use Our ManWrap Library to Get the Best of .NET in Native C++ Code" in this issue. RegexForm is a form validation system for MFC based on regular expressions. This app was my main reason for implementing RegexWrap in the first place. But since many of the details don't relate to regex per se, it makes more sense to describe RegexForm here.

[Editor's Update - 3/15/2005: The CAtlRegExp and CAtlREMatchContext classes in the ATL Server Library provide support for regular expressions.] So now that .NET provides a complete regular expressions library, why not use it in MFC apps? And with my RegexWrap library described in the main article, you don't even need the Managed Extensions or /clr.

MFC already has a mechanism for validating dialog input called Dialog Data Exchange (DDX) and Dialog Data Validation (DDV). Technically, DDX transfers data between the screen and your dialog object, whereas DDV validates the data. DDX begins when you call UpdateData from your dialog's OnOK handler:

// user pressed OK:
void CMyDialog::OnOK()
{
    UpdateData(TRUE); // get dialog data
    ...
}

UpdateData is a CWnd member function. Its Boolean argument tells whether to copy information from the screen to your dialog object or vice versa. (You can call UpdateData(FALSE) from OnInitDialog to initialize your dialog.) The CWnd implementation creates a CDataExchange object and passes it to a virtual function, DoDataExchange, which you're supposed to override to call specific DDX functions to transfer data for individual data members:

void CMyDialog::DoDataExchange(CDataExchange* pDX)
{
    CDialog::DoDataExchange(pDX);
    DDX_Text(pDX, IDC_NAME, m_name);
    DDX_Text(pDX, IDC_AGE, m_age);
    ... // etc.
}

Here IDC_NAME and IDC_AGE are the IDs of edit controls and m_name and m_age are CString and int data members. DDX_Text copies what the user entered for Name and Age into m_name and m_age (with an overload that converts Age to int along the way). The DDX functions know which way to go because CDataExchange::m_bSaveAndValidate is TRUE when copying from screen to dialog and FALSE going the other way. MFC has loads of DDX functions for all sorts of data and control types. For example, DDX_Text comes in at least a dozen overloaded flavors to copy and convert text input to various types like CString, int, double, COleCurrency and more. There's DDX_Check to convert a checkbox state to an integer value, and DDX_Radio does the same for radio buttons.

DDX functions transfer the data; DDV functions validate it. For example, to limit the user's name to 35 characters, you'd write:

// in CMyDialog::DoDataExchange
DDX_Text(pDX, IDC_NAME, m_name);   // get/set value
DDV_MaxChars(pDX, m_name, 35);     // validate

And to restrict your user's age to an integer between 1 and 120, you'd write the following:

// m_age is int
DDX_Text(pDX, IDC_AGE, m_age);
DDV_MinMaxInt(pDX, m_age, 1, 120);

While DDX works well, DDV is a bit primitive. MFC has a limited repertoire of validations it can perform. You can limit the number of characters in a text field, and you can do min/max constraints on various types. Min/max is fine, but what if you want to validate zip codes or phone numbers? MFC has nothing for this. You have to write your own DDV functions. When I first implemented validation using regular expressions, all I did was write one function, like this:

void DDV_Regex(CDataExchange* pDX, CString& val, LPCTSTR pszRegex)
{
    if (pDX->m_bSaveAndValidate) {
        CMRegex r(pszRegex);
        if (!r.Match(val).Success()) {
            pDX->Fail(); // throws exception
        }
    }
}

This lets you easily validate input using regular expressions like so:

// in CMyDialog::DoDataExchange
DDX_Text(pDX, IDC_ZIP, m_zip);
DDV_Regex(pDX, m_zip, _T("^\\d{5}(-\\d{4})?$"));
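If you want to try the zip code pattern without MFC or RegexWrap at hand, here is a minimal sketch in standard C++. It swaps in std::regex for CMRegex, so it shows only the matching step, not the DDX/DDV plumbing. The names MatchesPattern and kZipPattern are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Standard-library stand-in for the CMRegex::Match(...).Success() test
// used by DDV_Regex. Returns true when the pattern matches the value.
// The pattern carries its own ^ and $ anchors, so regex_search suffices.
bool MatchesPattern(const std::string& value, const std::string& pattern)
{
    const std::regex re(pattern);   // ECMAScript syntax by default
    return std::regex_search(value, re);
}

// Same zip code pattern as in the DDV_Regex call above.
const char* const kZipPattern = "^\\d{5}(-\\d{4})?$";
```

MatchesPattern("10025", kZipPattern) accepts a five-digit zip, and the optional group lets "10025-1234" through while rejecting a partial suffix like "10025-12".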

Pretty cool for only four lines of code. (Of course, that's assuming you have RegexWrap—otherwise you have to call the Framework Regex class directly using the managed extensions.) DDV_Regex works perfectly well within MFC's DDX/DDV scheme, but as I started adding more fields, I soon discovered some major shortcomings with DDX/DDV. For one thing, each DDV function displays an error message box and throws an exception if the field is bad, so if there are five bad fields, the user gets five message boxes—ouch! Also, I didn't want to hardcode the regular expression in the call to DDV. But my main objection to DDX/DDV is it's so procedural. To validate a new field, you have to add another data member and more code in DoDataExchange, which pretty soon ends up rambling, like this:

DDX_Text(pDX, IDC_FOO,...);
DDV_Mumble(pDX, ...)
DDX_Text(pDX, IDC_BAR,...);
DDV_Bletch(...)
... // etc for 14 lines

Why should I have to write procedural instructions to describe validation rules that are inherently static? One of my top five programming mantras is: abhor procedural code. Another one is: one table is worth a thousand lines of code. No doubt you can guess where this is heading. In the end, I wrote my own dialog validation system, one that's rule-based, ergo table-driven. It rides on top of DDX, but avoids DDV and has a much nicer user interface. It's easy to use and, of course, uses regular expressions to do the validating. Everything is encapsulated in a class CRegexForm you can use in any MFC dialog.

Figure 1 TestForm Tooltip Hints


Naturally, I wrote a test program to show how it works. At first glance TestForm looks like a vanilla MFC dialog-based app. Its main dialog has several edit fields: Zip Code, SSN (Social Security number), Phone Number, and so on. But when you test-drive TestForm, you quickly realize it has more under the hood than you thought. If you tab to an input field, TestForm displays a tooltip hint that describes what you can enter (see Figure 1). If you type an illegal character (for example, a letter in Phone Number), TestForm rejects the character and beeps. When you press Enter or OK, you'll get an error message like the one in Figure 2 describing all the bad fields. All the errors appear in one message box, as opposed to one message box for each error. Then when the user tabs to one of these bad fields, TestForm displays the error message again in an application-supplied "feedback" window inside the dialog itself (see Figure 3). Users don't have to remember what the error message said; the form reminds them as they correct each bad field. And if only one field is bad, TestForm forgoes the message box and instead displays the error in the feedback window.

Figure 2 Wrong Fields Aplenty


CRegexForm performs all this magic by itself. All you have to do is use it, and that's straightforward. First, you have to define your form. Here's how TestForm does it, in MainDlg.cpp:

// form/field map
BEGIN_REGEX_FORM(MyRegexForm)
    RGXFIELD(IDC_ZIP,RGXF_REQUIRED,0)
    RGXFIELD(IDC_SSN,0,0)
    RGXFIELD(IDC_PHONE,0,0)
    RGXFIELD(IDC_TOKEN,0,0)
    RGXFIELD(IDC_PRIME,RGXF_CALLBACK,0)
    RGXFIELD(IDC_FAVCOL,0,CMRegex::IgnoreCase)
END_REGEX_FORM()

The macros define a static table that describes each edit field. In most cases, all you need is the control ID, but there's also room for flags and RegexOptions. For example, in TestForm, Zip Code is required (RGXF_REQUIRED), Prime Number uses a callback (more shortly) and Favorite Columnist (IDC_FAVCOL) specifies CMRegex::IgnoreCase, which makes it case-insensitive.
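To make "static table" concrete, here is a rough sketch of the kind of array such macros could expand to. This is a guess at the shape, not the actual RegexForm source: FieldDef, its member names other than id, and the constant values are all hypothetical. The one detail taken from TestForm is that the table ends with an entry whose id is zero, since CMainDlg::OnOK walks MyRegexForm[i].id until it hits zero.

```cpp
#include <cstddef>

// Hypothetical field-table layout; the real structure may differ.
struct FieldDef {
    unsigned id;       // control ID; 0 marks the end of the table
    unsigned flags;    // e.g. a "required" or "callback" bit
    unsigned options;  // regex options such as ignore-case
};

// Illustrative values only; not the real IDC_ or RGXF_ constants.
enum : unsigned { kIdZip = 101, kIdSsn = 102, kIdPhone = 103 };
enum : unsigned { kRequired = 0x1, kCallback = 0x2 };

// What a BEGIN/RGXFIELD/END sequence could plausibly expand to.
static const FieldDef kMyForm[] = {
    { kIdZip,   kRequired, 0 },
    { kIdSsn,   0,         0 },
    { kIdPhone, 0,         0 },
    { 0,        0,         0 },   // sentinel entry
};

// Counts entries by walking the table until the zero-ID sentinel.
std::size_t CountFields(const FieldDef* table)
{
    std::size_t n = 0;
    while (table[n].id != 0) ++n;
    return n;
}

// Linear lookup by control ID; returns nullptr if the ID is absent.
const FieldDef* FindField(const FieldDef* table, unsigned id)
{
    for (; table->id != 0; ++table)
        if (table->id == id) return table;
    return nullptr;
}
```

The appeal of the table is that adding a field means adding one row; none of the code that walks the table has to change.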

Figure 3 Pietrek? I Think Not!


Looking at the table, you might wonder where the regular expressions are. Answer: in the resource file. For each field/control ID, CRegexForm expects a resource string with the same ID. The resource string comprises up to five substrings separated by newline characters ('\n'). The general format is "Name\nRegex\nLegalChars\nHint\nErrMsg". Here's the string for IDC_ZIP:

"Zip Code\n^\\d{5}(-\\d{4})?$\n[\\d-]\n##### or #####-####"

The first substring, "Zip Code," is the field name. The second, "^\d{5}(-\d{4})?$," is the regular expression used to validate the zip. (You have to type two backslashes in the resource string in order to escape a regular expression backslash.) The third substring is another regular expression describing the legal characters. For Zip, it's "[\d-]", which allows digits or hyphen. If your field has no character restrictions, you can omit LegalChars by typing two newlines ("\n\n") for an empty substring. The fourth substring with all the pound signs is the tooltip hint. Finally, you can provide a fifth substring, the error message to display if the field is bad. For Zip Code, there's no error message, so CRegexForm generates one of the form "Should be xxx", where xxx is replaced with the hint. "Should be" is itself another resource string (more on that shortly). Of all these substrings, only the first (field name) is required.

Why use resource strings to hold all this information instead of coding it directly in the field map? For one thing, putting it in the map would make the code ungainly. It's much more tidy to stash the strings out of the way. And since macros can't have optional arguments, you'd need several macros like RGXFIELD3, RGXFIELD4 and RGXFIELD5 depending on how many arguments you want to use. How gauche is that? But the real reason for resource strings is to make localization easy. Translators can translate the strings and create resource DLLs for different locales. The regular expressions themselves might even require translation (ZIP codes look different in other countries like Britain or Botswana), so they go in the resource file too.

While I'm here, let me point out in passing how easy it is to parse these substrings with regular expressions. MFC has a 26-line function AfxExtractSubString to parse document substrings, but CRegexForm uses CMRegex to do it in one line!

CString str;
str.LoadString(nID);
vector<CString> substrs = CMRegex::Split(str, _T("\n"));

Now substrs[i] is the ith substring, and if you want to know how many there are, just call substrs.size(). I'm sure glad I wrapped the Split function to return an STL vector.
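For comparison, here is roughly what the same split looks like with nothing but the standard library. It is only a sketch: SplitKeepEmpty and PartOrEmpty are names invented for this example, and it uses std::string rather than CString. The one behavior that matters for the resource-string format is that empty pieces are kept, so an omitted LegalChars still occupies its slot.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits text on a single-character delimiter and keeps empty pieces,
// so "a\n\nb" yields {"a", "", "b"}.
std::vector<std::string> SplitKeepEmpty(const std::string& text, char delim)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));   // last piece
            break;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Returns the i-th substring, or an empty string when it was left off
// the end (for example, a field string with no ErrMsg).
std::string PartOrEmpty(const std::vector<std::string>& parts, std::size_t i)
{
    return i < parts.size() ? parts[i] : std::string();
}
```

Applied to the IDC_ZIP string shown earlier, this yields four pieces, and asking for the fifth (the error message) returns an empty string.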

Once you've defined your field map using BEGIN/END_REGEX_FORM and written your resource strings, the next thing to do is to instantiate a CRegexForm in your dialog and initialize it:

// in OnInitDialog m_rgxForm.Init(MyRegexForm, this, IDS_MYREGEXFORM, MYWM_RGXFORM_MESSAGE);

Naturally, CRegexForm needs the field map and pointer to your dialog; the third and fourth arguments are another resource string and callback message ID. Like the individual field strings, the initialization string comprises multiple substrings separated by newlines. For TestForm, IDS_MYREGEXFORM is "Error: %s\nRequired\nShould be: %s\nBad Value". The first substring "Error: %s" is the error prefix. CRegexForm uses this to display "Error: xxx," where xxx is the actual error message. The second substring, "Required," is the word/phrase to use when a field is required (RGXF_REQUIRED). The third substring, "Should be: %s," is the one I described earlier. CRegexForm uses it to generate an error message "Should be: xxx" where xxx is the field hint. The last substring, "Bad Value," is a catch-all error message CRegexForm uses if the field has no hint and no error message. Users should never see this string because of course you're going to write a hint or error message for every field. Right?
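The fallback order for a mismatched field (explicit error message, then "Should be: hint", then "Bad Value") can be sketched in a few lines of standard C++. This is an illustration of the rule as described here, not the CRegexForm source; FormStrings, Substitute, MismatchMessage and WithPrefix are hypothetical names.

```cpp
#include <string>

// Form-level strings, in the order they appear in IDS_MYREGEXFORM.
struct FormStrings {
    std::string errorPrefix;   // "Error: %s"
    std::string required;      // "Required"
    std::string shouldBe;      // "Should be: %s"
    std::string badValue;      // "Bad Value"
};

// Replaces the first "%s" in fmt with arg; returns fmt unchanged if
// there is no "%s" to replace.
std::string Substitute(const std::string& fmt, const std::string& arg)
{
    std::string out = fmt;
    const std::string::size_type pos = out.find("%s");
    if (pos != std::string::npos) out.replace(pos, 2, arg);
    return out;
}

// Picks the message for a field whose value did not match:
// the field's own error message, else one built from its hint,
// else the catch-all.
std::string MismatchMessage(const FormStrings& fs,
                            const std::string& fieldError,
                            const std::string& fieldHint)
{
    if (!fieldError.empty()) return fieldError;
    if (!fieldHint.empty())  return Substitute(fs.shouldBe, fieldHint);
    return fs.badValue;
}

// Wraps a message in the error prefix, e.g. "Error: Bad Value".
std::string WithPrefix(const FormStrings& fs, const std::string& message)
{
    return Substitute(fs.errorPrefix, message);
}
```

For Zip Code, which has a hint but no error message, this produces "Should be: ##### or #####-####".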

The last Init argument, MYWM_RGXFORM_MESSAGE, is an app-defined callback message ID that lets CRegexForm talk to your app and do things like custom validation that requires procedural code. If you need to use mathematical algorithms or check the phase of the moon to validate your input, you can set RGXF_CALLBACK in your field flags and CRegexForm will send your dialog the callback message with notification code RGXNM_VALIDATEFIELD when it's time to validate. TestForm uses a callback to validate its Prime Number field; Figure 4 shows the details.

Figure 4 Procedural Validation Using Callback Message

//////////////////
// Handle notification from Regex Form Manager:
// Do custom validation for Prime Number.
//
LRESULT CMainDlg::OnRgxFormMessage(WPARAM wp, LPARAM lp)
{
    UINT nID = LOWORD(wp);
    UINT nCode = HIWORD(wp);
    if (nCode==RGXNM_VALIDATEFIELD) {
        // custom validation:
        if (nID==IDC_PRIME) {
            const CString& val = *(CString*)lp;
            if (val.IsEmpty())
                return RGXERR_OK;
            int p = _tstoi(val);
            return IsPrime(p) ? RGXERR_OK : RGXERR_NOMATCH;
        }
        ASSERT(FALSE); // shouldn't happen
    }
    return 0;
}
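Figure 4 calls an IsPrime helper that isn't shown. The version below is an assumed stand-in using plain trial division, not necessarily what TestForm ships with. It is plenty for a form field, where the input is a single int.

```cpp
// Trial-division primality test; an assumed stand-in for the IsPrime
// helper called in Figure 4. Returns false for anything below 2.
bool IsPrime(int n)
{
    if (n < 2) return false;
    if (n < 4) return true;              // 2 and 3 are prime
    if (n % 2 == 0) return false;        // other even numbers are not
    // Test odd divisors up to sqrt(n). Writing the bound as d <= n / d
    // avoids the overflow that d * d <= n could hit for large n.
    for (int d = 3; d <= n / d; d += 2)
        if (n % d == 0) return false;
    return true;
}
```

Because _tstoi returns 0 for text that isn't a number, nonnumeric input falls into the n < 2 case and is rejected along with 0 and 1.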

CRegexForm does DDX using its own internal CStrings, so you don't have to define a dialog member for each text field. All you have to do is call CRegexForm to transfer the data.

void CMyDialog::DoDataExchange(CDataExchange* pDX)
{
    CDialog::DoDataExchange(pDX);
    m_rgxForm.DoDataExchange(pDX);
}

When you initialize CRegexForm, it allocates an array of protected FLDINFO structures, one for each field in your map. One of the FLDINFO members is FLDINFO::val, a CString that holds the current field value. Internally, CRegexForm uses DDX_Text with this CString. You can get or set the internal field values by calling CRegexForm::GetFieldValue or SetFieldValue, both of which use the control ID to identify the field.

m_rgxForm.SetFieldValue(IDC_ZIP,_T("10025"));

CRegexForm treats all values as text and stores them in CStrings, but provides GetFieldValInt and GetFieldValDouble methods to get a value as int or double. For other types, you have to do your own conversion—or you can still use MFC's DDX functions in DoDataExchange. TestForm has a Populate button that uses CRegexForm::SetFieldValue to fill the form with the sample data you see in Figure 3. In general, CRegexForm uses control IDs to identify fields. It has methods called GetFieldName, GetFieldHint, and GetFieldError to get a field's name, hint, and error code—all of which take the control ID as parameter.

So far I've shown you how to create the field map, write the resource strings, initialize your CRegexForm, and hook it up through DDX. All that remains is to actually validate the user's input. The time to do it is when the user presses OK:

void CMyDialog::OnOK()
{
    UpdateData(TRUE); // copy screen->dialog
    int nBad = m_rgxForm.Validate();
    if (nBad>0) {
        m_badFields = m_rgxForm.GetBadFields();
        ...
    }
}

UpdateData invokes MFC's DDX mechanism, which calls your dialog's DoDataExchange. DoDataExchange then calls CRegexForm::DoDataExchange, which copies the user's input to its internal FLDINFO structures. Now CRegexForm::Validate iterates the fields, calling CMRegex::Match to validate each one against its regular expression. If the field is bad, CRegexForm sets an error code RGXERR_NOMATCH in its internal FLDINFO, or RGXERR_MISSING for required fields that are empty. Validate returns the number of bad fields. If there are any, you can call CRegexForm::GetBadFields to get an array (STL vector) of bad field IDs. You can then iterate the array to get each error code and error message. This is what CMainDlg in TestForm does to build its error message box, like the one in Figure 2. If only one field is bad, CMainDlg calls CRegexForm::ShowBadField to highlight the field and display an error message in the feedback window, as in Figure 3. If all the fields are OK, TestForm displays a message box showing the values you entered (see Figure 5). In a real app, you'd copy the values to their ultimate destinations. Figure 6 shows the full code for CMainDlg::OnOK. By decoupling data validation from data exchange, CRegexForm gives you greater control over your UI, and lets you avoid MFC's hardwired error messages.
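The validate pass is easy to picture as a loop over the fields. The sketch below uses std::regex and std::string instead of CMRegex and CString, and the Field structure and error names are stand-ins for FLDINFO and the RGXERR_ codes. One assumption worth flagging: an empty field that isn't required is treated as OK, which matches the way the Prime Number callback accepts an empty value.

```cpp
#include <regex>
#include <string>
#include <vector>

// Stand-ins for the RGXERR_ codes.
enum FieldError { kOk = 0, kMissing, kNoMatch };

// Stand-in for the per-field state kept in FLDINFO.
struct Field {
    unsigned    id;
    bool        required;
    std::string pattern;   // empty means "no pattern to check"
    std::string value;
    FieldError  error;
};

// Checks every field, records an error code in each one, and returns
// the IDs of the bad fields in table order.
std::vector<unsigned> ValidateAll(std::vector<Field>& fields)
{
    std::vector<unsigned> bad;
    for (Field& f : fields) {
        f.error = kOk;
        if (f.value.empty()) {
            // Assumption: empty is an error only for required fields.
            if (f.required) f.error = kMissing;
        } else if (!f.pattern.empty() &&
                   !std::regex_search(f.value, std::regex(f.pattern))) {
            f.error = kNoMatch;
        }
        if (f.error != kOk) bad.push_back(f.id);
    }
    return bad;
}
```

The caller gets both the count (the size of the returned vector) and the IDs, and decides for itself how to present the errors. That separation is the whole point of decoupling validation from exchange.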

Figure 6 CMainDlg::OnOK

//////////////////
// User pressed OK: validate form and display results (error message
// or field values), but don't call base class to end dialog.
//
void CMainDlg::OnOK()
{
    UpdateData(TRUE);                   // get dialog data
    int nBad = m_rgxForm.Validate();    // validate
    CString msg;
    if (nBad>0) {
        vector<UINT> badFields = m_rgxForm.GetBadFields();
        BOOL beep = TRUE;
        if (nBad>1) {
            // Multiple bad fields: show message box with bad fields.
            msg = _T("The following fields are bad:\n\n");
            vector<UINT>::iterator it;
            for (it = badFields.begin(); it!=badFields.end(); it++) {
                UINT nID = *it;
                CString s;
                s.Format(_T("%s: %s\n"),
                    m_rgxForm.GetFieldName(nID),
                    m_rgxForm.GetFieldErrorMsg(nID));
                msg += s;
            }
            MessageBox(msg,_T("Oops--Some fields are bad."),
                MB_ICONEXCLAMATION);
            beep = FALSE; // message box already beeped; don't beep again
        }
        // highlight first bad field, whether one or many
        UINT nID = badFields[0];
        m_rgxForm.ShowBadField(nID, beep, TRUE);

    } else {
        // all fields OK: show feedback
        msg = _T("You Entered:\n\n");
        for (int i=0; MyRegexForm[i].id; i++) {
            CString name = m_rgxForm.GetFieldName(MyRegexForm[i].id);
            CString val = m_rgxForm.GetFieldValue(MyRegexForm[i].id);
            if (val.IsEmpty())
                val = _T("(nothing)");
            CString temp;
            temp.Format(_T("%s = %s\n"), name, val);
            msg += temp;
        }
        MessageBox(msg,_T("Congratulations! All fields OK."),MB_OK);
        m_rgxForm.Feedback(_T(" All fields OK or empty!"));
    }
}

Figure 5 Data Entered


I mentioned the feedback window. CRegexForm manages it entirely; all you have to do is provide one by calling CRegexForm::SetFeedBackWindow. You can specify a color for error messages. CRegexForm also takes care of hints. By default, it displays the field hint whenever the user tabs to a new field (see Figure 1). You can call CRegexForm::SetShowHints(FALSE) to turn hints off. SetShowHints(TRUE, nDelay, nTimeout) turns them on again, where nDelay is the number of milliseconds to wait before showing the hint (default=250) and nTimeout is the number of milliseconds to show the hint (default=0, forever). CRegexForm automatically kills the hint when the user leaves the field (EN_KILLFOCUS). TestForm uses SetShowHints to implement a checkbox that turns hints on or off (see Figure 1).

There's one other feature I hesitate to mention. CRegexForm has an option to do immediate validation. I don't recommend it because I think it's bad GUI design, but there are times when you don't want to let the user tab to the next field without entering valid information. In that case, you can use RGXF_IMMED in your field's flags, or call SetValidateImmed to make all the fields validate immediately. TestForm has a checkbox to turn this immediate validation feature on or off. If you check it, you'll see why immediate validation is a bad idea.

Finally, what about min/max and character-limit constraints? Min/max is one thing regular expressions can't do. And while ".{0,35}" is a regex that describes all strings at most 35 characters long, you really want to use EM_LIMITTEXT to limit the edit control text length, so the control beeps when the user types too many characters. Since I couldn't bear to implement a form-validation system that lacks any feature already supported by MFC, I introduced the notion of "pseudo regular expressions." For example, the regex for IDC_AGE (the Age field) is "rgx:minmax:int:1:120,maxchars:3". Obviously, this isn't a real regular expression. It's a pseudo-expression that CRegexForm recognizes and interprets on its own. The general format is "rgx:expr,expr,..expr", where each expression describes a different constraint. Currently, only two are supported: "minmax:type:minval:maxval" (where type is int or double) and "maxchars:maxval". CRegexForm parses these specially, and uses EM_LIMITTEXT for maxchars, just as DDV_MaxChars does. For details, download the source. As with resource substrings, regular expressions make parsing these expressions a piece o' cake.
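To show what interpreting a pseudo-expression involves, here is a small parser for the two documented forms. It is a sketch under stated assumptions: it uses string streams rather than the regular expressions CRegexForm itself relies on, and Constraints, ParsePseudoRegex and InRange are names invented for the example.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Constraints recovered from a pseudo-expression.
struct Constraints {
    bool   hasMinMax = false;
    bool   isInt     = false;   // true for "int", false for "double"
    double minVal    = 0;
    double maxVal    = 0;
    int    maxChars  = 0;       // 0 means no limit was given
};

// Parses "rgx:expr,expr,...". Returns false if the text lacks the
// "rgx:" prefix or holds an expression this sketch doesn't recognize.
bool ParsePseudoRegex(const std::string& text, Constraints& out)
{
    const std::string prefix = "rgx:";
    if (text.compare(0, prefix.size(), prefix) != 0) return false;

    std::istringstream exprs(text.substr(prefix.size()));
    std::string expr;
    while (std::getline(exprs, expr, ',')) {       // one constraint each
        std::istringstream parts(expr);
        std::string kind, a, b, c;
        std::getline(parts, kind, ':');
        if (kind == "minmax") {                    // minmax:type:min:max
            if (!std::getline(parts, a, ':') ||
                !std::getline(parts, b, ':') ||
                !std::getline(parts, c, ':')) return false;
            if (a != "int" && a != "double") return false;
            out.hasMinMax = true;
            out.isInt  = (a == "int");
            out.minVal = std::strtod(b.c_str(), nullptr);
            out.maxVal = std::strtod(c.c_str(), nullptr);
        } else if (kind == "maxchars") {           // maxchars:maxval
            if (!std::getline(parts, a, ':')) return false;
            out.maxChars = std::atoi(a.c_str());
        } else {
            return false;
        }
    }
    return true;
}

// Inclusive range check; passes everything when no minmax was given.
bool InRange(const Constraints& c, double v)
{
    return !c.hasMinMax || (v >= c.minVal && v <= c.maxVal);
}
```

For the Age field's "rgx:minmax:int:1:120,maxchars:3", this yields an int range of 1 to 120 and a three-character limit; an ordinary regular expression fails the prefix test and would go down the normal matching path instead.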

I've shown you all the things CRegexForm can do. How does it work? There isn't enough space for a full explanation here, but I can sketch the big picture. CRegexForm uses my CSubclassWnd to subclass your dialog. For readers who haven't seen it before, CSubclassWnd is a class I wrote eons ago that uses Windows subclassing to trap messages sent to another window. What's so cool about CSubclassWnd is it lets you subclass an MFC window without inserting a new class into your hierarchy. I could've derived CRegexForm from CDialog, but then you'd have to derive your dialog from CRegexForm. And what if you've already implemented your own CBetterDialog derived from CDialog? In that case, you'd have to perform surgery to insert CRegexForm, with unpredictable results.

This is one of the big flaws in MFC: it uses derivation to implement subclassing, so the class hierarchy exactly mirrors the Windows subclassing. But there's no need for this, and lots of times it's preferable to write plug-in classes like CRegexForm that subclass your dialog without making you alter your class hierarchy. For me CSubclassWnd is so indispensable I can't program without it!

CRegexForm uses CSubclassWnd to intercept EN_KILLFOCUS and EN_SETFOCUS to hide and show hints, and EN_CHANGE to clear the field's error state as soon as the user types anything. CRegexForm implements the hints themselves using my CPopupText class, first described in my September 2000 column. To prevent users from typing disallowed characters, CRegexForm installs another CSubclassWnd-derived hook for each edit control that has a LegalChars regular expression. This nested class, CRegexForm::CEditHook, intercepts WM_CHAR messages sent to the edit control to eat-and-beep any characters that aren't allowed (eat the character and call MessageBeep). For details, you can download the source, which is in the RegexWrap download for my article "Wrappers: Use Our ManWrap Library to Get the Best of .NET in Native C++ Code" in this issue.
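The decision the edit hook makes for each WM_CHAR boils down to one question: does this character match LegalChars? Here is a sketch of that test in standard C++, with std::regex standing in for CMRegex. IsCharAllowed is a hypothetical name, and letting control characters through (so Backspace keeps working) is my assumption, not something taken from the CEditHook source.

```cpp
#include <regex>
#include <string>

// Returns true if ch is allowed by the LegalChars pattern.
// Assumption: control characters (below 0x20, such as backspace) always
// pass, so editing keys keep working even with a strict pattern.
bool IsCharAllowed(char ch, const std::regex& legalChars)
{
    if (static_cast<unsigned char>(ch) < 0x20) return true;
    const std::string s(1, ch);                // one-character string
    return std::regex_match(s, legalChars);    // must match entirely
}
```

With the Zip Code pattern "[\\d-]", digits and the hyphen get through while letters and spaces don't. A hook that gets false back would swallow the message and call MessageBeep rather than pass the character on to the edit control.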

Send your questions and comments for Paul to cppqa@microsoft.com.

Paul DiLascia is a freelance writer, consultant, and Web/UI designer-at-large. He is the author of Windows++: Writing Reusable Windows Code in C++ (Addison-Wesley, 1992). Paul can be reached at www.dilascia.com.