Handling Sorting in Your Applications
Some applications, such as Microsoft Active Directory, Microsoft Exchange, and Microsoft Access, maintain a sortable database of locale and language strings indexed by name (UTF-16 string), and their associated sorting weights.
Sorting is usually intuitive for users in their own locales. However, it can be non-intuitive for application developers. This topic discusses considerations for handling sorting in your applications. Sorting can be either linguistic or ordinal (non-linguistic).
You can use a variety of sorting functions in your applications:
- NLS string comparison functions. Examples are CompareString and CompareStringEx, CompareStringOrdinal, LCMapString, LCMapStringEx, FindNLSString, FindNLSStringEx, and FindStringOrdinal. See Security Considerations: International Features for a discussion of security issues related to the string comparison functions.
- Wrapper functions that internally call the string comparison functions. The most common functions are lstrcmp and lstrcmpi, which call CompareString.
Usually the sorting functions evaluate strings character by character. However, many languages have multiple-character elements, such as the two-character pair "CH" in traditional Spanish. CompareString and CompareStringEx use the application-supplied locale identifier or name to identify multiple-character elements. In contrast, lstrcmp and lstrcmpi use the user's locale.
Another example is Vietnamese, which contains many two-character elements, such as the valid uppercase, title case, and lowercase forms of "GI", which are "GI", "Gi", and "gi", respectively. Any of these forms is treated as a single sorting element and, if casing is ignored, compares as equal. However, because "gI" is not valid as a single element, CompareString, CompareStringEx, lstrcmp, and lstrcmpi treat "gI" as two separate elements.
The functions CompareString, CompareStringEx, lstrcmp, lstrcmpi, LCMapString, LCMapStringEx, FindNLSString, and FindNLSStringEx all default to use of a "word sort" technique. For this type of sort, all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.
Instead of a word sort, the application can request a "string sort" technique from the sorting functions by specifying the SORT_STRINGSORT flag. A string sort treats the hyphen and apostrophe just like any other nonalphanumeric character. Their positions in the sorting sequence are before the alphanumeric characters.
The following table compares the results of a word sort with the results of a string sort.
|Word Sort||String Sort|
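The difference between the two techniques can be sketched with a toy comparator. The following C code is not the NLS implementation and does not call the NLS functions. It assumes ASCII-only input and models a word sort simply by skipping the hyphen and apostrophe on the first pass. The names toy_word_compare and toy_string_compare are hypothetical.

```c
#include <string.h>

/* The hyphen and apostrophe get special treatment in a word sort. */
static int is_word_punct(char c)
{
    return c == '-' || c == '\'';
}

/* Toy string sort: plain byte order. In ASCII the hyphen and apostrophe
   already precede digits and letters, like other punctuation marks. */
int toy_string_compare(const char *a, const char *b)
{
    int c = strcmp(a, b);
    return (c > 0) - (c < 0);
}

/* Toy word sort: compare while skipping hyphens and apostrophes, so that
   "co-op" is ordered by the letters c, o, o, p. */
int toy_word_compare(const char *a, const char *b)
{
    const char *pa = a, *pb = b;
    for (;;) {
        while (is_word_punct(*pa)) pa++;
        while (is_word_punct(*pb)) pb++;
        if (*pa != *pb || *pa == '\0') break;
        pa++;
        pb++;
    }
    if (*pa != *pb)
        return (unsigned char)*pa < (unsigned char)*pb ? -1 : 1;

    /* Same letters: the shorter (less punctuated) form goes first. */
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return la < lb ? -1 : 1;
    return toy_string_compare(a, b);
}
```

Under the toy word sort, "co-op" sorts after "con" and directly next to "coop". Under the toy string sort, "co-op" sorts before "con" because the hyphen precedes every letter.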
Sort Strings Linguistically
For compatibility with Unicode, an application should prefer CompareStringEx or the Unicode version of CompareString. Another reason for preferring CompareStringEx is that Microsoft is migrating toward the use of locale names instead of locale identifiers for new locales, for interoperability reasons. Any application that runs only on Windows Vista and later should use CompareStringEx.
Another way of testing for linguistic equality is to use lstrcmp or lstrcmpi, which always use a word sort. The lstrcmpi function calls CompareString with the NORM_IGNORECASE flag, while lstrcmp calls it without that flag. For an overview of the use of the wrapper functions, see Strings.
The functions retrieve linguistically appropriate results for all locales. User expectations for different locales can differ significantly in sorting behavior, as shown in the following examples.
- Many locales equate the ae ligature (æ) with the letters ae. However, Icelandic (Iceland) considers it a separate letter and places it after Z in the sorting sequence.
- The A Ring (Å) normally sorts with merely a diacritic difference from A. However, Swedish (Sweden) places the A Ring after Z in the sorting sequence.
The functions attempt to verify rigorously that code points defined in the Unicode standard are canonically equal to a string of equivalent code points. For example, the code point that represents a lowercase "u" with a dieresis (ü) is canonically equal to a lowercase "u" combined with the dieresis (¨). Note, however, that canonical equivalence is not always possible.
Because almost all data entered using Windows keyboards and input method editors (IMEs) conforms to the form C normalization defined in the Unicode standard, converting incoming data from other platforms using the NLS Unicode normalization functions provides the most consistent results, especially for locales that use the Tibetan script or the Hangul script for modern Hangul. For more information on Unicode normalization support in Windows Vista and later, see Using Unicode Normalization to Represent Strings.
When string comparison follows the user's language preference, for example, when sorting items for an ordered ListView control, the application can do one of the following:
- Call lstrcmp or lstrcmpi, which use the user's locale.
- Call CompareString or CompareStringEx to define a locale for the comparison, to pass additional flags, to embed null characters, or to pass explicit lengths to match parts of a string.
When the results of the comparison should be consistent regardless of locale, for example, when comparing retrieved data against a predefined list or an internal value, the application should use CompareString or CompareStringEx with the Locale parameter set to LOCALE_INVARIANT. For CompareString, either of the following calls will match even if mystr is "INLAP". In this case, a locale-sensitive call to lstrcmpi will fail if the current locale is Vietnamese.
On Windows XP:
int iReturn = CompareString(LOCALE_INVARIANT, NORM_IGNORECASE, mystr, -1, _T("InLap"), -1);
On earlier operating systems:
DWORD lcid = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
int iReturn = CompareString(lcid, NORM_IGNORECASE, mystr, -1, _T("InLap"), -1);
Sort Strings Ordinally
For ordinal (non-linguistic) sorting, your applications should always use the CompareStringOrdinal function.
This function is only available for Windows Vista and later.
CompareStringOrdinal compares two Unicode strings to test for binary equality, as opposed to linguistic equality. Examples of such non-linguistic strings are NTFS file names, environment variables, and the names of mutexes, named pipes, or mailslots. Except for the option of case-insensitivity, this function disregards all non-binary equivalences. Unlike some other sorting functions, it tests all code points for equality, including those that are not given any weight in linguistic sorting schemes.
- Canonically equivalent sequences in Unicode, such as LATIN SMALL LETTER A WITH RING ABOVE (U+00e5) and LATIN SMALL LETTER A + COMBINING RING ABOVE (U+0061 U+030a), are not equal even though they appear identical ("å").
- Canonically similar strings in Unicode, such as LATIN LETTER SMALL CAPITAL Y (U+028f) and LATIN CAPITAL LETTER Y (U+0059), which look very similar ("ʏ" and "Y") and vary only by some special case weights in the linguistic tables, are considered to be entirely dissimilar characters. Even if the application sets bIgnoreCase to TRUE, these strings compare as different.
- Code points that are defined but have no linguistic sorting weight, such as ZERO WIDTH JOINER (U+200d), are treated as having their code point weights.
- Code points that are defined in later versions of Unicode but have no weight in current linguistic tables are treated as having their code point weights.
- Code points that are undefined by Unicode are treated as having their code point weights.
- When the application sets bIgnoreCase to TRUE, the function maps case using the operating system uppercasing table, instead of the information in the linguistic sorting tables. Thus the mapping is independent of locale.
For more information about canonically equivalent sequences in Unicode and canonically similar strings in Unicode, see Using Unicode Normalization to Represent Strings.
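The first point in the list above can be illustrated without calling the API. The following C sketch compares UTF-16 code units by numeric value, which is an assumption about what a case-sensitive binary comparison amounts to, not the implementation of CompareStringOrdinal. The function name and the two sample arrays are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Case-sensitive comparison of UTF-16 code units by numeric value.
   Returns -1, 0, or 1. No normalization or linguistic weights are applied. */
int ordinal_compare_utf16(const uint16_t *a, size_t alen,
                          const uint16_t *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    /* One string is a prefix of the other: the shorter one sorts first. */
    return (alen > blen) - (alen < blen);
}

/* "å" as one precomposed code point, and as "a" + COMBINING RING ABOVE. */
const uint16_t a_ring_precomposed[] = { 0x00E5 };
const uint16_t a_ring_decomposed[]  = { 0x0061, 0x030A };
```

Comparing a_ring_precomposed (length 1) with a_ring_decomposed (length 2) yields a nonzero result, even though both sequences render as "å". An application that needs them to match would normalize both strings before the binary comparison.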
Sort Code Points
Some Unicode code points have no weight, for example, ZERO WIDTH NON JOINER, U+200c. The sorting functions intentionally evaluate the no-weight code points as equivalent because they have no weight in sorting. On Windows Vista and later, the application can sort these code points by calling the NLS string comparison functions, particularly CompareStringOrdinal, for evaluation of all code points in a literal, binary sense, for example, in password validation. On pre-Windows Vista operating systems, the application should use the C runtime function strcmp or wcscmp.
Sorting functions ignore diacritics, such as NON SPACING BREVE, U+0306, when the application specifies the NORM_IGNORENONSPACE flag. Similarly, these functions ignore symbols, for example, EQUALS SIGN, U+003d, when the NORM_IGNORESYMBOLS flag is specified. On Windows Vista and later, the application can call CompareStringOrdinal for evaluation of diacritics and symbol code points in a literal, binary sense. On pre-Windows Vista operating systems, the application should use strcmp or wcscmp.
Some code points, such as 0xFFFF and 0x058b, are currently not assigned in Unicode. These code points do not receive any weight in sorting, and should never be passed to the sorting functions. The application should use IsNLSDefinedString to detect non-Unicode code points in a data stream.
Results of IsNLSDefinedString might vary depending on the Unicode version passed if a character is added to Unicode in a later version and it is subsequently added to the Windows sorting tables. For more information, see Use Sort Versioning.
Sort Digits as Numbers
On Windows 7 and later, the application can call CompareString, CompareStringEx, LCMapString, or LCMapStringEx using the SORT_DIGITSASNUMBERS flag. This flag supports sorting that treats digits as numbers, for example, sorting of "2" before "10".
Note that the use of this flag is not appropriate for strings of hexadecimal digits.
In that case the "numbers" (the runs of decimal digits) are sorted in order, but the user perceives a poorly sorted hexadecimal list.
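The effect of treating digits as numbers, and the hexadecimal pitfall, can be sketched in portable C. The comparator below is a hypothetical, ASCII-only illustration of the idea. It is not how the NLS functions implement SORT_DIGITSASNUMBERS, and the name compare_digits_as_numbers is made up for this sketch.

```c
#include <ctype.h>
#include <string.h>

/* Compare two ASCII strings, treating each run of decimal digits as one
   number. Leading zeros are ignored. Returns -1, 0, or 1. */
int compare_digits_as_numbers(const char *a, const char *b)
{
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            while (*a == '0') a++;
            while (*b == '0') b++;

            /* Measure the remaining digit runs. */
            size_t la = 0, lb = 0;
            while (isdigit((unsigned char)a[la])) la++;
            while (isdigit((unsigned char)b[lb])) lb++;

            /* More digits means a larger number. */
            if (la != lb)
                return la < lb ? -1 : 1;

            /* Same digit count: the first differing digit decides. */
            int c = memcmp(a, b, la);
            if (c != 0)
                return c < 0 ? -1 : 1;
            a += la;
            b += lb;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    if (*a == *b)
        return 0;          /* both strings ended together */
    return *a ? 1 : -1;    /* the longer string sorts last */
}
```

With this comparator, "2" sorts before "10", whereas strcmp places "10" first. Applied to hexadecimal text, however, "1A" (26) sorts before "9" because only the leading decimal digit run "1" is read as a number.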
When transforming between uppercase and lowercase, LCMapString and LCMapStringEx always map a single character to a single character. For example, the LCMAP_LOWERCASE and LCMAP_UPPERCASE flags map the German Sharp S ("ß") to itself. The LCMAP_UPPERCASE flag does not map "ß" to "SS". The LCMAP_LOWERCASE flag never maps "SS" to "ß".
When transforming between uppercase and lowercase, the function is not sensitive to context. For example, while the LCMAP_UPPERCASE flag correctly maps both Greek lowercase sigma ("σ") and Greek lowercase final sigma ("ς") to Greek uppercase sigma ("Σ"), the LCMAP_LOWERCASE flag always maps "Σ" to "σ", never to "ς".
By default, the function maps the lowercase "i" to the uppercase "I", even when the Locale parameter specifies Turkish or Azerbaijani. To override this behavior for Turkish or Azerbaijani, the application should specify LCMAP_LINGUISTIC_CASING. If this flag is specified with the appropriate locale, "ı" (lowercase dotless I) is the lowercase form of "I" (uppercase dotless I) and "i" (lowercase dotted I) is the lowercase form of "İ" (uppercase dotted I).
If the LCMAP_HIRAGANA flag is specified to map katakana characters to hiragana characters, and LCMAP_FULLWIDTH is not specified, LCMapString or LCMapStringEx only maps full-width characters to hiragana. In this case, any half-width katakana characters are placed as-is in the destination string, with no mapping to hiragana. The application must specify LCMAP_FULLWIDTH to map half-width katakana characters to hiragana. The reason for this restriction is that all hiragana characters are full-width characters.
If the application needs to strip characters from the source string, it can call the mapping function with the NORM_IGNORESYMBOLS and NORM_IGNORENONSPACE flags set, and all other flags cleared. If the application does this with a source string that is not null-terminated, it is possible for the function to return an empty string and not return an error.
Create Sort Keys
When the application specifies LCMAP_SORTKEY, LCMapString or LCMapStringEx generates a sort key, a binary array of byte values. The sort key is not a true string. Its values represent the sorting behavior of the source string but are not meaningful display values.
The function ignores the Arabic kashida during generation of a sort key. If an application calls the function to create a sort key for a string containing an Arabic kashida, the function creates no sort key value for the kashida.
The sort key can contain an odd number of bytes. The LCMAP_BYTEREV flag only reverses an even number of bytes. The last byte (odd-positioned) in the sort key is not reversed. If the terminating 0x00 byte is an odd-positioned byte, it remains the last byte in the sort key. If the terminating 0x00 byte is an even-positioned byte, it exchanges positions with the byte that precedes it.
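One reading of this behavior is that the flag swaps each adjacent pair of bytes and leaves an unpaired trailing byte in place. The following C sketch applies that reading to a plain buffer. It is an assumption made for illustration, not the documented implementation, and swap_byte_pairs is a hypothetical name.

```c
#include <stddef.h>

/* Swap bytes 0/1, 2/3, and so on. If len is odd, the final byte has no
   partner and is left where it is. */
void swap_byte_pairs(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

For a five-byte buffer 01 02 03 04 00, the result is 02 01 04 03 00, so the terminating 0x00 stays last. For a four-byte buffer 01 02 03 00, the result is 02 01 00 03, so the terminator trades places with the byte before it.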
When generating the sort key, the function treats the hyphen and apostrophe differently from other punctuation symbols, so that words such as "coop" and "co-op" stay together in a list. All punctuation symbols other than the hyphen and apostrophe sort before alphanumeric characters. The application can change this behavior by setting the SORT_STRINGSORT flag, as described in Sorting Functions.
When used in memcmp, the sort key produces the same order as when the source string is used in CompareString or CompareStringEx. The memcmp function should be used instead of strcmp, because the sort key can have embedded null bytes.
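The reason for preferring memcmp can be shown with two byte arrays that differ only after an embedded 0x00. The arrays below are made-up values chosen for illustration, not sort keys produced by LCMapString, and compare_sort_keys is a hypothetical helper that assumes the caller knows each key's length.

```c
#include <stddef.h>
#include <string.h>

/* Compare two sort keys of known length byte by byte. Returns -1, 0, or 1. */
int compare_sort_keys(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    int c = memcmp(a, b, n);
    if (c != 0)
        return c < 0 ? -1 : 1;
    /* Equal over the shared length: the shorter key sorts first. */
    return (alen > blen) - (alen < blen);
}

/* Made-up keys that differ only in the byte after an embedded 0x00. */
const unsigned char key_a[] = { 0x0E, 0x02, 0x00, 0x01 };
const unsigned char key_b[] = { 0x0E, 0x02, 0x00, 0x02 };
```

Here strcmp stops at the third byte and reports the two arrays as equal, while compare_sort_keys sees the final byte and orders key_a before key_b.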
Use Sort Versioning
A sorting table has two numbers that identify its version: the defined version and the NLS version. Both numbers are DWORD values, composed of a major value and a minor value. The first byte of a value is reserved, the next two bytes represent the major version, and the last byte represents the minor version. In hexadecimal terms, the pattern is 0xRRMMMMmm, where R equals Reserved, M equals major, and m equals minor. For example, a major version of 3 with a minor version of 4 is represented as 0x304.
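The 0xRRMMMMmm layout can be unpacked with shifts and masks. The helpers below are a minimal sketch with hypothetical names. They use uint32_t in place of the Windows DWORD type so that the code compiles without the Windows headers.

```c
#include <stdint.h>

/* Bits 8-23 hold the major version. */
uint32_t sort_version_major(uint32_t version)
{
    return (version >> 8) & 0xFFFFu;
}

/* Bits 0-7 hold the minor version. */
uint32_t sort_version_minor(uint32_t version)
{
    return version & 0xFFu;
}

/* Pack a major and minor value, leaving the reserved top byte as zero. */
uint32_t make_sort_version(uint32_t major, uint32_t minor)
{
    return ((major & 0xFFFFu) << 8) | (minor & 0xFFu);
}
```

For the value 0x304, sort_version_major returns 3 and sort_version_minor returns 4, matching the example in the text.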
The defined version identifies the repertoire of code points and is the same for all locales. The major version increments to indicate changes to existing code points. The minor version increments to indicate that code points have been added, but that no previously existing code points have been changed.
The NLS version is specific to a locale identifier or locale name, and tracks changes to code point weights for the affected locale. The major version increments when weights are changed for code points that were already sortable. The minor version increments when new code points are assigned weights, but all other previously sortable code point weights remain unchanged.
For a major version, one or more code points are changed so that the application must re-index all data for comparisons to be valid. For a minor version, nothing is moved but code points are added. For this type of version, the application only has to re-index strings with previously unsortable values.
The major version has been changed in Windows 8. Data created under earlier versions of Windows must be re-indexed.
Both the defined and NLS versions apply to sortable code points retrieved using the LCMapString or LCMapStringEx function with the LCMAP_SORTKEY flag, and also used by the CompareString, CompareStringEx, FindNLSString, and FindNLSStringEx functions. If one or more code points in a string are unsortable, then the IsNLSDefinedString function returns FALSE when that string is passed to it as a parameter.
Index the Database
For performance reasons, the application should follow this procedure when indexing the database.
To properly index the database
- For each function, store the NLS version, the sort keys of that version, and an indication of sortability for each indexed string.
- When the minor version increments, re-index previously unsortable strings. The strings affected in this update should be confined to the ones for which IsNLSDefinedString has previously returned FALSE.
- When the major version increments, re-index all strings because the updated weights might change the behavior of any string. Major version releases are very infrequent.
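The procedure above reduces to a small decision once the stored and current version numbers are known. The following C sketch assumes both values use the 0xRRMMMMmm layout and that the application obtained them elsewhere. The enum and function names are hypothetical, and uint32_t stands in for DWORD.

```c
#include <stdint.h>

typedef enum {
    REINDEX_NONE,             /* versions match: the index is still valid  */
    REINDEX_UNSORTABLE_ONLY,  /* minor change: redo previously unsortable  */
    REINDEX_ALL               /* major change: redo every string           */
} reindex_action;

/* Decide how much of an index must be rebuilt. Any difference in the major
   value is treated as requiring a full re-index, which is the conservative
   choice. */
reindex_action choose_reindex_action(uint32_t stored, uint32_t current)
{
    uint32_t stored_major  = (stored  >> 8) & 0xFFFFu;
    uint32_t current_major = (current >> 8) & 0xFFFFu;
    uint32_t stored_minor  = stored  & 0xFFu;
    uint32_t current_minor = current & 0xFFu;

    if (stored_major != current_major)
        return REINDEX_ALL;
    if (stored_minor != current_minor)
        return REINDEX_UNSORTABLE_ONLY;
    return REINDEX_NONE;
}
```

For example, moving from 0x304 to 0x305 selects REINDEX_UNSORTABLE_ONLY, so only the strings previously flagged as unsortable need new sort keys, while moving from 0x304 to 0x404 selects REINDEX_ALL.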
Database indexing problems can arise for the following reasons:
- A later operating system can define code points that are undefined for an earlier operating system, thus changing the sort.
- Code points can have different sorting weights in different operating systems, due to corrections in language support.
To minimize the necessity to re-index the database in these circumstances, the application can use IsNLSDefinedString to differentiate defined from undefined strings so that the application can reject strings with undefined code points. Use of GetNLSVersion or GetNLSVersionEx allows the application to determine if an NLS change affects the locale used for a particular index table. If the change has no effect on the locale, the application has no need to re-index the table.
The following table illustrates the effects of certain flags used with the sorting functions. In each case, the selection of flags determines whether two different characters are considered equal for sorting purposes.
|Character 1|Character 2|Default|NORM_IGNOREWIDTH|NORM_IGNOREKANATYPE|NORM_IGNOREWIDTH and NORM_IGNOREKANATYPE|
|---|---|---|---|---|---|
|U+3042 HIRAGANA LETTER A|U+30A2 KATAKANA LETTER A|Not equal|Not equal|Equal|Equal|
|U+FF75 HALFWIDTH KATAKANA LETTER O|U+30AA KATAKANA LETTER O|Not equal|Equal|Not equal|Equal|
|U+FF22 FULLWIDTH LATIN CAPITAL LETTER B|U+0042 LATIN CAPITAL LETTER B|Not equal|Equal|Not equal|Equal|