Globalization breaking changes

The following breaking changes are documented on this page:

Breaking change Version introduced
Globalization APIs use ICU libraries on Windows 5.0
StringInfo and TextElementEnumerator are now UAX29-compliant 5.0
"C" locale maps to the invariant locale 3.0

.NET 5.0

Globalization APIs use ICU libraries on Windows

.NET 5.0 and later versions use International Components for Unicode (ICU) libraries for globalization functionality when running on Windows 10 May 2019 Update or later.

Change description

Previously, .NET libraries used National Language Support (NLS) APIs for globalization functionality. For example, NLS functions were used to get culture data, such as date and time format patterns, compare strings, and perform string casing in the appropriate culture.

Starting in .NET 5.0, if an app is running on Windows 10 May 2019 Update or later, .NET libraries use ICU globalization APIs. Windows 10 May 2019 Update and later versions ship with the ICU native library. If the .NET runtime can't load ICU, it uses NLS instead.

This change was introduced for two reasons:

  • Apps have the same globalization behavior across platforms, including Linux, macOS, and Windows.
  • Apps can control globalization behavior by using custom ICU libraries.

Version introduced

.NET 5.0 Preview 4

No action is required on the part of the developer. However, if you wish to continue using NLS globalization APIs, you can set a run-time switch to revert to that behavior. For more information about the available switches, see the .NET globalization and ICU article.

Category

Globalization

Affected APIs


StringInfo and TextElementEnumerator are now UAX29-compliant

Prior to this change, System.Globalization.StringInfo and System.Globalization.TextElementEnumerator didn't properly handle all grapheme clusters. Some graphemes were split into their constituent components instead of being kept together. Now, StringInfo and TextElementEnumerator process grapheme clusters according to the latest version of the Unicode Standard.

In addition, the Microsoft.VisualBasic.Strings.StrReverse method, which reverses the characters in a string in Visual Basic, now also follows the Unicode standard for grapheme clusters.

Change description

A grapheme or extended grapheme cluster is a single user-perceived character that may be made up of multiple Unicode code points. For example, the string containing the Thai character "kam" (กำ) consists of the following two characters:

  • ก (= '\u0e01') THAI CHARACTER KO KAI
  • ำ (= '\u0e33') THAI CHARACTER SARA AM

When displayed to the user, the operating system combines the two characters to form the single display character (or grapheme) "kam" or กำ. Emoji can also consist of multiple characters that are combined for display in a similar way.

Tip

The .NET documentation sometimes uses the term "text element" when referring to a grapheme.

The StringInfo and TextElementEnumerator classes inspect strings and return information about the graphemes they contain. In .NET Framework (all versions) and .NET Core 3.x and earlier, these two classes use custom logic that handles some combining classes but doesn't fully comply with the Unicode Standard. For example, the StringInfo and TextElementEnumerator classes incorrectly split the single Thai character "kam" back into its constituent components instead of keeping them together. These classes also incorrectly split the emoji character "🤷🏽‍♀️" into four clusters (person shrugging, skin tone modifier, gender modifier, and an invisible combiner) instead of keeping them together as a single grapheme cluster.

Starting with .NET 5, the StringInfo and TextElementEnumerator classes implement the Unicode standard as defined by Unicode Standard Annex #29, rev. 35, sec. 3. In particular, they now return extended grapheme clusters for all combining classes.

Consider the following C# code:

using System.Globalization;

static void Main(string[] args)
{
    PrintGraphemes("กำ");
    PrintGraphemes("🤷🏽‍♀️");
}

static void PrintGraphemes(string str)
{
    Console.WriteLine($"Printing graphemes of \"{str}\"...");
    int i = 0;

    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
    while (enumerator.MoveNext())
    {
        Console.WriteLine($"Grapheme {++i}: \"{enumerator.Current}\"");
    }

    Console.WriteLine($"({i} grapheme(s) total.)");
    Console.WriteLine();
}

In .NET Framework and .NET Core 3.x and earlier versions, the graphemes are split up, and the console output is as follows:

Printing graphemes of "กำ"...
Grapheme 1: "ก"
Grapheme 2: "ำ"
(2 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷"
Grapheme 2: "🏽"
Grapheme 3: "‍"
Grapheme 4: "♀️"
(4 grapheme(s) total.)

In .NET 5 and later versions, the graphemes are kept together, and the console output is as follows:

Printing graphemes of "กำ"...
Grapheme 1: "กำ"
(1 grapheme(s) total.)

Printing graphemes of "🤷🏽‍♀️"...
Grapheme 1: "🤷🏽‍♀️"
(1 grapheme(s) total.)

In addition, starting in .NET 5, the Microsoft.VisualBasic.Strings.StrReverse method, which reverses the characters in a string in Visual Basic, now also follows the Unicode standard for grapheme clusters.

These changes are part of a wider set of Unicode and UTF-8 improvements in .NET, including an extended grapheme cluster enumeration API to complement the Unicode scalar-value enumeration APIs that were introduced with the System.Text.Rune type in .NET Core 3.0.

Version introduced

.NET 5.0 Preview 1

You don't need to take any action. Your apps will automatically behave in a more standards-compliant manner in a variety of globalization-related scenarios.

Category

Globalization

Affected APIs


.NET Core 3.0

"C" locale maps to the invariant locale

.NET Core 2.2 and earlier versions depend on the default ICU behavior, which maps the "C" locale to the en_US_POSIX locale. The en_US_POSIX locale has an undesirable collation behavior, because it doesn't support case-insensitive string comparisons. Because some Linux distributions set the "C" locale as the default locale, users were experiencing unexpected behavior.

Change description

Starting with .NET Core 3.0, the "C" locale mapping has changed to use the Invariant locale instead of en_US_POSIX. The "C" locale to Invariant mapping is also applied to Windows for consistency.

Mapping "C" to en_US_POSIX culture caused customer confusion, because en_US_POSIX doesn't support case insensitive sorting/searching string operations. Because the "C" locale is used as a default locale in some of the Linux distros, customers experienced this undesired behavior on these operating systems.

Version introduced

3.0

Nothing specific more than the awareness of this change. This change affects only applications that use the "C" locale mapping.

Category

Globalization

Affected APIs

All collation and culture APIs are affected by this change.