System.Char struct

Article
01/08/2024

This article provides supplementary remarks to the reference documentation for this API.

The Char structure represents Unicode code points by using UTF-16 encoding. The value of a Char object is its 16-bit numeric (ordinal) value.

If you aren't familiar with Unicode, scalar values, code points, surrogate pairs, UTF-16, and the Rune type, see Introduction to character encoding in .NET.

This article examines the relationship between a Char object and a character and discuss some common tasks performed with Char instances. We recommend that you consider the Rune type, introduced in .NET Core 3.0, as an alternative to Char for performing some of these tasks.

Char objects, Unicode characters, and strings

A String object is a sequential collection of Char structures that represents a string of text. Most Unicode characters can be represented by a single Char object, but a character that is encoded as a base character, surrogate pair, and/or combining character sequence is represented by multiple Char objects. For this reason, a Char structure in a String object is not necessarily equivalent to a single Unicode character.

Multiple 16-bit code units are used to represent single Unicode characters in the following cases:

Glyphs, which may consist of a single character or of a base character followed by one or more combining characters. For example, the character ä is represented by a Char object whose code unit is U+0061 followed by a Char object whose code unit is U+0308. (The character ä can also be defined by a single Char object that has a code unit of U+00E4.) The following example illustrates that the character ä consists of two Char objects.

using System;
using System.IO;

public class Example1
{
    public static void Main()
    {
        StreamWriter sw = new StreamWriter("chars1.txt");
        char[] chars = { '\u0061', '\u0308' };
        string strng = new String(chars);
        sw.WriteLine(strng);
        sw.Close();
    }
}
// The example produces the following output:
//       ä

open System
open System.IO

let sw = new StreamWriter("chars1.txt")
let chars = [| '\u0061'; '\u0308' |]
let string = String chars
sw.WriteLine string
sw.Close()

// The example produces the following output:
//       ä

Imports System.IO

Module Example2
    Public Sub Main()
        Dim sw As New StreamWriter("chars1.txt")
        Dim chars() As Char = {ChrW(&H61), ChrW(&H308)}
        Dim strng As New String(chars)
        sw.WriteLine(strng)
        sw.Close()
    End Sub
End Module
' The example produces the following output:
'       ä

Characters outside the Unicode Basic Multilingual Plane (BMP). Unicode supports sixteen planes in addition to the BMP, which represents plane 0. A Unicode code point is represented in UTF-32 by a 21-bit value that includes the plane. For example, U+1D160 represents the MUSICAL SYMBOL EIGHTH NOTE character. Because UTF-16 encoding has only 16 bits, characters outside the BMP are represented by surrogate pairs in UTF-16. The following example illustrates that the UTF-32 equivalent of U+1D160, the MUSICAL SYMBOL EIGHTH NOTE character, is U+D834 U+DD60. U+D834 is the high surrogate; high surrogates range from U+D800 through U+DBFF. U+DD60 is the low surrogate; low surrogates range from U+DC00 through U+DFFF.

using System;
using System.IO;

public class Example3
{
    public static void Main()
    {
        StreamWriter sw = new StreamWriter(@".\chars2.txt");
        int utf32 = 0x1D160;
        string surrogate = Char.ConvertFromUtf32(utf32);
        sw.WriteLine("U+{0:X6} UTF-32 = {1} ({2}) UTF-16",
                     utf32, surrogate, ShowCodePoints(surrogate));
        sw.Close();
    }

    private static string ShowCodePoints(string value)
    {
        string retval = null;
        foreach (var ch in value)
            retval += String.Format("U+{0:X4} ", Convert.ToUInt16(ch));

        return retval.Trim();
    }
}
// The example produces the following output:
//       U+01D160 UTF-32 = ð (U+D834 U+DD60) UTF-16

open System
open System.IO

let showCodePoints (value: char seq) =
    let str =
        value
        |> Seq.map (fun ch -> $"U+{Convert.ToUInt16 ch:X4}")
        |> String.concat ""
    str.Trim()

let sw = new StreamWriter(@".\chars2.txt")
let utf32 = 0x1D160
let surrogate = Char.ConvertFromUtf32 utf32
sw.WriteLine $"U+{utf32:X6} UTF-32 = {surrogate} ({showCodePoints surrogate}) UTF-16"
sw.Close()

// The example produces the following output:
//       U+01D160 UTF-32 = ð (U+D834 U+DD60) UTF-16

Imports System.IO

Module Example4
    Public Sub Main()
        Dim sw As New StreamWriter(".\chars2.txt")
        Dim utf32 As Integer = &H1D160
        Dim surrogate As String = Char.ConvertFromUtf32(utf32)
        sw.WriteLine("U+{0:X6} UTF-32 = {1} ({2}) UTF-16",
                   utf32, surrogate, ShowCodePoints(surrogate))
        sw.Close()
    End Sub

    Private Function ShowCodePoints(value As String) As String
        Dim retval As String = Nothing
        For Each ch In value
            retval += String.Format("U+{0:X4} ", Convert.ToUInt16(ch))
        Next
        Return retval.Trim()
    End Function
End Module
' The example produces the following output:
'       U+01D160 UTF-32 = ð (U+D834 U+DD60) UTF-16

Characters and character categories

Each Unicode character or valid surrogate pair belongs to a Unicode category. In .NET, Unicode categories are represented by members of the UnicodeCategory enumeration and include values such as UnicodeCategory.CurrencySymbol, UnicodeCategory.LowercaseLetter, and UnicodeCategory.SpaceSeparator, for example.

To determine the Unicode category of a character, call the GetUnicodeCategory method. For example, the following example calls the GetUnicodeCategory to display the Unicode category of each character in a string. The example works correctly only if there are no surrogate pairs in the String instance.

using System;
using System.Globalization;

class Example
{
   public static void Main()
   {
      // Define a string with a variety of character categories.
      String s = "The red car drove down the long, narrow, secluded road.";
      // Determine the category of each character.
      foreach (var ch in s)
         Console.WriteLine("'{0}': {1}", ch, Char.GetUnicodeCategory(ch));
   }
}
// The example displays the following output:
//      'T': UppercaseLetter
//      'h': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'r': LowercaseLetter
//      'e': LowercaseLetter
//      'd': LowercaseLetter
//      ' ': SpaceSeparator
//      'c': LowercaseLetter
//      'a': LowercaseLetter
//      'r': LowercaseLetter
//      ' ': SpaceSeparator
//      'd': LowercaseLetter
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'v': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'd': LowercaseLetter
//      'o': LowercaseLetter
//      'w': LowercaseLetter
//      'n': LowercaseLetter
//      ' ': SpaceSeparator
//      't': LowercaseLetter
//      'h': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'l': LowercaseLetter
//      'o': LowercaseLetter
//      'n': LowercaseLetter
//      'g': LowercaseLetter
//      ',': OtherPunctuation
//      ' ': SpaceSeparator
//      'n': LowercaseLetter
//      'a': LowercaseLetter
//      'r': LowercaseLetter
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'w': LowercaseLetter
//      ',': OtherPunctuation
//      ' ': SpaceSeparator
//      's': LowercaseLetter
//      'e': LowercaseLetter
//      'c': LowercaseLetter
//      'l': LowercaseLetter
//      'u': LowercaseLetter
//      'd': LowercaseLetter
//      'e': LowercaseLetter
//      'd': LowercaseLetter
//      ' ': SpaceSeparator
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'a': LowercaseLetter
//      'd': LowercaseLetter
//      '.': OtherPunctuation

open System

// Define a string with a variety of character categories.
let s = "The red car drove down the long, narrow, secluded road."
// Determine the category of each character.
for ch in s do
    printfn $"'{ch}': {Char.GetUnicodeCategory ch}"

// The example displays the following output:
//      'T': UppercaseLetter
//      'h': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'r': LowercaseLetter
//      'e': LowercaseLetter
//      'd': LowercaseLetter
//      ' ': SpaceSeparator
//      'c': LowercaseLetter
//      'a': LowercaseLetter
//      'r': LowercaseLetter
//      ' ': SpaceSeparator
//      'd': LowercaseLetter
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'v': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'd': LowercaseLetter
//      'o': LowercaseLetter
//      'w': LowercaseLetter
//      'n': LowercaseLetter
//      ' ': SpaceSeparator
//      't': LowercaseLetter
//      'h': LowercaseLetter
//      'e': LowercaseLetter
//      ' ': SpaceSeparator
//      'l': LowercaseLetter
//      'o': LowercaseLetter
//      'n': LowercaseLetter
//      'g': LowercaseLetter
//      ',': OtherPunctuation
//      ' ': SpaceSeparator
//      'n': LowercaseLetter
//      'a': LowercaseLetter
//      'r': LowercaseLetter
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'w': LowercaseLetter
//      ',': OtherPunctuation
//      ' ': SpaceSeparator
//      's': LowercaseLetter
//      'e': LowercaseLetter
//      'c': LowercaseLetter
//      'l': LowercaseLetter
//      'u': LowercaseLetter
//      'd': LowercaseLetter
//      'e': LowercaseLetter
//      'd': LowercaseLetter
//      ' ': SpaceSeparator
//      'r': LowercaseLetter
//      'o': LowercaseLetter
//      'a': LowercaseLetter
//      'd': LowercaseLetter
//      '.': OtherPunctuation

Imports System.Globalization

Module Example1
    Public Sub Main()
        ' Define a string with a variety of character categories.
        Dim s As String = "The car drove down the narrow, secluded road."
        ' Determine the category of each character.
        For Each ch In s
            Console.WriteLine("'{0}': {1}", ch, Char.GetUnicodeCategory(ch))
        Next
    End Sub
End Module
' The example displays the following output:
'       'T': UppercaseLetter
'       'h': LowercaseLetter
'       'e': LowercaseLetter
'       ' ': SpaceSeparator
'       'r': LowercaseLetter
'       'e': LowercaseLetter
'       'd': LowercaseLetter
'       ' ': SpaceSeparator
'       'c': LowercaseLetter
'       'a': LowercaseLetter
'       'r': LowercaseLetter
'       ' ': SpaceSeparator
'       'd': LowercaseLetter
'       'r': LowercaseLetter
'       'o': LowercaseLetter
'       'v': LowercaseLetter
'       'e': LowercaseLetter
'       ' ': SpaceSeparator
'       'd': LowercaseLetter
'       'o': LowercaseLetter
'       'w': LowercaseLetter
'       'n': LowercaseLetter
'       ' ': SpaceSeparator
'       't': LowercaseLetter
'       'h': LowercaseLetter
'       'e': LowercaseLetter
'       ' ': SpaceSeparator
'       'l': LowercaseLetter
'       'o': LowercaseLetter
'       'n': LowercaseLetter
'       'g': LowercaseLetter
'       ',': OtherPunctuation
'       ' ': SpaceSeparator
'       'n': LowercaseLetter
'       'a': LowercaseLetter
'       'r': LowercaseLetter
'       'r': LowercaseLetter
'       'o': LowercaseLetter
'       'w': LowercaseLetter
'       ',': OtherPunctuation
'       ' ': SpaceSeparator
'       's': LowercaseLetter
'       'e': LowercaseLetter
'       'c': LowercaseLetter
'       'l': LowercaseLetter
'       'u': LowercaseLetter
'       'd': LowercaseLetter
'       'e': LowercaseLetter
'       'd': LowercaseLetter
'       ' ': SpaceSeparator
'       'r': LowercaseLetter
'       'o': LowercaseLetter
'       'a': LowercaseLetter
'       'd': LowercaseLetter
'       '.': OtherPunctuation

Internally, for characters outside the ASCII range (U+0000 through U+00FF), the GetUnicodeCategory method depends on Unicode categories reported by the CharUnicodeInfo class. Starting with .NET Framework 4.6.2, Unicode characters are classified based on The Unicode Standard, Version 8.0.0. In versions of .NET Framework from .NET Framework 4 to .NET Framework 4.6.1, they are classified based on The Unicode Standard, Version 6.3.0.

Characters and text elements

Because a single character can be represented by multiple Char objects, it is not always meaningful to work with individual Char objects. For instance, the following example converts the Unicode code points that represent the Aegean numbers zero through 9 to UTF-16 encoded code units. Because it erroneously equates Char objects with characters, it inaccurately reports that the resulting string has 20 characters.

using System;

public class Example5
{
    public static void Main()
    {
        string result = String.Empty;
        for (int ctr = 0x10107; ctr <= 0x10110; ctr++)  // Range of Aegean numbers.
            result += Char.ConvertFromUtf32(ctr);

        Console.WriteLine("The string contains {0} characters.", result.Length);
    }
}
// The example displays the following output:
//     The string contains 20 characters.

open System

let result =
    [ for i in 0x10107..0x10110 do  // Range of Aegean numbers.
        Char.ConvertFromUtf32 i ]
    |> String.concat ""

printfn $"The string contains {result.Length} characters."


// The example displays the following output:
//     The string contains 20 characters.

Module Example5
    Public Sub Main()
        Dim result As String = String.Empty
        For ctr As Integer = &H10107 To &H10110     ' Range of Aegean numbers.
            result += Char.ConvertFromUtf32(ctr)
        Next
        Console.WriteLine("The string contains {0} characters.", result.Length)
    End Sub
End Module
' The example displays the following output:
'     The string contains 20 characters.

You can do the following to avoid the assumption that a Char object represents a single character:

You can work with a String object in its entirety instead of working with its individual characters to represent and analyze linguistic content.

You can use String.EnumerateRunes as shown in the following example:

int CountLetters(string s)
{
    int letterCount = 0;

    foreach (Rune rune in s.EnumerateRunes())
    {
        if (Rune.IsLetter(rune))
        { letterCount++; }
    }

    return letterCount;
}

let countLetters (s: string) =
    let mutable letterCount = 0

    for rune in s.EnumerateRunes() do
        if Rune.IsLetter rune then
            letterCount <- letterCount + 1

    letterCount

You can use the StringInfo class to work with text elements instead of individual Char objects. The following example uses the StringInfo object to count the number of text elements in a string that consists of the Aegean numbers zero through nine. Because it considers a surrogate pair a single character, it correctly reports that the string contains ten characters.

using System;
using System.Globalization;

public class Example4
{
    public static void Main()
    {
        string result = String.Empty;
        for (int ctr = 0x10107; ctr <= 0x10110; ctr++)  // Range of Aegean numbers.
            result += Char.ConvertFromUtf32(ctr);

        StringInfo si = new StringInfo(result);
        Console.WriteLine("The string contains {0} characters.",
                          si.LengthInTextElements);
    }
}
// The example displays the following output:
//       The string contains 10 characters.

open System
open System.Globalization

let result =
    [ for i in 0x10107..0x10110 do  // Range of Aegean numbers.
        Char.ConvertFromUtf32 i ]
    |> String.concat ""


let si = StringInfo result
printfn $"The string contains {si.LengthInTextElements} characters."

// The example displays the following output:
//       The string contains 10 characters.

Imports System.Globalization

Module Example6
    Public Sub Main()
        Dim result As String = String.Empty
        For ctr As Integer = &H10107 To &H10110     ' Range of Aegean numbers.
            result += Char.ConvertFromUtf32(ctr)
        Next
        Dim si As New StringInfo(result)
        Console.WriteLine("The string contains {0} characters.", si.LengthInTextElements)
    End Sub
End Module
' The example displays the following output:
'       The string contains 10 characters.

If a string contains a base character that has one or more combining characters, you can call the String.Normalize method to convert the substring to a single UTF-16 encoded code unit. The following example calls the String.Normalize method to convert the base character U+0061 (LATIN SMALL LETTER A) and combining character U+0308 (COMBINING DIAERESIS) to U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).

using System;

public class Example2
{
    public static void Main()
    {
        string combining = "\u0061\u0308";
        ShowString(combining);

        string normalized = combining.Normalize();
        ShowString(normalized);
    }

    private static void ShowString(string s)
    {
        Console.Write("Length of string: {0} (", s.Length);
        for (int ctr = 0; ctr < s.Length; ctr++)
        {
            Console.Write("U+{0:X4}", Convert.ToUInt16(s[ctr]));
            if (ctr != s.Length - 1) Console.Write(" ");
        }
        Console.WriteLine(")\n");
    }
}
// The example displays the following output:
//       Length of string: 2 (U+0061 U+0308)
//
//       Length of string: 1 (U+00E4)

open System

let showString (s: string) =
    printf $"Length of string: {s.Length} ("
    for i = 0 to s.Length - 1 do
        printf $"U+{Convert.ToUInt16 s[i]:X4}"
        if i <> s.Length - 1 then printf " "
    printfn ")\n"

let combining = "\u0061\u0308"
showString combining

let normalized = combining.Normalize()
showString normalized

// The example displays the following output:
//       Length of string: 2 (U+0061 U+0308)
//
//       Length of string: 1 (U+00E4)

Module Example3
    Public Sub Main()
        Dim combining As String = ChrW(&H61) + ChrW(&H308)
        ShowString(combining)

        Dim normalized As String = combining.Normalize()
        ShowString(normalized)
    End Sub

    Private Sub ShowString(s As String)
        Console.Write("Length of string: {0} (", s.Length)
        For ctr As Integer = 0 To s.Length - 1
            Console.Write("U+{0:X4}", Convert.ToUInt16(s(ctr)))
            If ctr <> s.Length - 1 Then Console.Write(" ")
        Next
        Console.WriteLine(")")
        Console.WriteLine()
    End Sub
End Module
' The example displays the following output:
'       Length of string: 2 (U+0061 U+0308)
'       
'       Length of string: 1 (U+00E4)

Common operations

The Char structure provides methods to compare Char objects, convert the value of the current Char object to an object of another type, and determine the Unicode category of a Char object:

To do this	Use these `System.Char` methods
Compare Char objects	CompareTo and Equals
Convert a code point to a string	ConvertFromUtf32 See also the Rune type.
Convert a Char object or a surrogate pair of Char objects to a code point	For a single character: Convert.ToInt32(Char) For a surrogate pair or a character in a string: Char.ConvertToUtf32 See also the Rune type.
Get the Unicode category of a character	GetUnicodeCategory See also Rune.GetUnicodeCategory.
Determine whether a character is in a particular Unicode category such as digit, letter, punctuation, control character, and so on	IsControl, IsDigit, IsHighSurrogate, IsLetter, IsLetterOrDigit, IsLower, IsLowSurrogate, IsNumber, IsPunctuation, IsSeparator, IsSurrogate, IsSurrogatePair, IsSymbol, IsUpper, and IsWhiteSpace See also corresponding methods on the Rune type.
Convert a Char object that represents a number to a numeric value type	GetNumericValue See also Rune.GetNumericValue.
Convert a character in a string into a Char object	Parse and TryParse
Convert a Char object to a String object	ToString
Change the case of a Char object	ToLower, ToLowerInvariant, ToUpper, and ToUpperInvariant See also corresponding methods on the Rune type.

Char values and interop

When a managed Char type, which is represented as a Unicode UTF-16 encoded code unit, is passed to unmanaged code, the interop marshaller converts the character set to ANSI by default. You can apply the DllImportAttribute attribute to platform invoke declarations and the StructLayoutAttribute attribute to a COM interop declaration to control which character set a marshaled Char type uses.

Share via