Dr. GUI .NET #5

 

May 31, 2002

Contents

Where We've Been; Where We're Going
Strings
The System.String Class
Strings in C#
Strings in Visual Basic .NET
Strings with Standard .NET Framework APIs
StringBuilder
System.Text Encoders and Decoders
Regular Expressions
Give It a Shot!
What We've Done; What's Next

Where We've Been; Where We're Going

Last time we discussed the mother and father of all .NET classes: the venerable System.Object. We also discussed memory allocation and garbage collection, and a bit about how the dynamic type system works in the .NET Framework.

This time we're going to talk about strings in the .NET runtime.

Strings

String (or, more specifically, System.String) is the type of strings, including string literals, in the .NET Framework. In the .NET Framework, a string holds zero or more 16-bit (two-byte) Unicode characters.

Strings are immutable

One of the hard things to get used to in the .NET Framework is that String objects are immutable, meaning once they're created their values cannot be changed. (However, you can reassign the string reference to refer to another string, freeing up the first string for garbage collection if no other references to it exist.)

The methods of String that appear to manipulate the string do not change the current string; instead, they create a new string and return it. Even changing, inserting, or deleting a single character causes a new string to be created and the old one to be thrown away.

Note that repeatedly creating and throwing away strings can be slow. But making strings immutable has a number of advantages: ownership, aliasing, and threading issues are all much simpler with immutable objects. For instance, strings are always safe for multithreaded programming, because a string cannot be modified, so no thread can disrupt another thread by changing one.
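As a rough illustration of the point above, here is a minimal C# sketch. The class name ImmutabilityDemo and the helper MakeLoud are made up for this example; the only Framework calls assumed are String.ToUpper and Object.ReferenceEquals.

```csharp
using System;

public class ImmutabilityDemo
{
    // Hypothetical helper: returns an upper-cased copy of its argument.
    // The string passed in is never modified.
    public static string MakeLoud(string s)
    {
        return s.ToUpper();
    }

    public static void Main()
    {
        string original = "hello";
        string alias = original;            // second reference, same object
        string loud = MakeLoud(original);   // a brand-new string

        Console.WriteLine(original);        // hello
        Console.WriteLine(loud);            // HELLO
        // The alias still refers to the untouched original.
        Console.WriteLine(Object.ReferenceEquals(original, alias)); // True
    }
}
```

Because the original can never change, handing the same reference to other code (or other threads) is safe.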

StringBuilder to the rescue

Enter StringBuilder. A StringBuilder object is not a string, but is used to manipulate strings of 16-bit Unicode characters. It contains a buffer, typically initialized with a string (but usually larger than that string). The characters in the buffer can be manipulated in place without creating a new buffer—you can insert, append, remove, and replace characters. (Of course, a new buffer is created if you insert enough characters to overflow the original buffer.) When you're done manipulating the characters, use StringBuilder's ToString method to extract the finished string from it. In most cases, StringBuilder.ToString operates without causing a copy operation, so it's very efficient.

Indexing Strings and StringBuilders

Since both of these types contain Unicode characters of type Char, they support an indexer called Chars that returns the single Char at the index you specify. If you're using C#, you can access this indexer like an array subscript.

Because String is immutable, its indexer is read only. Of course, the StringBuilder indexer is read/write, so you can both look at and change characters.
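A short C# sketch of the two indexers side by side may help. ReplaceCharAt is a hypothetical helper written for this illustration, not a Framework method.

```csharp
using System;
using System.Text;

public class IndexerDemo
{
    // Hypothetical helper: replaces the character at a zero-based index
    // by editing a StringBuilder in place, then extracting the result.
    public static string ReplaceCharAt(string s, int index, char c)
    {
        StringBuilder sb = new StringBuilder(s);
        sb[index] = c;          // StringBuilder indexer is read/write
        return sb.ToString();
    }

    public static void Main()
    {
        string s = "cat";
        char first = s[0];      // String indexer is read-only
        // s[0] = 'b';          // would not compile
        Console.WriteLine(first);                       // c
        Console.WriteLine(ReplaceCharAt(s, 0, 'b'));    // bat
    }
}
```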

The System.String Class

System.String is a sealed class, meaning you cannot inherit from it (and also meaning the system can do certain optimizations to make string processing faster).

Most .NET languages will have some built-in support for strings. For instance, they typically support string literals and operations such as concatenation. But the syntax for that will vary from language to language.

Strings in C#

String Literals in C#

C# supports two types of string literals: "regular" string literals and "verbatim" string literals.

Regular string literals are similar to string literals in C and C++: They're delimited by quotation marks and can contain escape sequences to represent various control characters and any Unicode character. They may not extend to a new line, but adjacent string literals can be concatenated with the + operator at compile time. The escape sequences all begin with a backslash ("\") and are listed in the following table.

Table 1. C# string literal escape sequences

Escape Sequence | Description
\t | Tab (Unicode 0x0009).
\r | Carriage return (0x000d).
\n | Newline (line feed) (0x000a).
\v | Vertical tab (0x000b).
\a | Alert (0x0007).
\b | Backspace (0x0008).
\f | Form feed (0x000c).
\0 | Null (0x0000).
\\ | Backslash (0x005c).
\' | Single quote (0x0027).
\" | Double quote (0x0022).
\xD | Hexadecimal character code with a variable number of digits (one to four).
\uABCD | Unicode character 0xABCD, where A, B, C, and D are valid hexadecimal digits (0-9, a-f, A-F). The lowercase \u form takes exactly four digits; the uppercase \U form takes exactly eight, as in \U0000ABCD.

All of these escape sequences can be used in any regular string literal or in any character literal (such as '\t'). In addition, the Unicode escape sequences can be used in identifiers, so "if (a\u0066b == true)" is the same as "if (afb == true)" (since Unicode 0x0066 is "f"). This allows you to write identifiers using any valid Unicode character, even if that character doesn't exist on your keyboard or can't be displayed in the font your editor is using.
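To see a few of these escapes in action, consider the following small C# sketch. The class and method names are invented for illustration.

```csharp
using System;

public class EscapeDemo
{
    // Returns a string built only from escape sequences; it spells "Af".
    public static string FromEscapes()
    {
        return "\x41\u0066";    // \x41 is 'A', \u0066 is 'f'
    }

    public static void Main()
    {
        char tab = '\t';                        // escapes work in char literals too
        Console.WriteLine((int)tab);            // 9
        Console.WriteLine(FromEscapes());       // Af
        Console.WriteLine("a\\b".Length);       // 3: a, backslash, b
    }
}
```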

Verbatim string literals begin with @" and end with the matching quote. They do not have escape sequences. So

@"\\machine\share\path1\path2\file.ext"

is the same as

"\\\\machine\\share\\path1\\path2\\file.ext"

but is much simpler and less error-prone. Verbatim string literals also can extend over a line break. If they do, they include any white space characters between the quotes:

@"First \t line
   tabbed second line"
// same as "First \\t line\r\n\ttabbed second line"

Note: Whether or not you need the "\r" in the second string depends on your editor. Microsoft® Visual Studio® .NET uses a carriage return/line feed pair at the end of each line, so you need both when you create the file using the Visual Studio .NET editor. (Other editors may end lines with just a newline character: ASCII code 10, or line feed.)

The only exception to the "no escape sequence" rule for verbatim string literals is that you can put a double quotation mark inside a verbatim string by doubling it:

@"This is a ""quote"" from me."
// same as "This is a \"quote\" from me."

String Operations in C#

C# supports the following string operations in the language:

  • Indexing a string to read (but not write) the individual characters, as in s[i].
  • Concatenating two strings with the + operator, as in s + t. One of s or t can be a type other than string—if one is, it will be converted to string by calling ToString on it. The + operator is implemented as part of the C# language, not as a member of the String class. If either operand is null (as opposed to an empty string, ""), it will be converted to an empty string. If the operation can be performed at compile time, it will be.
  • Equality and inequality, as in s == t and s != t. These operators are part of the String class, so they can be used from any language that supports overloaded operators. They can even be used by languages that don't support overloaded operators, such as Microsoft® Visual Basic® .NET, by using the member names op_Equality and op_Inequality rather than the operator syntax. These operators call String.Equals, which does a binary comparison of the characters in the two strings, not a culture-aware comparison.

Concatenation will be done at compile time, if possible. You can also concatenate any object to a string—doing so concatenates the return value from the object's ToString method. For instance, all of the following are valid:

string s0 = " Hello ";
string s1 = s0 + 5;   // " Hello 5"
string s2 = 6.3 + s1;   // "6.3 Hello 5"
string ret = String.Format("s0: {0}\ns1: {1}\ns2: {2}", s0, s1, s2);

In addition, because strings are immutable, the compiler and runtime will merge duplicate string literals so that there's only one copy of each literal in your program.

Strings in Visual Basic .NET

String Literals in Visual Basic .NET

Visual Basic .NET string literals are very simple: They consist of a double quotation mark followed by a set of Unicode characters followed by a closing double quotation mark. To include a double quotation mark within your string, double them, as in:

"This prints ""Hello, world!"" on the screen"

There is no escape sequence mechanism for Visual Basic .NET strings or character literals except for the doubling of the double quotation mark. (This is similar to the C# verbatim string literals, except that Visual Basic .NET string literals cannot include end-of-line characters; they must be contained on a single line.) If you need to include characters you can't represent in a Visual Basic .NET program, such as a line-feed character, you need to use the Chr or ChrW function to create the character from its code, and then concatenate it to the string. Note that using a carriage return (Chr(13)) alone would cause the two lines to be overwritten on the screen, since carriage return doesn't advance to a new line. You can use the Environment.NewLine property if you want both a carriage return and a line feed:

"First line" & Chr(10) & "Second line" ' LF only
"First line" & Environment.NewLine & "Second line" ' CR/LF

String Operations in Visual Basic .NET

Visual Basic .NET has all of C#'s string operators plus a few that C# doesn't. Visual Basic .NET's operators include:

  • Concatenation of a String with a String only using the + operator, as in s + t. This operator will do string concatenation only if both operands are strings. If either operand is anything other than a string, Visual Basic .NET will attempt to convert the string to Double and do numeric addition. Because of this, it's better to use the & operator to do concatenation, as described below.
  • Concatenation of any two objects using the & operator, as in s & "There". If the operands of the & operator are not strings, they're converted to strings. That means you could even concatenate the string representations of two numbers, x and y, for instance, with syntax like x & y. This operator is preferred for concatenation because it's unambiguous—it always does concatenation.
  • A complete set of comparisons: =, <>, >, <, >=, <=, which do either binary or text (culture-sensitive) comparison, depending on compilation option and the Option Compare statement, if any. StrComp also does string comparisons, by either using Option Compare to set the type of comparison or by passing a parameter that indicates which comparison method to use.
  • The Like operator allows you to determine if a string matches a pattern described by a simplified regular expression language. The regular expression language supported by System.Text.RegularExpressions.Regex has more functionality, so you may want to use that API instead.
  • Visual Basic .NET doesn't have an indexer to allow you to access individual characters of strings using array subscript notation, but you can use the Chars property to get individual characters of strings and get/set individual characters of StringBuilders.

String Functions and Statements Only Found in Visual Basic .NET

In addition to the operators provided by the Visual Basic .NET language, there's a group of functions and statements for manipulating strings that are unique to Visual Basic .NET, and they're not in the standard .NET Framework library.

If you're a Visual Basic .NET programmer, it's your choice whether to use the Visual Basic-only versions or the standard .NET Framework versions. (There are a couple of areas where Visual Basic provides functionality not provided by the Framework, however.) Dr. GUI recommends that you strongly consider sticking with the standard .NET Framework APIs where it's convenient to do so. Doing so will help you develop your skills in ways that can be used with other languages later on.

On the other hand, the Visual Basic-only versions can be extremely handy when you're porting older Visual Basic code to Visual Basic .NET, since the syntax and behavior of the Visual Basic-only versions is similar to that in older versions of Visual Basic.

Substrings

If you've programmed in BASIC at all, you'll remember the Left, Right, and Mid functions for extracting characters from the left, right, and middle of a string. You might even remember that Mid can be used on the left-hand side of an assignment operator to change characters in the middle of a string!

All of this functionality is in Visual Basic .NET, just as you remember it. One very important point: The index used in Mid for the starting character is one-based, not zero-based as with everything else in .NET. (That's because Mid has been one-based since its beginning in BASIC as MID$.)

This is a serious enough problem that Dr. GUI recommends never using Mid in new code. Rather, reserve its use for code you're porting that already works—and get ready to pull your hair out when debugging; going back and forth between one-based and zero-based counting will drive you crazy.

The String.Substring method provides overloads that are a great replacement for Left and the Mid function. For Left, just use str.Substring(0, len); for Mid, use str.Substring(start - 1, len), subtracting one from the starting point because Substring bases the starting point at zero and Mid bases it at one.

Right is a little harder, since you have to calculate the starting position. For instance:

Dim s As String = "Hello, world!"
Dim subLen As Integer = 6 ' number of characters to take from the right
Dim t1 As String = s.Substring(s.Length - subLen)
      ' same as below
Dim t2 As String = Right(s, subLen)
Dim ret As String = String.Format("s0: {0}\nt1: {1}\nt2: {2}", s, t1, t2)
ret = ret.Replace("\n", Environment.NewLine)

Note the little trick Dr. GUI did here. He made up for the fact that Visual Basic .NET doesn't have an escape sequence for the newline character by including the escape sequence ("\n") anyway, then replacing every occurrence of it with the two-character Environment.NewLine string ("\r\n") just before returning the string. If you want the output string to be exactly the same as in the C# version, you should replace "\n" with Chr(10) instead of Environment.NewLine.
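For readers working in C#, the same replacements can be sketched with Substring alone. The helper names Left, Mid, and Right below are hypothetical wrappers written for this illustration, not Framework methods. This sketch does no range checking: unlike the Visual Basic functions, Substring throws an exception if the requested range runs past the end of the string.

```csharp
using System;

public class SubstringDemo
{
    public static string Left(string s, int len)
    {
        return s.Substring(0, len);
    }

    // start is ONE-based, to mirror Visual Basic's Mid.
    public static string Mid(string s, int start, int len)
    {
        return s.Substring(start - 1, len);
    }

    public static string Right(string s, int len)
    {
        return s.Substring(s.Length - len);
    }

    public static void Main()
    {
        string s = "Hello, world!";
        Console.WriteLine(Left(s, 5));      // Hello
        Console.WriteLine(Mid(s, 8, 5));    // world
        Console.WriteLine(Right(s, 6));     // world!
    }
}
```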

The Mid statement, which replaces characters in a string, is a little trickier to replace. You could use String.Remove to remove the characters to be replaced and String.Insert to insert the new characters, but doing so requires creation of two new strings. It's more efficient to create a StringBuilder from the first substring of the string to be modified, append the replacement to it, and append the tail end of the original string to that, as in:

Dim s As String = "123456xxx0"
' Substring is ZERO-BASED
Dim sb As StringBuilder = New StringBuilder(s.Substring(0, 6))
sb.Append("789")
sb.Append(s.Substring(9, 1)) ' zero-based, 10th character
s = sb.ToString()
' same as Mid(s, 7, 3) = "789" ' one-based
Dim t As String = "123456xxx0"
Mid(t, 7, 3) = "789"
Dim ret As String = String.Format("s: {0}\nt: {1}", s, t)
ret = ret.Replace("\n", Environment.NewLine)

One very odd function

Visual Basic .NET includes one very odd function, StrConv. It converts a string to another string using a variety of conversions specified by a set of bits. It can convert to upper case or to lower case using file system rules or culture-sensitive rules, or to proper case (lower case with the first letter of every word capitalized). It also offers several conversions that only make sense (and in fact will only work) if you're using an appropriate East Asian language (such as Japanese, Chinese, or Korean): to full/half width, and from Katakana to Hiragana and back (Japanese only). There is supposedly a conversion to/from simplified and traditional Chinese characters, but Dr. GUI is told this functionality doesn't work well, so he doesn't recommend using it.

There is no equivalent for most of these conversions in the .NET Framework API. However, the conversions are done using the Windows LCMapString API; if you really want to do them from a .NET language other than Visual Basic, use platform API interop to call LCMapString (just as Visual Basic apparently does).

Functions that match .NET Framework functionality

There are a bunch of Visual Basic .NET string functions that closely match .NET Framework functionality. The table below lists each Visual Basic .NET function and its closest .NET Framework equivalent. Note that in some cases the functionality is not exactly equivalent, and that in many cases the .NET Framework versions have additional overloads that provide additional functionality. Use the table to find the appropriate method, then read its documentation for the exact behavior.

Table 2. Visual Basic-specific string functions and their closest standard .NET Framework equivalents

Visual Basic .NET Function | Semantics | Closest .NET Framework Equivalent
StrReverse | Reverse string. | None; use StringBuilder and loop to reverse by swapping characters in place.
InStr, InStrRev | Find first/last index of substring. | IndexOf, LastIndexOf
LCase, UCase | Convert to lower/upper case. | ToLower, ToUpper
Format, FormatCurrency, FormatNumber, FormatPercent | Culture-sensitive formatting of values to string representation. | obj.ToString or String.Format
Str, Val | Convert number to/from string; not culture sensitive. | One of above with CultureInfo.InvariantCulture specified.
Trim, LTrim, RTrim | Trims spaces (only) from ends of strings. | String.Trim, String.TrimStart, String.TrimEnd (Can do characters besides spaces.)
Len | Returns length of string. | String.Length
Space, StrDup | Create string with repeated spaces or other characters. | String constructor that takes character and count.
Replace | Replace a substring within a string. | Replace in either String or StringBuilder.
Split, Join | Create an array of strings by breaking a string at a specified delimiter or create a delimited string from an array of strings. | String.Split, String.Join
Filter | Creates array of strings from another array that contain a substring. | No equivalent; write for-each loop, perhaps putting strings into a dynamically-expandable ArrayList.
AscW, ChrW | Convert Unicode integer to/from single-character string. | Cast from char to an integer type or vice versa.
Asc, Chr | Convert code-page integer to/from single-character string. | Use an encoding (described later).
CStr | Convert to string. | obj.ToString

Strings with Standard .NET Framework APIs

Creating String Objects

In both languages, you can create a String object with the new/New operator. The CLS-compliant String constructors take a character and an integer count, an array of characters, or a portion of an array of characters; there is no parameterless constructor and no constructor that takes a string. So, unless you have an array of characters, you'll probably use another method to create your strings. The constructor that takes a Char and an integer count produces a string with the character repeated as many times as the count specifies.

The most common ways to create a string are to initialize it with a string literal or to assign it from another string reference:

[C#]
string s = "Hello, world!";
string t = s;   // no copy; s and t refer to same string
string ret = String.Format("s and t refer to same: {0}",
   Object.ReferenceEquals(s, t));

[Visual Basic .NET]
Dim s as string = "Hello, world"
Dim t as string = s ' no copy; s and t refer to same string
Dim ret as string = String.Format("s and t refer to same: {0}", s Is t)

When is Clone not Clone?

But what if you want to have two separate strings, each containing the same value? Well, generally you do not want this. Why waste the memory? And because strings are immutable, there's not much point in having two separate strings that have the same value.

So, although String implements ICloneable, String.Clone simply returns a reference to the same string without cloning it.

All is not lost, however: You can use the static method Copy if you insist on having a second copy of the string. Note that we're checking for reference equality in two ways: first, by calling Object.ReferenceEquals, and second, by casting the references to object before testing for equality in C#, and by using the Is operator in Visual Basic.

[C#]
string s = "Hello";
string t = (string)s.Clone();   // no copy; s and t refer to same string
string u = String.Copy(s); // makes copy; s and u refer to diff. objects
string ret = String.Format("s same as t: {0}, s same as u: {1}",
   Object.ReferenceEquals(s, t), (object)s == (object)u);

[Visual Basic .NET]
Dim s as string = "Hello"
Dim t as string = CStr(s.Clone()) ' no copy; s and t refer to same string
Dim u as string = String.Copy(s)  ' makes copy; s and u refer to 2 objects
Dim ret as string = String.Format("s same as t: {0}, s same as u: {1}", _
       Object.ReferenceEquals(s, t), s Is u)

You can also copy some or all of a string into a character array with the String.CopyTo method. (Use a string constructor that takes a character array as a parameter to create a string from a character array or portion thereof.)

Other interfaces, plus properties, and a field

As we've seen, String implements ICloneable, although it does it in an odd (but reasonable) way. It also implements IComparable. That means you can compare two strings using IComparable's CompareTo method (more on this later). Although String implements IConvertible, you should use the To… methods in the Convert class to convert a string to any of the built-in value types.

String also has a few interesting properties. You can find the length (in characters, not bytes) of any string by accessing its Length property (which is read-only). And you can access the string one character at a time through the indexer property Chars (which is mapped to the C# indexer, and is also read-only):

[C#]
string s = "Hello, world!";
StringBuilder sb = new StringBuilder(String.Format(
"Length of \"Hello, world!\" is {0}\n", s.Length)); // 13
for (int i = 0; i < s.Length; i++)
   sb.AppendFormat("{0} ", s[i]);

[Visual Basic .NET]
Dim s as string = "Hello, world!"
Dim sb as StringBuilder = new StringBuilder( _
String.Format("Length of ""Hello, world!"" is {0}", s.Length)) ' 13
sb.Append(Environment.NewLine)
Dim i as Integer
For i = 0 to s.Length - 1
   sb.AppendFormat("{0} ", s.Chars(i))
Next

Note that since string characters are indexed starting with zero, even in Visual Basic .NET, we have to be careful to write our loop properly (the same issue is handled by using the < operator rather than <= in C#).

Oh, and there's an interesting static field called Empty, which contains the empty string. This gives you a language-independent way of expressing an empty string ("").

String equality/comparison methods

There are a variety of ways to compare two strings. For equality and inequality, one major difference, of course, is whether the comparison is done by reference (the two strings point to the same object), or by value (they contain the same characters).

For both equality and relational comparisons, the other major difference is whether to use the current culture's collating order or the raw ordinal values for each character in the string. (A minor difference is whether the comparison is case sensitive or not.) The relational comparison methods Compare and CompareTo default to using the current culture of the thread they're running on and to being case sensitive, which is generally what you want when ordering text for display. Equals and the C# == operator, by contrast, compare ordinal character values.

The == operator generates a call to String.Equals, which does an ordinal (binary), case-sensitive comparison of the characters in the two strings. If you want to do a reference comparison in C#, cast both of the string references to Object or use Object.ReferenceEquals. In Visual Basic .NET, use the Is operator as below, or use Object.ReferenceEquals, which is available in either language.

[C#]
      string s = "Hello", t = "there";
      bool valueComp = (s == t); // value comparison
      bool refComp1 = ((Object)s == (Object)t); // reference comparison
      bool refComp2 = Object.ReferenceEquals(s, t); // reference comparison
      string ret = String.Format("s == t: {0}, " +
         "(object)s == (object)t: {1}, ObjectRefEq(s, t): {2}",
         valueComp, refComp1, refComp2);

[Visual Basic .NET]
Dim s as string = "Hello"
Dim t as string = "there"
Dim valueComp as Boolean = s = t ' value comparison
Dim refComp1 as Boolean = s Is t ' reference comparison
Dim refComp2 as Boolean  = Object.ReferenceEquals(s, t) ' ref. comparison
Dim ret as string = _
String.Format("s = t: {0}, s Is t: {1}, ObRefEq(s, t): {2}", _
   valueComp, refComp1, refComp2)

Visual Basic .NET's equality, inequality, and comparison operators (=, <>, >, <, >=, <=) do either ordinal comparisons or culture-sensitive comparisons, depending on the compiler options and the Option Compare statement. To do a reference comparison, use the Is operator. Visual Basic .NET also has a Like operator for string comparisons that does simple pattern matching.

The Equals method comes in several flavors. There are two instance methods: one takes a String as its parameter, and one takes an Object (which must refer to a String). There is also a static version of Equals, which takes two strings. The version that takes an Object is overridden (originally defined in Object); the one that takes a String is faster because no conversion is necessary; and the static version can handle being passed null/Nothing without throwing an exception.
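The three flavors can be compared in a small C# sketch. SameText is a hypothetical wrapper added only so the null-safe static version has a name in this example.

```csharp
using System;

public class EqualsDemo
{
    // Null-safe value comparison using the static String.Equals.
    public static bool SameText(string a, string b)
    {
        return String.Equals(a, b);
    }

    public static void Main()
    {
        string s = "Hello";
        string none = null;
        Console.WriteLine(s.Equals("Hello"));           // True  (String overload)
        Console.WriteLine(s.Equals((object)"Hello"));   // True  (Object overload)
        Console.WriteLine(SameText(none, s));           // False, no exception
        Console.WriteLine(SameText(none, null));        // True
        // none.Equals(s) would throw a NullReferenceException.
    }
}
```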

For comparisons, there are three methods: Compare, CompareOrdinal, and CompareTo. All of these methods return a negative number if the first string is less than the second, zero if they have the same value, and a positive number if the first string is greater than the second.

Compare is a set of overloaded static methods that do culture-sensitive comparisons, case-sensitive by default. Each of the overloads takes at least the two strings to compare; there are also overloads to take a Boolean that specifies case sensitivity and Int32s to specify indices and length of substrings to compare (rather than comparing the whole string).

CompareOrdinal is a pair of overloaded static methods that do ordinal comparisons of two strings or substrings within the two strings.

CompareTo is an instance method that implements IComparable. It compares the current string with the string or object passed as a parameter using culture-sensitive and case-sensitive comparisons.
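Here is a hedged C# sketch of the three comparison methods. Since only the sign of the result is meaningful, the hypothetical helper Sign reduces each result to -1, 0, or 1. No expected output is shown for the culture-sensitive, case-sensitive call, because its result depends on the culture the program runs under.

```csharp
using System;

public class CompareDemo
{
    // Hypothetical helper: reduces any comparison result to -1, 0, or 1.
    public static int Sign(int comparison)
    {
        return comparison < 0 ? -1 : (comparison > 0 ? 1 : 0);
    }

    public static void Main()
    {
        // Ordinal: 'B' (0x42) has a lower code than 'a' (0x61).
        Console.WriteLine(Sign(String.CompareOrdinal("a", "B")));        // 1
        // Culture-sensitive, ignoring case.
        Console.WriteLine(Sign(String.Compare("hello", "HELLO", true))); // 0
        // Culture-sensitive, case-sensitive; depends on the current culture.
        Console.WriteLine(Sign(String.Compare("a", "B")));
        // Instance method from IComparable.
        Console.WriteLine(Sign("apple".CompareTo("apple")));             // 0
    }
}
```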

Searching a string: EndsWith/StartsWith/IndexOf/LastIndexOf/Substring

There are a bunch of methods for figuring out what's in a string. EndsWith and StartsWith return true if the current string ends/starts with the specified string, or false otherwise.

IndexOf and LastIndexOf each have a bunch of overloads. Each returns the first/last position in the current string (or a substring of it) at which the specified string or character occurs, or -1 if it does not occur.

Substring is like the old GW-BASIC MID$ function and the Visual Basic .NET Mid function, except that the index into the string is zero-based instead of one-based as in Mid/MID$. There are two overloads: One creates and returns a new string that contains the substring from the specified index to the end of the string. The other creates and returns a new string that contains the substring of the specified length starting at the specified index.
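The searching methods above can be tried out with a sketch like the following. AfterComma is a made-up helper that combines IndexOf and Substring; everything else is a direct call on String.

```csharp
using System;

public class SearchDemo
{
    // Hypothetical helper: returns the text after the first comma,
    // or the whole string if there is no comma.
    public static string AfterComma(string s)
    {
        int comma = s.IndexOf(',');
        if (comma < 0)
            return s;                   // IndexOf returns -1 when not found
        return s.Substring(comma + 1);
    }

    public static void Main()
    {
        string s = "Hello, world!";
        Console.WriteLine(s.StartsWith("Hello"));   // True
        Console.WriteLine(s.EndsWith("!"));         // True
        Console.WriteLine(s.IndexOf('o'));          // 4
        Console.WriteLine(s.LastIndexOf('o'));      // 8
        Console.WriteLine(s.Substring(7));          // world!
        Console.WriteLine(s.Substring(7, 5));       // world
        Console.WriteLine("[" + AfterComma(s) + "]"); // [ world!]
    }
}
```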

Formatting

The String class has a family of static Format method overloads. All of them provide formatting abilities the same as the ones we've been using in our calls to Console.WriteLine. They create and return a new string with the format specifier replaced by the string representation of the remaining parameters. For instance, you could use:

[C#]
string ret = String.Format("The value is {0}", 5);
// "The value is 5" (without quotes)

[Visual Basic .NET]
Dim ret as string = String.Format("The value is {0}", 5)
' "The value is 5" (without quotes)

There are other ways to format data as well, and a lot of issues around culture awareness. Unfortunately, we don't have the time to go into them this time around. Note that in general, the .NET Framework will do culturally-aware conversions and formatting, unless you pass CultureInfo.InvariantCulture to the formatting method. The Visual Basic .NET built-in methods often default to invariant culture, so be aware!

Parsing

You can convert from a string to many other types by using the Parse method of the class you want to convert to. For instance, to convert to an integer, you might use the Int32.Parse method. Note that there are other ways of doing conversions, such as using the methods in the Convert class, or you could in some cases cast the object (in C#), or use Visual Basic's CType or CStr functions.
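A minimal C# sketch of parsing follows. ParseInvariant is a hypothetical helper that passes CultureInfo.InvariantCulture so the result does not depend on the current thread's culture; whether you want that depends on where your input comes from.

```csharp
using System;
using System.Globalization;

public class ParseDemo
{
    // Hypothetical helper: parses using the invariant culture, so "6.5"
    // always means six and a half regardless of regional settings.
    public static double ParseInvariant(string text)
    {
        return Double.Parse(text, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        int a = Int32.Parse("42");
        int b = Convert.ToInt32("42");
        Console.WriteLine(a + b);                       // 84
        Console.WriteLine(ParseInvariant("6.5") * 2);   // 13
        try
        {
            Int32.Parse("forty-two");
        }
        catch (FormatException)
        {
            // Parse throws when the text is not a valid number.
            Console.WriteLine("not a number");
        }
    }
}
```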

Padding and trimming

There are a variety of methods for padding the left or right part of a string with blanks (or any character you like) called PadLeft and PadRight. There are also several methods for trimming the blanks or any set of characters you specify off the beginning and/or end of a string (TrimStart, TrimEnd, and Trim). These all create new strings and return them.
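As a quick illustration of padding and trimming, consider this C# sketch. ZeroFill is a hypothetical helper, and the brackets are there only to make the blanks visible.

```csharp
using System;

public class PadTrimDemo
{
    // Hypothetical helper: right-aligns a number in a zero-filled field.
    public static string ZeroFill(int value, int width)
    {
        return value.ToString().PadLeft(width, '0');
    }

    public static void Main()
    {
        Console.WriteLine("[" + "42".PadLeft(5) + "]");     // [   42]
        Console.WriteLine("[" + "42".PadRight(5) + "]");    // [42   ]
        Console.WriteLine(ZeroFill(42, 5));                 // 00042
        Console.WriteLine("[" + "  hi  ".Trim() + "]");     // [hi]
        Console.WriteLine("xxhixx".Trim('x'));              // hi
        Console.WriteLine("xxhixx".TrimStart('x'));         // hixx
        Console.WriteLine("xxhixx".TrimEnd('x'));           // xxhi
    }
}
```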

Insert/Remove/Replace/Concat

These methods are somewhat similar to the ones found in StringBuilder, although there are fewer overloads of them in String than in StringBuilder. See the section to follow, StringBuilder, for more on the differences.

Insert creates a new string with the specified string inserted at the specified position. Remove creates a new string with the specified number of characters removed from the specified position. Replace creates and returns a new string with all occurrences of a specified character or string replaced by another character or string. And the various overloads of the static method Concat create and return new strings by concatenating the strings and/or objects it's passed.
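These four methods might be exercised as in the sketch below. Overwrite is a hypothetical helper that chains Remove and Insert (creating two intermediate strings, which is fine for a one-off edit).

```csharp
using System;

public class EditDemo
{
    // Hypothetical helper: overwrites 'count' characters at 'index' with
    // 'replacement', using Remove followed by Insert.
    public static string Overwrite(string s, int index, int count, string replacement)
    {
        return s.Remove(index, count).Insert(index, replacement);
    }

    public static void Main()
    {
        string s = "Hello world";
        Console.WriteLine(s.Insert(5, ","));            // Hello, world
        Console.WriteLine(s.Remove(5, 6));              // Hello
        Console.WriteLine(s.Replace('o', '0'));         // Hell0 w0rld
        Console.WriteLine(s.Replace("world", "there")); // Hello there
        Console.WriteLine(String.Concat("a", 1, "b"));  // a1b
        Console.WriteLine(Overwrite("123456xxx0", 6, 3, "789")); // 1234567890
        Console.WriteLine(s);                           // Hello world (unchanged)
    }
}
```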

Split/Join

Split and Join are powerful methods. Join is a static method that returns a string created by concatenating some or all of the strings in the string array it's passed, using a specified string as a separator. Split splits a string into a newly created and returned array of strings, using the specified set of separator characters.
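A brief C# sketch of the pair working together: Redelimit is a made-up helper that splits on commas and semicolons, then joins the pieces with a new separator.

```csharp
using System;

public class SplitJoinDemo
{
    // Hypothetical helper: re-delimits a comma- or semicolon-separated list.
    public static string Redelimit(string list, string newSeparator)
    {
        string[] parts = list.Split(',', ';');
        return String.Join(newSeparator, parts);
    }

    public static void Main()
    {
        string[] parts = "red,green;blue".Split(',', ';');
        Console.WriteLine(parts.Length);                        // 3
        Console.WriteLine(parts[1]);                            // green
        Console.WriteLine(Redelimit("red,green;blue", " | "));  // red | green | blue
        // Adjacent separators produce empty entries.
        Console.WriteLine("a,,b".Split(',').Length);            // 3
    }
}
```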

Conversions

String doesn't contain any directly accessible conversion methods, but the Convert class does contain a bunch of static/Shared conversion methods, most of which convert to some built-in type and are of the form To…. The methods that convert a string to a built-in type include ToBoolean, ToByte, ToChar, ToDecimal, ToDouble, ToInt16, ToInt32, ToInt64, ToSByte, ToSingle, ToUInt16, ToUInt32, and ToUInt64; ToString converts the built-in types to a string. There is also a method, ToDateTime, to convert to a DateTime object, and methods (FromBase64String and ToBase64String) to convert between byte arrays and base-64-encoded strings (useful for transferring binary data over text protocols).

String does contain a method ToCharArray that converts to a character array. And there are methods to convert the string to upper or lower case: ToUpper and ToLower.
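A few of these conversions are sketched below in C#. RoundTrip is a hypothetical helper showing that FromBase64String undoes ToBase64String; the last lines use ToCharArray together with Array.Reverse as one possible way to reverse a string.

```csharp
using System;

public class ConvertDemo
{
    // Hypothetical helper: round-trips bytes through a base-64 string.
    public static byte[] RoundTrip(byte[] data)
    {
        string encoded = Convert.ToBase64String(data);
        return Convert.FromBase64String(encoded);
    }

    public static void Main()
    {
        Console.WriteLine(Convert.ToInt32("123") + 1);      // 124
        Console.WriteLine(Convert.ToBoolean("true"));       // True
        Console.WriteLine(Convert.ToBase64String(new byte[] { 1, 2, 3 })); // AQID
        Console.WriteLine(RoundTrip(new byte[] { 1, 2, 3 }).Length);       // 3

        char[] chars = "abc".ToCharArray();
        Array.Reverse(chars);
        Console.WriteLine(new string(chars));               // cba
    }
}
```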

Intern/IsInterned

The .NET runtime maintains a pool of literal strings in the application domain (roughly similar to a process). When it loads an assembly, such as your program, into an application domain, it merges the assembly's string literals into the application domain's pool of string literals, thereby avoiding duplication of string literals. (This can be done without changing your program's results, because strings are immutable.) The fact that all string literals are interned also means you can do reference comparisons of string literals correctly, since there's only one copy of each string literal.

However, if you build a string yourself, such as through one of the String methods that creates a new string, or by using a StringBuilder, it will be a different string from the string with the same value in the literal pool (assuming there is one), so reference comparisons will fail. This could be a big deal because reference comparisons are much faster than value comparisons—all that has to be compared is the address. (So in code that you really want to run fast, you might want to use reference comparisons if you're sure they'll work!)

If you want to add your strings into the literal pool, you can do so with the static method Intern, which adds the string to the literal pool (if it's not there already) and returns a reference to the literal pool string. For instance:

[C#]
string s2 = new StringBuilder().Append("Foo").Append("Bar").ToString();
string s3 = String.Intern(s2);  // return reference to literal pool
string s = "FooBar";            // was in literal pool all along
StringBuilder sb = new StringBuilder();
sb.Append(Object.ReferenceEquals(s2, s));   // false: different
sb.Append(", ");
sb.Append(Object.ReferenceEquals(s3, s));   // true: same

[Visual Basic .NET]
Dim s2 as string = New _
     StringBuilder().Append("Foo").Append("Bar").ToString()
Dim s3 as string = String.Intern(s2) ' return reference to literal pool
Dim s as string = "FooBar"  ' was in literal pool all along
Dim sb as new StringBuilder()
sb.Append(s2 Is s) ' false: different
sb.Append(", ")
sb.Append(s3 Is s) ' true: same

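You can also check to see if a string exists in the pool already with the static IsInterned method. Despite its name, IsInterned doesn't return a Boolean: it returns a reference to the pooled string if the string is in the pool, or null (Nothing in Visual Basic) if it isn't.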
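A minimal sketch of checking the pool follows. String.IsInterned hands back a string reference (or null) rather than a Boolean, so the sketch wraps it in a hypothetical helper, IsPooled. It also assumes that a freshly generated GUID string is not already in the pool, which is a safe bet but an assumption nonetheless.

```csharp
using System;

static class InternSketch
{
    // Hypothetical helper: true if a string with this value is in the pool.
    public static bool IsPooled(string s)
    {
        return String.IsInterned(s) != null;
    }

    static void Main()
    {
        // Built at run time, so it is not one of the program's literals.
        string unique = Guid.NewGuid().ToString();

        Console.WriteLine(IsPooled(unique));    // False (assuming no clash)

        string pooled = String.Intern(unique);  // add it to the pool

        Console.WriteLine(IsPooled(unique));    // True
        Console.WriteLine(Object.ReferenceEquals(
            String.IsInterned(unique), pooled)); // True: same pooled object
    }
}
```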

GetHashCode

You might remember that last time we discussed the idea that if you override Equals, you should also override GetHashCode. In String, Equals is overridden, so GetHashCode is as well, and provides good hashing and good performance.
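To see why this matters, here is a short sketch (the names HashSketch and SameHash are hypothetical) showing that two distinct String objects with the same value compare equal, hash to the same code, and therefore work interchangeably as Hashtable keys:

```csharp
using System;
using System.Collections;
using System.Text;

static class HashSketch
{
    // Hypothetical helper: equal values must produce equal hash codes.
    public static bool SameHash(string a, string b)
    {
        return a.Equals(b) && a.GetHashCode() == b.GetHashCode();
    }

    static void Main()
    {
        string literal = "FooBar";
        string built = new StringBuilder().Append("Foo").Append("Bar").ToString();

        Console.WriteLine(Object.ReferenceEquals(literal, built)); // False
        Console.WriteLine(SameHash(literal, built));               // True

        // A key stored under one object is found using the other.
        Hashtable table = new Hashtable();
        table[literal] = 1;
        Console.WriteLine(table.ContainsKey(built));               // True
    }
}
```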

StringBuilder

You may have noticed that many of the methods in String create and return new strings. As you may have guessed, this can be expensive if you allocate and throw away many strings.

As a rule, if you're going to do only one string operation on a particular string that involves creating a new string, go ahead and use the appropriate method in String (or the appropriate Visual Basic .NET function). But if you're doing more than one, and if the operations you need are available in System.Text.StringBuilder, create a StringBuilder from your string, do the multiple operations on the StringBuilder, and then call ToString on the StringBuilder to get the result string back. See the preceding example in which this is done all in one line.

It's very common to build a string in a StringBuilder, manipulate it, and convert it to a string. Note that the call to ToString won't actually copy the string unless you later modify the same StringBuilder object, so it's actually very efficient—the overhead for using the StringBuilder is generally one copy operation, not two. That's why we have the one-operation rule. For two or more operations, using a StringBuilder is at least as good.
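As a sketch of the build-then-convert pattern (BuilderSketch and BuildList are hypothetical names), the loop below appends many pieces to a single StringBuilder and calls ToString just once at the end, rather than creating a new String on every pass:

```csharp
using System;
using System.Text;

static class BuilderSketch
{
    // Hypothetical helper: builds "1, 2, ..., count" in one StringBuilder.
    public static string BuildList(int count)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= count; i++)
        {
            if (sb.Length > 0)
            {
                sb.Append(", ");   // separator between items
            }
            sb.Append(i);          // the Int32 overload of Append
        }
        return sb.ToString();      // one conversion at the end
    }

    static void Main()
    {
        Console.WriteLine(BuildList(3));   // 1, 2, 3
    }
}
```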

You may also want to use a StringBuilder if the overloads available are considerably easier to use than the more limited overloads in String.

To use StringBuilder you'll need to include using System.Text; (Imports System.Text in Visual Basic) at the beginning of your program file.

StringBuilder has several constructors that can initialize the object from a string and set its capacity and maximum capacity. (The default maximum capacity is about two billion characters, so it's not unreasonable to set it to a somewhat smaller value, although there's no penalty if you don't.) The capacity of the object will be adjusted upwards as needed, as long as you don't exceed the maximum capacity. The maximum capacity cannot be adjusted after the StringBuilder is constructed.

There are four properties: Capacity, which can be read or written; MaxCapacity, which is read-only; Length, the current length of the string (can be set shorter or longer); and Chars, an indexer that allows you to read and write individual characters. The EnsureCapacity method allows you to increase capacity if it's not already large enough.
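Here's a brief sketch that exercises these members. The class CapacitySketch and its Edit method are hypothetical; the starting capacity of 32 is an arbitrary choice for illustration.

```csharp
using System;
using System.Text;

static class CapacitySketch
{
    // Hypothetical helper: returns the builder so callers can inspect it.
    public static StringBuilder Edit()
    {
        // Initialize from a string and suggest a starting capacity.
        StringBuilder sb = new StringBuilder("Hello, world", 32);

        sb.Length = 5;          // truncate to "Hello"
        sb[0] = 'J';            // the Chars indexer: now "Jello"
        sb.EnsureCapacity(64);  // grow capacity if it's below 64

        return sb;
    }

    static void Main()
    {
        StringBuilder sb = Edit();
        Console.WriteLine(sb.ToString());        // Jello
        Console.WriteLine(sb.Length);            // 5
        Console.WriteLine(sb.Capacity >= 64);    // True
    }
}
```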

The Append method allows you to append a string or any type (there are many overloads) to the end of the StringBuilder. If you pass another type, its ToString method will be called and the result appended to the StringBuilder.

The AppendFormat method allows you to format the string you append to the StringBuilder. The formatting is the same as String.Format and Console.WriteLine.
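A minimal sketch of AppendFormat, using the same {0}, {1} placeholders you'd pass to String.Format (FormatSketch and Describe are hypothetical names):

```csharp
using System;
using System.Text;

static class FormatSketch
{
    // Hypothetical helper: appends literal text, then formatted text.
    public static string Describe(string name, int count)
    {
        StringBuilder sb = new StringBuilder("Status: ");
        sb.AppendFormat("{0} has {1} items", name, count);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Describe("cart", 3));   // Status: cart has 3 items
    }
}
```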

The many overloads of Insert allow you to insert a string (perhaps calculated from a call to your parameter's ToString method) anywhere in the StringBuilder.

Remove allows you to remove any number of characters from any position.

Replace allows replacement of individual characters or of substrings.

Finally, the overloads of ToString create and return a new String object from the StringBuilder (or the specified substring of it).
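The editing methods described above can be sketched in a few lines. Each call changes the same StringBuilder in place; the class and method names (EditSketch, Rework) are hypothetical.

```csharp
using System;
using System.Text;

static class EditSketch
{
    // Hypothetical helper: returns the builder after a series of edits.
    public static StringBuilder Rework()
    {
        StringBuilder sb = new StringBuilder("Hello world");

        sb.Insert(5, ",");              // "Hello, world"
        sb.Replace("world", "there");   // "Hello, there"
        sb.Remove(0, 7);                // "there" (drops "Hello, ")
        sb.Replace('t', 'T');           // "There"

        return sb;
    }

    static void Main()
    {
        StringBuilder sb = Rework();
        Console.WriteLine(sb.ToString());       // There
        Console.WriteLine(sb.ToString(0, 3));   // The (substring overload)
    }
}
```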

System.Text Encoders and Decoders

You've noticed that all of the strings in a .NET Framework program are stored as 16-bit Unicode. You may also have noticed that not everyone uses Unicode, so sometimes you'll have to convert from some other character encoding to Unicode, or from Unicode to some other character encoding.

The .NET Framework provides several classes to do encoding (converting Unicode characters to a block of bytes in another encoding) and decoding (converting a block of bytes in some encoding to Unicode characters).

There is a class for each supported encoding: ASCIIEncoding, CodePageEncoding, UnicodeEncoding (useful for converting from big-endian to little-endian), UTF7Encoding, and UTF8Encoding.

Each of these classes has methods for encoding (such as GetBytes) and decoding (such as GetChars) a single array all at once. In addition, each supports GetEncoder and GetDecoder, which return encoders and decoders capable of maintaining shift state, so they can be used with streams and blocks.
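Here is a sketch of both styles. The first helper encodes a whole string at once; the second feeds a Decoder two separate blocks, deliberately splitting a two-byte UTF-8 character across them, to show the decoder carrying state from one call to the next. The names EncodingSketch, ToUtf8, and DecodeInChunks are hypothetical.

```csharp
using System;
using System.Text;

static class EncodingSketch
{
    // All-at-once encoding: Unicode characters to UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return new UTF8Encoding().GetBytes(text);
    }

    // Block-by-block decoding: the Decoder remembers a partial
    // character left over from the first block.
    public static string DecodeInChunks(byte[] bytes, int split)
    {
        Decoder decoder = new UTF8Encoding().GetDecoder();
        char[] chars = new char[bytes.Length]; // UTF-8 never yields more chars than bytes

        int n = decoder.GetChars(bytes, 0, split, chars, 0);
        n += decoder.GetChars(bytes, split, bytes.Length - split, chars, n);

        return new string(chars, 0, n);
    }

    static void Main()
    {
        string text = "caf\u00E9";            // 'e' with an acute accent
        byte[] utf8 = ToUtf8(text);

        Console.WriteLine(utf8.Length);       // 5: the accented char takes 2 bytes

        // Split in the middle of the two-byte character.
        Console.WriteLine(DecodeInChunks(utf8, 4) == text);   // True
    }
}
```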

Regular Expressions

You may have noticed that Visual Basic .NET supports limited pattern matching through the Like operator. In addition, the .NET Framework supports some very sophisticated regular expression processing in the System.Text.RegularExpressions namespace. The key class here is System.Text.RegularExpressions.Regex, which represents a compiled regular expression and provides methods for using it, including finding and replacing regular expressions in strings, and splitting a string into an array of strings. The documentation on regular expressions is sparse now, so we won't go into it yet. Perhaps in a future column.
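Just to whet your appetite, here is a very small sketch of the three operations mentioned: finding, replacing, and splitting. The patterns and the names RegexSketch, Mask, and SplitOnCommas are hypothetical choices for illustration, not a tour of the namespace.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSketch
{
    // One or more decimal digits.
    static readonly Regex Digits = new Regex(@"\d+");

    // Replace every run of digits with '#'.
    public static string Mask(string s)
    {
        return Digits.Replace(s, "#");
    }

    // Split on commas, swallowing any whitespace around them.
    public static string[] SplitOnCommas(string s)
    {
        return new Regex(@"\s*,\s*").Split(s);
    }

    static void Main()
    {
        Console.WriteLine(Digits.Match("room 101").Value);         // 101
        Console.WriteLine(Mask("room 101, floor 3"));              // room #, floor #
        Console.WriteLine(SplitOnCommas("a , b,c").Length);        // 3
    }
}
```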

Give It a Shot!

Who's on Your .NET Framework Learning Team?

The good doctor hopes that you're not only playing with .NET, but that you're also working with some other folks. It's more fun that way, and you're guaranteed to learn more.

Some Things to Try...

Take a look at the source file. Copy it into Visual Studio and make your own modifications.

Play with strings some, and perhaps with arrays of strings.

Do some string manipulations with StringBuilder. And use some of the advanced functions of String to parse a string, perhaps a command line.

If you're really up for some fun, try encoding and decoding different character sets and code pages. If you're up for some real trailblazing, work some with the regular expression classes.

What We've Done; What's Next

This time, we talked about strings. Next time, we'll talk about arrays.