Encodings in Strings are Evil Things (Part 2)

   At the end of the last post, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points." (We did so by choosing to actively ignore glyph information, but we'll be coming back to it later.) Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer. And someone who is reading a string from a file, or from memory, needs to use the exact same encoding scheme, or we're off in la-la land. And not all encodings are equal.

   First off, the simplest route. There are 231 possible Unicode code points, and an x86 register is 32 bits wide, so let's just add a zero and encode everything as a 32-bit unsigned binary! The ISO-10646 standard calls this UCS-4. Only one catch -- it doesn't specify endianness. Of course, this poses a problem if you want to trade text files between PCs and Macs. So, UCS-4 actually is three different encodings -- UCS-4LE (little endian), UCS-4BE (big endian), and just plain UCS-4, which means that no endian is specified and you should assume that it's the host's encoding unless told otherwise. (There are ways to tell otherwise -- but I'll mention them later.)

   Now, the ISO-10646 guys recognized that the majority of the written languages used on the Internet today can be expressed using a tiny subset of the 231 symbols, and it seems a waste to use four bytes for every character if the high bytes are 0 most of the time. So, ISO-10646 also defines UCS-2, which uses a 16-bit unsigned binary, but can only represent the lower 216 code points. (The lower 216 codepoints are thus referred to as the Basic Multilingual Plane, or BMP. This includes Latin, Greek, Cyrillic, Devangari, hiragana, katakana, and Cherokee scripts, as well as many mathematical symbols and a small set of basic Han ideographs.)

   This is the first encoding we'll encounter that is non-universal -- there are some strings that are expressible using Unicode characters which UCS-2 cannot be used to encode. Sadly, UCS-2 was adopted by early versions of the Unicode specification, and so UCS-2 is what most people think of when they hear "Unicode". We can't blame them, though -- it took until 2001 for ISO to use up all 216 code points in the BMP, and by then they were adding Han ideographs in bulk.

   Next, we start diving into encodings that were invented for reverse compatibility with older standards. As we said, early versions of Unicode specifiy UCS-2 as a standard, back when nothing existed in the UCS tables beyond the BMP. When it became obvious that eventually people would need to use codepoints beyond 216, a hybrid encoding called UTF-16 was created. The Unicode Consortium reserved a high range of codepoints (D800 to DFFF) to be used as "surrogate characters," so that up to 10242 characters above the BMP border could be represented as two consecutive surrogate characters, without breaking existing UCS-2 content. This adds a brand new level of complexity to string handling, because now a single codepoint could be either 2 or 4 bytes. This makes even simple tasks such as iterating over the string with a for-loop difficult.

   Later on, UTF-32 was introduced. UTF-32 is effectively identical to UCS-4 -- its sole difference is in the specification. UTF-32 claims that it should not be used to represent characters above 0x10FFFF. (Nothing is stopping it, though -- it's still just a unsigned long int.) I mention it mostly for completeness, and so you'll recognize the name. And don't forget that all of these encodings have endianness to worry about, so we've really covered 12 encodings for Unicode so far: UCS-4(BE/LE/host), UCS-2(BE/LE/host), UTF-16(BE/LE/host), and UTF32(BE/LE/host).

    (Windows-specific digression here: WCHAR is typedef'd inside winnt.h to wchar_t, whose size is determined by the compiler you're using. On Visual C++ .NET 2004, wchar_t is currently an 'unsigned short' and uses UCS-2LE; on gcc, unless specified otherwise it's an 'int'. The encoding for gcc varies by version and by compiler setting, though, and gcc 3.3 in particular is horribly buggy and can corrupt your string literals .)

   Everything's fine and dandy thus far, except for one catch -- we can't send strings in these encodings to old webservers that use C functions like strcmp(), strlen(), strcpy(), etc. -- or any other function that relies on the presence of a null byte to denote where the string ends. Why? Because, for any string that uses only the Latin alphabet (i.e. one that you could write in plain old ASCII), the first byte in any of the above encodings will be 00.

   Because of this, there's one more standard Unicode encoding, and that's the notorious UTF-8. UTF-8 can be thought of as a relative of Huffman encoding -- it guarantees that all codepoints less than or equal to 0x7F are encoded as single unsigned bytes (i.e. direct 7-bit ASCII correspondence), and that all codepoints greater than 0x7F are encoded as a multi-byte sequence. All bytes in a multi-byte sequence have their MSB set, and the first byte of such a codepoint contains the number of bytes that follow.

   So, in exchange for being a messy variable-length format that's hard to work with, UTF-8 can encode the entire set of Unicode codepoints and guarantees that any UTF-8 string will be correctly handled by a function expecting a null-terminated string. Also, since UTF-8 is specifically meant to be handled a byte at a time, it avoids the entire messy problem of endianness.

    (Historical note: UTF stands for UCS Transformation Format. The infamous Ken Thompson created UTF-8 in 1992 on a napkin in a New Jersey diner, for use in Plan9, and reported their success with it to the 1993 USENIX conference. Unicode and ISO both formally standardized it in 2001, although the Unicode adds the extra clause that it should not be used to express codepoints above 0x10FFFF, just like UTF-32.)

   Now, I mentioned earlier that the other formats had to choose endianness: explicitly specify, or shrug and assume that it's the same as the host. There is a common solution to this -- and that's to use a marker to determine the endianness. This marker is known as the Byte Order Mark, or BOM for short, and is Unicode code point 0xFEFF ("ZERO-WIDTH NO-BREAK SPACE" -- a null symbol, effectively). If you encounter the character 0xFFFE while decoding, you know that the file you're reading was written on a machine of opposite endianness, and you should flip bytes. (Unicode code point 0xFFFE has been specifically designated as an invalid character for this purpose.) Keep in mind that you may encounter multiple BOMs in a string and may have to switch back and forth! (This could happen if, for example, you used UNIX cat to concatenate two text files, and one was UCS-2BE and one was UCE-2LE.) UTF-8, being specifically designed to be parsed on a byte-by-byte basis, does not need a BOM. 

   There's a few other standard Unicode encodings, but the big 13 above are the only ones that you see regularly. I'll mention the other ones briefly, mainly because they show up in some old internet protocols:

  • UTF-7 was an early attempt to translate Unicode points to 7-bit-ASCII text for use in MIME-encoded emails. It specified that all 7-bit characters should be transmitted as single bytes, like UTF-8. However, rather than use the eighth bit to denote a multibyte character, it overloaded the + sign as a sentinel. "+-" denoted that a normal plus should appear; for any other following character, the following three bytes were the UCS-2 encoding, re-encoded in Base64. It could not transmit anything outside the BMP. UTF-7 is used, slightly modified, in parts of the IMAP mail protocol; for POP3 and SMTP, however, it has mostly been bypassed in favor of UTF-8.

  • SCSU (Standard Compression Scheme for Unicode) was an early attempt at a variable-length encoding like UTF-8 proposed by Reuters News, that added light compression as well. However, small compression schemes like this are painfully inefficient compared to larger schemes like LZW or BWT, and they makes it very difficult to handle internally. SCSU is not used in any major protocol or file format that I know of today.

  • Punycode (RFC 3942) is similar to UTF-7 and uses the string "xn--" as a sentinel. Punycode is only used in one situation -- the IDNA (Internationalizing Domain Names in Applications) protocol used to handle use of Unicode domain names in DNS. An IDNA-capable web browser will capture a string from the address bar, translate it to ASCII text using the Punycode system, and send the converted string as a standard getaddrbyname() DNS request, and the DNS server translates it back to Unicode upon reciept before doing the lookup. If you're making a better bind, or fixing Firefox, this will be of interest to you; I do not expect to encounter files or other strings encoded in this system.

   So, that's the set of encodings that directly encode Unicode code points. There's only one catch -- there's also encodings out there that don't directly map to Unicode codepoints! In this case, we have to do an two-part mapping to get to Unicode -- first, decoding to a symbol number in the source that matches that encoding's symbol set, and then converting that to a Unicode codepoint! Yuck. And we're going to encounter a lot of these too, because these have names we recognize like ASCII Code Page 437 and ISO 8859-1 and Windows DBCS and GB and Big5 -- all those legacy formats, some of which are also variable-length like UTF-8. We've got our work cut out for us!

   Of course, I never mentioned what it is we actually were working on... I'm out to create a string class for C++ that doesn't suck. Now, there's three ways that C++'s std::string sucks. The first sin is that you have to use a backing store -- you can't tell it to use a string literal like L"Schöne Grüße" as a source, since allocator<char> requires that the target be modifiable. All contents have to be copied, because contents are always mutable. The second sin is that it assumes that the compiler and author knows what they're doing when they manipulate its contents. To C++, a basic_string<T> is really just a pretty interface on a vector<T>. The third sin may vary to some people; for me, it exists in the forms of some stupid promises that 14882 (the ISO C++ standard) wasn't willing to make, most notably that the c_str() method is capable of invalidating references, pointers, and iterators. This was mostly done to accomodate copy on write and other implementation details, but it makes writing conformant string-handling code infuriatingly difficult if you ever have to interface std::string with C functions that need C strings (such as, say, the Win32 API!).

   I'm opting to only fix the first two sins for now -- backing store handling, and encoding awareness. The third sin, you can handle according to your level of offendedness.  Tomorrow: Policy-based design using templates, minimizing conversions, and why 14882's char_traits makes it impossible to make a strictly conformant std::string that supports variable-length encodings.

Today's facts/conclusions:

  • We have to store the code points in a string somehow.

  • A lot of pain comes from wanting to retain reverse compatibility with old character sets and old encodings.

  • Large fixed-width formats like UCS-2 and UCS-4 make string manipulation very easy since they allow random access to individual code points, but are not compatible with old C functions that expect null-terminated strings.  However, keep an eye out for endianness.

  • Variable-width formats like UTF-8 are compatible with null-termination functions, but have to be parsed sequentially.