Encodings In Strings Are Evil Things (Part 1)

Article
10/18/2004

What is a string?

About six months ago at the Game Developers Conference in San Jose, I sat in on a talk about performance tuning in Xbox games. The presenter had a slide that read: "Programmers love strings. Love hurts. " This was shown while he described a game which was using a string identifier for every object in the game world and hashing on them, and was incurring a huge performance hit from thousands of strcmp()s each frame. I nodded -- but my mind was thinking, "The same would be true if they had used GUIDs, or any other large identifier. After all, strcmp is just a bounded memcmp." So, what actually IS a string?

I think it's safe to say that a string is something that a human interprets and derives meaning from. In this case, that something is almost always an ordered sequence of symbols (note: the symbols may not be co-linear!) that conveys meaning. Now, let's assume from here on that a string is an ordered sequence of 2D glyphs. A glyph is three pieces of data: a symbol, the dimensions to render that symbol at, and the location where it should be rendered. This is still describing a very abstract, human-centric thing. To express this in the programming world, we have to identify these glyphs somehow. A vector drawing or bitmap approximation of a glyph would suffice. But we don't want to require that people deal with these just to print "Hello World" to the screen. So, let's put the glyphs somewhere in the system, and assign indices to them.

And thus, we have ISO 10646, known as the Universal Character Set or UCS. UCS is a simple mapping of decimal indices (called code points) and formal names, to symbols. For example, in the UCS, code point 0x41 is "Latin capital letter A" and corresponds to, of course, the letter A. The goal of UCS is to be a superset of all character sets. So, given a set of characters such as 7-bit ASCII, or ISO 8859-1, or EBCDIC, we can find some mapping (preferably 1:1, but we're not always so lucky) to UCS. So, our definition of glyph now converts to a tuple containing a UCS code point, a size, and a distance from the last render point.

We now find ourselves asking a few more questions about glyphs. Size is fairly easy to measure -- just a box that bounds the symbol. However, distance is difficult, because good typesetting requires that the distance between characters be measured from any number of points inside that box. For simple Roman alphabets, we might want to measure from the baseline; accents might have to go relative to baseline + ascent; some characters may have an advance width that is greater than their bounding box; and this doesn't even begin to address script-based languages like Arabic!

On top of this, UCS allows two ways to represent accented characters. Most accented characters have a dedicated UCS code point; however, an accented character can also be represented as the code point for the un-accented character, followed by code points for one or more accents as stand-alone symbols. UCS calls symbols which are meant to be applied to the previous character "combining characters," and refers to symbols containing preaccented letters as "precomposed characters."

For example, the symbol Ä can be represented by either the precomposed UCS code point 0xC4 ("Latin capital letter A with diaeresis") or by the code point 0x41 ("Latin capital letter A") immediately followed by code point 0x308 ("combining diaeresis"). And don't forget that there needs to be size and direction between the diaeresis and the letter, and that there can be more than one combining character following a single symbol, including some symbols which can vary their positioning depending on their combination with other combiners!

The UCS took the easy way (or, as some would argue, the sanest way) out of dealing with all the positioning problems of glyphs -- it simply refused to acknowledge their existence. The UCS is simply a symbol table that includes combining characters, nothing more. The UCS also doesn't deal with any of the properties that we assign to specific symbols; for example, it doesn't recognize case. It cannot say that Ä and A are upper-case and a is lower-case, or that Ä and A have the same root letter and differ only by accent, or that a is the same root letter as those two -- they're simply different symbols with no relation. As a result, the UCS isn't very well known, despite the fact that it has existed for over a decade. This is where Unicode comes in.

Unicode originally started out in the late 1980s as an ad-hoc standard agreed on by a group of companies making multi-lingual software products. Initially, Unicode was developed separately from UCS; however, starting in 1991 Unicode merged its code table with UCS, and all versions of Unicode from 1.1 (June 1992) forward match the UCS. Unicode does not define glyph data, or the vectors that are used to render a symbol. However, it does provide lots of normative semantic information that UCS code points lack. For example, a Unicode code point not only contains the UCS symbol, but also data such as the symbol's case (upper/lower/title), category (letter, mark/accent, digit, punctuation, separator, etc.), and numeric interpretations of digit symbols (i.e. the symbol 4 represents four things). Alongside this, we have the Unicode Technical Standards, which define culturally appropriate comparison, sorting, and searching algorithms, character boundaries in script languages, how to handle newlines (CR/LF/CRLF/NEL), and other such handy information.

Let us assume, for now, that given an ordered sequence of Unicode code points, the OS can convert them to glyphs and render them in a way that's appropriate. Of course, this is a huge and almost entirely false assumption -- and I'll be coming back to it later. But it's also a very convenient assumption, because it allows us to reduce the definition of a string down to something that's easy to tackle: a finite ordered sequence of Unicode code points. Of course, in order to store a decimal on a computer, it has to be converted to binary. However, not all binary representations are the same, and not everyone thinks it's worth using 31 bits of information for every character. Tomorrow's episode: encoding systems, and the major character sets that love them.

Today's facts/conclusions:

Strings should be thought of as human-centric, rather than tied to a video card's interpretation of regularly-sized bits.
Strings are composed of glyphs. A glyph consists of a symbol, plus typesetting information.
There's already a standard table called ISO 10646, or UCS, that maps code points (numbers) to symbols. Unicode adds semantics like case, comparison rules, and sorting algorithms to UCS.
Typesetting information is really tricky to store portably. UCS and Unicode ignore its existence.
If the OS can be relied on to handle glyphing, we can store a string as an ordered sequence of Unicode code points.

Oh, and since this is the first post here that's visible to the public -- I'm Ryan Myers, a geek-of-all-trades currently on the Windows Client Performance team. I intend to use this blog as an ongoing set of essays about various facets of programming I've encountered. (I use essays as the textual equivalent of sitting in front of a whiteboard reasoning things out, rather than a polished report of what I wish I had done the first time. So, conclusions may change from post to post, and I welcome all comments and counterpoints.) So, pardon the mess and enjoy the show.

Encodings In Strings Are Evil Things (Part 1)

Additional resources