Encodings in Strings are Evil Things (Part 3)

    (Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long. So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts. I'll also try to pace my paragraphs more regularly. Thanks for the advice!)

   Yesterday, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for encoding and decoding code point indices on a binary computer.  At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points.  Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition.

   Before we do that, though, there's one more nasty digression into standards-land that I'd like to take.  This is a fairly general definition of what a string is, and you don't really write libraries unless you intend for them to be general-purpose enough to be reused.  So, it might be a worthwhile goal to make our new string library compatible with the string class in the C++ Standard Template Library, so that anyone could gain its benefits simply by using a different #include.  Unfortunately, the C++ Standard (which I would highly suggest purchasing if you code in C++ for a living -- it's $18 in PDF form direct from ISO) imposes some restrictions that prevent us from doing so -- namely, many parts of basic_string are hard-wired to require a constant-size encoding and will not work with encodings such as UTF-8.

   The C++ Standard starts by defining basic_string as templated on three classes -- a character type (charT), a specialization of char_traits for that type, and an allocator for that type.  (Nothing SAYS we have to implement it with exactly those template parameters, but we're screwed anyways, as you'll see.)  It then defines two member typedefs for that specialization: traits_type, which typedefs to the templated traits specialization, and value_type, which typedefs to traits_type::char_type... which, by definition, is also required to be charT.  The definition of char_traits requires that char_traits be specialized only on PODs (which are always constant-size), and its operations are all written to assume uniformly-sized characters.
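
   To make the typedef chain concrete, here is a small compile-time check.  It's only a sketch: it assumes a C++11 compiler (for static_assert and <type_traits>), and it inspects the standard std::string and std::wstring rather than anything from our own library.

```cpp
#include <string>
#include <type_traits>  // std::is_same (C++11)

// value_type is, by definition, the character type the string is templated on.
static_assert(std::is_same<std::string::value_type, char>::value,
              "std::string stores plain char elements");

// traits_type is the char_traits specialization for that same type...
static_assert(std::is_same<std::string::traits_type,
                           std::char_traits<char> >::value,
              "std::string uses char_traits<char>");

// ...and the traits class names the same character type again.
static_assert(std::is_same<std::string::traits_type::char_type, char>::value,
              "char_traits<char>::char_type is char");

// The wide string follows the same pattern with wchar_t.
static_assert(std::is_same<std::wstring::value_type, wchar_t>::value,
              "std::wstring stores wchar_t elements");
```

   Every one of these types has a single, fixed sizeof; there is nowhere in the chain to say "this character is three elements long."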

   If the traits problem weren't enough, on top of that, a conformant basic_string implementation requires that s[i] return the same value as s.data()[i], and data is required to return a const charT *.  So, even if we could get around the traits problem, variable-length encodings still screw us, because indexing by character and indexing by pointer offset will no longer agree.
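
   Here's a quick illustration of that mismatch, using a plain std::string that happens to hold UTF-8 bytes.  This is a sketch rather than library code: utf8_code_points, sample, and demo are hypothetical helper names, and the counter assumes its input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// 'n', U+00E9 (UTF-8 bytes C3 A9), 'e'.  The literal is split in two
// so the hex escape does not swallow the trailing 'e' as a hex digit.
inline std::string sample() { return std::string("n\xC3\xA9" "e"); }

inline void demo() {
    const std::string s = sample();
    assert(s.size() == 4);             // size() counts charT elements (bytes)
    assert(utf8_code_points(s) == 3);  // but the text has three characters
    assert(s[1] == s.data()[1]);       // the Standard's guarantee holds...
    assert(static_cast<unsigned char>(s[1]) == 0xC3);  // ...so s[1] is half a character
}
```

   The guarantee is satisfied, but only because operator[] is indexing bytes.  An operator[] that returned the second *character* could not also equal data()[1].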

   So, we will have to abandon hopes of being a drop-in replacement for basic_string.  But, really, this isn't too bad -- there are only three other libraries in the STL that require the use of basic_string!  The first is in locale, and hardly anyone uses C++'s built-in locales anyways, favoring OS functionality.  The second is the bitset container, which hardly anyone uses either.  The third is its use as the backing store for stringstreams, through stringbuf (itself a streambuf, the abstraction that underlies iostream), and this is a bigger loss.

   The loss of direct compatibility with stringbuf is a big pain.  However, by the time you get to I/O, you need to have already converted your string to the encoding your user is expecting -- a prompt that wants ASCII can't be expected to deal with a stream of UCS-2 characters!  So, it's perfectly okay if stringbuf is left alone, as long as we find a way to convert strings between different encodings.  That leaves stringstreams as the only real loss, and we can make our own stringstream, if need be.  (Thanks to templates, we may be able to avoid having to re-invent the wheel, which is always good.)

   I'm going to start with policy-based design, which Alexandrescu introduced a few years ago in Modern C++ Design.  (Actually, the STL beat him to the punch by using allocators as a template argument for most of its containers, but he popularized its use for general customization.) In fact, he already demonstrated policy-based design in a CUJ article a year or two ago by making a basic_string replacement that allowed customizing copy-on-write semantics -- but I'm a bit more ambitious :)

   My first stab at the class will be based directly off our most recent definition of string -- an encoding, and an ordered sequence of bits:

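namespace rmlibs {

namespace encodings {
/* ... utf8, iso8859_1, big5, mac_roman, etc. go here ... */
}  // namespace encodings

namespace backing_stores {
/* ... string_literal, vector_of_uchars, etc. go here ... */
}  // namespace backing_stores

template <class Encoding, class Bits> class rmstring {
public:
    typedef Encoding encoding_type;

private:
    Bits _data;  // the raw bits, interpreted according to Encoding
};

}  // namespace rmlibs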


   Not much, but it's a start!

   At this point, I want to reference something I said earlier about I/O -- when you're doing I/O, whether that's taking a string in or sending a string out, your stream of bits needs to have the same encoding as the device you're talking with, or Bad Things happen.  We need some way to denote, inside code, that an encoding change needs to take place.  (Guessing ahead, this will probably be the most tedious part of development -- creating UCS-to-encoding and encoding-to-UCS transitions for each encoding and character set we support.)  I'm going to take a cue from the excellent Boost library here, and make an analogue to their lexical_cast function template.

namespace rmlibs {

// these are the major exceptions...

class missing_symbol;
class malformed_data;

// ... that are thrown by:

template <typename Target, typename Source> Target encoding_cast(Source str);

}  // namespace rmlibs

   In the near future I'll probably alter this to take only rmstrings as input and output and template on encoding types in/out, since right now it accepts any pair of types -- but this is only a prototype.
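
   To give a feel for what a single conversion behind encoding_cast might involve, here's a minimal sketch of ISO 8859-1 <-> UTF-8 transcoding.  Everything in it is an assumption for illustration: the sketch namespace, the free functions latin1_to_utf8 and utf8_to_latin1, and especially the meanings given to the two exceptions (missing_symbol for "the target encoding has no such symbol," malformed_data for "the input isn't valid in the source encoding").  The real definitions may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

namespace sketch {  // hypothetical; not the real rmlibs

// ASSUMED meanings -- the article has not defined these exceptions yet.
struct missing_symbol : std::runtime_error {
    missing_symbol() : std::runtime_error("symbol not in target encoding") {}
};
struct malformed_data : std::runtime_error {
    malformed_data() : std::runtime_error("input invalid in source encoding") {}
};

// ISO 8859-1 -> UTF-8.  Cannot fail: every Latin-1 byte is a code point.
inline std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: 1 byte
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}

// UTF-8 -> ISO 8859-1.  Fails for code points above U+00FF.
inline std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c == 0xC2 || c == 0xC3) {  // U+0080..U+00FF
            if (i + 1 >= in.size()) throw malformed_data();
            const unsigned char d = static_cast<unsigned char>(in[++i]);
            if ((d & 0xC0) != 0x80) throw malformed_data();
            out += static_cast<char>(((c & 0x03) << 6) | (d & 0x3F));
        } else if (c >= 0xC4 && c <= 0xF4) {
            // Lead byte of a code point above U+00FF.  Simplification:
            // reported without validating the rest of the sequence.
            throw missing_symbol();
        } else {
            // Stray continuation byte, overlong lead (C0/C1), or F5..FF.
            throw malformed_data();
        }
    }
    return out;
}

inline void demo() {
    assert(latin1_to_utf8("caf\xE9") == "caf\xC3\xA9");
    assert(utf8_to_latin1("caf\xC3\xA9") == "caf\xE9");
    bool missing = false;
    try { utf8_to_latin1("\xE2\x82\xAC"); }  // U+20AC, the euro sign
    catch (const missing_symbol&) { missing = true; }
    assert(missing);
}

}  // namespace sketch
```

   Note the asymmetry: going to UTF-8 always succeeds, while going back can fail in two distinct ways.  That's the sort of distinction two separate exception types would let a caller act on.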

   The goal of doing this is to minimize conversions.  Some of my coworkers who have been kind enough to proofread have remarked, "I'd just throw up my hands and convert everything internally to UCS-4 and use a basic_string<unsigned long>; after all, memory is cheap."   In a way, they're right -- doing this would mean I'd only have to write encoding_cast() for each encoding, and not even need the new string class.  But, I'm a performance guy, a bit twiddler at heart.  I don't want to do a conversion unless I need to, or unless the performance gains from a fixed-width format like UCS-4 outweigh the performance loss of having to trans-code everything.

   (It's rather like image formats -- TGA is lossless and can hold damn near anything, but that doesn't mean we always convert everything to TGA first before working with it, and then convert back when we're done.  Not everything has to be "worked on," and not all work is equally difficult.  This is especially true if we're using a compile-time string literal as a backing store, since it won't be modifiable unless you make a copy!)

   The general plan is to make rmstring a Facade for the Encoding class we're templated on.  Most of rmstring's methods will actually call the Encoding class and pass in state and a pointer to our Bits object as needed; the Encoding class will handle all the work of character traversal.  Since many of the encodings we're planning to deal with are fixed-width (UCS-2, UCS-4, and most old systems like ISO 8859 and ASCII), I'll likely create a FixedWidthEncoding base class that does most of the work of locating offsets and insertion/deletion, and inherit most of the Encodings from it.  This means the main thing that will be unique to each Encoding is the translation table used for converting a non-Unicode symbol set to Unicode code points, since most of the older encodings are simple fixed-width affairs that just have non-standard symbol sets.
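
   As a rough sketch of how that division of labor could look: the names below (the sketch namespace, fixed_width_encoding, length, byte_offset, make_bytes) are all hypothetical, the policies are stateless for simplicity, and translation tables and insertion/deletion are left out entirely.  It only shows the facade forwarding to an Encoding policy, with fixed-width encodings sharing one base.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical; not the real rmlibs

// Shared base for encodings where every code point is Width bytes,
// so locating a character is plain arithmetic.
template <std::size_t Width>
struct fixed_width_encoding {
    template <class Bits>
    static std::size_t length(const Bits& bits) { return bits.size() / Width; }
    template <class Bits>
    static std::size_t byte_offset(const Bits&, std::size_t index) {
        return index * Width;
    }
};

struct iso8859_1 : fixed_width_encoding<1> {};
struct ucs4      : fixed_width_encoding<4> {};

// Variable-width encoding: has to walk the bytes (assumes valid UTF-8).
struct utf8 {
    static bool is_lead(unsigned char b) { return (b & 0xC0) != 0x80; }
    template <class Bits>
    static std::size_t length(const Bits& bits) {
        std::size_t n = 0;
        for (std::size_t i = 0; i < bits.size(); ++i)
            if (is_lead(bits[i])) ++n;
        return n;
    }
    template <class Bits>
    static std::size_t byte_offset(const Bits& bits, std::size_t index) {
        std::size_t seen = 0;
        for (std::size_t i = 0; i < bits.size(); ++i) {
            if (!is_lead(bits[i])) continue;
            if (seen == index) return i;
            ++seen;
        }
        return bits.size();  // index is past the last character
    }
};

// The facade: owns the bits, forwards all traversal to Encoding.
template <class Encoding, class Bits>
class rmstring {
public:
    typedef Encoding encoding_type;
    explicit rmstring(const Bits& bits) : _data(bits) {}
    std::size_t length() const { return Encoding::length(_data); }
    std::size_t byte_offset(std::size_t index) const {
        return Encoding::byte_offset(_data, index);
    }
private:
    Bits _data;
};

typedef std::vector<unsigned char> bytes;
typedef rmstring<utf8, bytes>      utf8_string;
typedef rmstring<iso8859_1, bytes> latin1_string;

inline bytes make_bytes(const char* s, std::size_t n) { return bytes(s, s + n); }

inline void demo() {
    // 'n', U+00E9 as UTF-8 (C3 A9), 'e': four bytes in total.
    const bytes b = make_bytes("n\xC3\xA9" "e", 4);
    assert(utf8_string(b).length() == 3);       // three code points
    assert(utf8_string(b).byte_offset(2) == 3); // 'e' starts at byte 3
    assert(latin1_string(b).length() == 4);     // same bits, other encoding
}

}  // namespace sketch
```

   The same four bytes report a different length depending only on the Encoding parameter, which is the whole point of keeping the bits and their interpretation separate.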

   Tomorrow, we'll start fleshing out rmstring's body with constructors and methods, and explain what those two exceptions next to encoding_cast are for.  We'll also take a brief look at screen-readers and web browsers, and make a change to encoding_cast to handle "looks-close-enough" trans-codes.

Today's facts/conclusions:

  • The definitions of basic_string and char_traits in the C++ Standard prevent use of variable-width encodings; therefore, we cannot make a perfect drop-in replacement for the STL string class.  However, that's okay -- the only STL object we'll have to duplicate functionality for is stringstream.

  • We can't expect I/O with external devices/programs to conform to whatever encoding we want -- they're expecting a specific encoding, and we need to present our data in that format -- or die a horrible, painful death.  So, the ability to trans-code is absolutely necessary.

  • Trans-coding can be expensive, but can have some gains, especially if going to UCS-4 for speed in manipulation or going to UTF-8 for compatibility with legacy C APIs.  Do it when it's necessary or when the gains justify the cost, and avoid it otherwise.  The coder should be allowed to pick an encoding and work with strings in that encoding as easily as possible.