Encodings in Strings are Evil Things (Part 4)

   In our last episode, we established that we wouldn't be able to make a true std::string replacement and still handle variable-width encodings.  So, we started with the beginning lines of an rmstring class.  However, this doesn't mean we are going to dispense with std::string entirely!  But first, a quick answer about my choice of names and an explanation about exceptions.

   A friend of mine asked me yesterday, "Don't you intend to make a basic_rmstring and then have a typedef'd rmstring that hardwires a specific specialization, like ASCII?"  I'm considering this -- but if I hardwire anything, it will not be the encoding type.  Trying to abstract away the encoding as hidden information is exactly the thinking that got us into this mess with std::string!  However, what we use for the backing store might be worth standardizing.  After all, using a vector<byte> to contain our bitstream is a very flexible choice; it's just not the best-performing one.  Whenever possible, we should make a library easy to use on the surface, and expose the guts of it to be changed once someone already has the program running and is trying to improve on it (by, for example, using string literals as backing stores and only copying them to heap memory when needed.)

   In a dream world, we would typedef a partial specialization.  However, we get bit by one of the most annoying mis-features in C++ -- you can't template a typedef.  Even the STL is crippled by this, and has to work around it using its ::rebind member.  So, the best we could do is allow someone to #define rmstring(enc) basic_rmstring<enc, vector_of_bytes>, and declare a string as rmstring(iso8859_1) str;.  It'd work, but it makes me cringe.  Alternately, we could use a rebind approach like the STL: 

template <class Enc> struct rmstring {
   typedef basic_rmstring<Enc, vector_of_bytes> type;
};

rmstring<iso8859_1>::type str;

   Really, both of them are pretty damned ugly; the preprocessor approach is prettier, IMHO, but is also considerably more dangerous.  So, I'm going to leave it as rmstring with two template values for the purposes of this blog.  Eventually I'll probably opt for the #define for my own version of the library, but you can choose whichever is more appealing to you (conciseness versus typesafety), or choose neither.
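For the curious, here's a minimal sketch showing that both workarounds name the exact same type.  The stub classes are placeholders, and I've renamed the macro to rmstr so it doesn't collide with the rebind struct in one translation unit:

```cpp
#include <cassert>
#include <typeinfo>

// Stub types standing in for the real ones -- illustrative only.
struct iso8859_1 {};
struct vector_of_bytes {};
template <class Enc, class BS> class basic_rmstring {};

// Workaround 1: the preprocessor macro (renamed here to avoid a clash).
#define rmstr(enc) basic_rmstring<enc, vector_of_bytes>

// Workaround 2: the STL-style rebind struct.
template <class Enc> struct rmstring {
    typedef basic_rmstring<Enc, vector_of_bytes> type;
};
```

Both rmstr(iso8859_1) and rmstring<iso8859_1>::type expand to basic_rmstring<iso8859_1, vector_of_bytes>, so typeid reports them as identical.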

   The second thing I wanted to answer from yesterday was those two exceptions, missing_symbol and malformed_data, that I listed next to the encoding_cast() function.  What on earth are they for?  First off, imagine that you're trying to convert a string from UCS-4 to UCS-2.  As I mentioned in Part 2, UCS-2 is a non-universal encoding, and there are some code points that it cannot represent.  What happens if our UCS-4 string contains one of those code points?  In this case, we will throw the missing_symbol exception.  We will also throw it in the case of converting to legacy character sets that simply do not have a code point defined for a symbol.
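Here's a sketch of what those two exceptions might look like.  The shared encoding_error base class is my own assumption -- only the two leaf names have appeared in this series so far:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical common base, so callers can catch both in one handler.
struct encoding_error : std::runtime_error {
    explicit encoding_error(const std::string& what) : std::runtime_error(what) {}
};

// Thrown when the target encoding has no code point for a source symbol
// (e.g. squeezing a code point beyond the BMP into UCS-2).
struct missing_symbol : encoding_error {
    explicit missing_symbol(const std::string& what) : encoding_error(what) {}
};

// Thrown when a buffer being decoded contains an illegal sequence
// (e.g. an overlong UTF-8 encoding).
struct malformed_data : encoding_error {
    explicit malformed_data(const std::string& what) : encoding_error(what) {}
};
```

With a common base, code that doesn't care which failure occurred can just catch encoding_error.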

   There's something to keep in mind, though.  The popularity of JPEG proves that a lossless transform is not always necessary.  Imagine that we have the ligature Æ -- is it acceptable to convert this to two characters, AE?  The proper answer is neither yes nor no; it's "sometimes."  Remember, all this time, our definitions of string have been derived from a definition of symbols that a human interprets -- and this means that whether or not a 'close enough' translation is acceptable depends on who's looking at the string.  Imagine that a blind person is using a screenreader (a program that uses a computerized voice to read text as it appears on the screen).  In that case, there's a vast difference between Æ and AE.  For a person with normal sight reading a webpage, however, the two might be interchangeable.

   The computer scientist in me says that I should only allow lossless transforms -- the engineer in me knows better, though, and there's a way to satisfy both.  Therefore, we are going to add a third template argument to yesterday's definition of encoding_cast, and allow it to have a default specialization.  This default specialization will be called the "symbol clash resolver" and has a well-known method invoked whenever a missing symbol problem occurs.  The default one, lossless_resolver, will throw missing_symbol in all cases.  A user can define alternatives, though.

   Two possible alternatives immediately occur to me -- one called visual_parity_resolver that does replacements like the above, and another called error_symbol_resolver that acts like RS232's error character, inserting a compile-time constant instead (such as a box symbol, or an "<ERROR>" string, or whatever suits the user) whenever a symbol cannot be translated.  But those can all wait for later -- only lossless_resolver needs to be immediately defined, and its definition is trivial, since it just throws :)
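A minimal sketch of how these resolver policies could look.  The three class names come from the text above, but the static resolve() signature (taking the untranslatable code point, returning replacement characters) is my assumption, and the Æ entry in visual_parity_resolver is an illustrative one-entry table:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Redefined locally so this sketch is self-contained.
struct missing_symbol : std::runtime_error {
    explicit missing_symbol(const std::string& what) : std::runtime_error(what) {}
};

// Default policy: refuse any lossy conversion by throwing.
struct lossless_resolver {
    static std::string resolve(unsigned long /*code_point*/) {
        throw missing_symbol("no equivalent symbol in target encoding");
    }
};

// Substitute a fixed, compile-time-chosen error marker instead.
struct error_symbol_resolver {
    static std::string resolve(unsigned long /*code_point*/) {
        return "<ERROR>";
    }
};

// Replace a symbol with a visually similar run of characters when one
// is known; fall back to throwing otherwise.
struct visual_parity_resolver {
    static std::string resolve(unsigned long code_point) {
        if (code_point == 0xC6UL) return "AE";  // U+00C6, the ligature AE
        return lossless_resolver::resolve(code_point);
    }
};
```

Because the resolver is a template argument to encoding_cast, the choice costs nothing at runtime for the lossless default.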

   The other exception, malformed_data, is thrown when we try to decode a buffer that has an error in it.  In the case of UTF-8, there are sequences that decode to illegal or nonsensical numbers, and if we are asked to decode these sequences, we should let the user know.  Imagine a scenario where you are writing an Internet server daemon, and expect to receive a UTF-8 encoded string as the first transmission following a client successfully connecting.

   In this scenario, we recv() the data from the server into a buffer, and then construct an rmstring<utf8, unmanaged_pointer> to read it.  If there was an error in network transmission, or a malicious client was testing our ability to handle bad data, we should communicate this to the programmer as an error.  Thus, if an encoding can detect illegal input (very few encodings can!) it may throw a malformed_data exception if you invoke any operations that hit that input, or if you attempt to trans-code it.  We will also probably want to make a compile-time flag visible on the encoding class that determines whether or not it can have malformed data.
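A sketch of that compile-time flag.  The flag name can_detect_malformed_data and the needs_validation() helper are both assumptions of mine, not part of the series' design so far:

```cpp
#include <cassert>

// Each encoding class advertises, at compile time, whether decoding it
// can detect malformed input.  (Names here are hypothetical.)
struct utf8 {
    static const bool can_detect_malformed_data = true;
};
struct iso8859_1 {
    // Every possible byte is a valid ISO 8859-1 code point, so there is
    // nothing to detect.
    static const bool can_detect_malformed_data = false;
};

// A decoder can consult the flag and skip validation entirely for
// encodings where bad input is undetectable.
template <class Encoding>
bool needs_validation() {
    return Encoding::can_detect_malformed_data;
}
```

Since the flag is a compile-time constant, a decent compiler will drop the dead validation branch entirely for encodings like iso8859_1.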

   So, with those two issues resolved, let's get down to our dirty business!

   I said earlier that we had to pick one of two mutually exclusive goals: be a perfect drop-in replacement for std::string, or support variable-width encodings such as UTF-8.  Since I think std::string is poorly designed and I demonstrated that not being string-compatible is only a loss for stringstream compatibility, I'm favoring the latter.  (Just hating std::string alone would not be sufficient reason -- in that case I'd just be suffering from NIH syndrome.)

   However, this doesn't mean that I can just go roll my own string class in the way that best suits my urges.  Many programmers have devoted considerable time and energy to learning std::string's ins and outs, myself included -- so, I should exploit that knowledge by providing similar functions with similar arguments, as long as it doesn't compromise my design's principles.

   Looking at basic_string's definition in the C++ Standard is an exercise in mental stamina.  It defines six constructors (one of which requires some very special trickery with templating and the SFINAE principle to implement, as we'll see later) and over 100 methods, plus a host of non-member operators such as << and +.  However, looking at the expected behavior for each function, most of them are overloads that call a base function.

   In other words, a basic_string has one or two core definitions at most for each core method (such as append(), replace(), insert(), etc.), which take basic_strings as their input.  Every other overload is defined as equivalent to calling that root function, with a basic_string constructor meant to convert some other form of string (char pointer, run of chars, pair of iterators, etc.) to a basic_string that the "core implementation" can grok.

   Of course, they don't all implement them like that, because it'd mean frivolously making a copy of the input data in basic_string form for each trivial overload.  Instead, a typical implementation of std::string has an optimized version for each variant, making maintenance a nightmare.  But we don't have that problem -- because, instead of requiring an STL allocator, we can accept an arbitrary backing store!  So, suppose we have a working implementation of append:

template < class Encoding, class BackingStore > class rmstring {
public:
    // Appends n code points of str, starting at pos, to the string.
    // * Throws an out_of_range exception if pos >= str.length()
    // * If pos is in range, but pos + n > str.length(), n is truncated so that pos + n = str.length().
    // * Throws a length_error exception if the resulting string would be larger than BackingStore's max_size().
    template < class OtherBS >
    rmstring & append( rmstring<Encoding, OtherBS> const & str, size_type pos, size_type n ) {
        /* implementation */
    }
    // ...
};
   (Note that I've defined the above in terms of code points, not symbols.  There can be multiple codepoints representing a single symbol.  I'll discuss this problem, and the related problem of Unicode normalization forms, in a later post -- namely because I'm still working on a solution.  :-P This is a learning exercise for me too!)

   Because OtherBS is arbitrary, we can directly implement the other overloads of append() as calls to append() with an rmstring constructor, without worrying about needlessly duplicating information.  If we want to use a char * from an ANSI C function, we can just use an unmanaged_pointer backing store.  If we want to use n repetitions of some character c, we can just use a run_of_chars<n, c> backing store.  We pass the exact same information as if we were doing it the old way, but abstracted inside a templated class, so there's no overhead except at compile time.  Beautiful!
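To make the delegation trick concrete, here's a stripped-down sketch.  I've dropped the Encoding parameter and treated bytes as code points purely to keep it short -- the backing-store interfaces (bytes(), size(), append_bytes()) are illustrative inventions, not the real rmstring internals:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Heap-owned storage, the general-purpose default.
struct vector_of_bytes {
    std::vector<char> buf;
    const char* bytes() const { return buf.empty() ? "" : &buf[0]; }
    std::size_t size() const { return buf.size(); }
    void append_bytes(const char* p, std::size_t n) { buf.insert(buf.end(), p, p + n); }
};

// A non-owning view over an existing C string: no copy is ever made.
struct unmanaged_pointer {
    const char* p;
    std::size_t n;
    explicit unmanaged_pointer(const char* s) : p(s), n(std::strlen(s)) {}
    const char* bytes() const { return p; }
    std::size_t size() const { return n; }
};

template <class BackingStore>
class rmstring {
public:
    BackingStore store;
    rmstring() {}
    explicit rmstring(const BackingStore& bs) : store(bs) {}

    // The one core append: works for ANY other backing store.
    template <class OtherBS>
    rmstring& append(const rmstring<OtherBS>& str) {
        store.append_bytes(str.store.bytes(), str.store.size());
        return *this;
    }

    // The char* overload wraps the pointer in a temporary view and
    // forwards to the core -- same information, zero extra copies.
    rmstring& append(const char* s) {
        return append(rmstring<unmanaged_pointer>(unmanaged_pointer(s)));
    }

    std::string str() const { return std::string(store.bytes(), store.size()); }
};
```

The char* overload is one line, yet no buffer is duplicated: the temporary rmstring<unmanaged_pointer> is just a pointer and a length.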

   So, what should we implement from std::string?  Here are the core functions from basic_string that seem worth carrying over:

  • Size functions: size() and length(), max_size(), capacity(), reserve(), resize(), empty(), clear()

  • Iterators: begin(), end(), rbegin(), rend()

  • Accessors: operator[], at()

  • Replacers: assign(), operator=

  • Appenders: push_back(), push_front(), append(), operator+=, operator+

  • Modifiers: insert(), erase(), replace()

  • Searchers (evil): find(), rfind(), find_first_of(), find_last_of(), find_first_not_of(), find_last_not_of()

  • Utilities: substr(), copy(), swap()

  • Comparators (also evil): compare(), operator==, operator!=, operator<, operator>, operator<=, operator>=

  • Streams: operator<<, operator>>

  • Backwards compatibility: c_str(), data()

   That's a lot of stuff to implement!  But not only does it gain us good-will by allowing programmers to code much like they did with std::string, it also means that we can make a typedef rmstring<RMS_COMPILER_SPECIFIC_ENCODING, vector_of_bytes> rstring, and be pretty damned close to std::string-equivalent.  (The compiler-specific encoding can be set in a header file, or specified on the command line -- I'll likely set it to iso8859_1 for string and ucs2 for wstring in a header.)
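Here's what that "pretty damned close" typedef might look like in practice, with stub types standing in for the real classes; the macro-default mechanism shown is one plausible way to make the encoding compiler-configurable:

```cpp
#include <cassert>
#include <typeinfo>

// Stubs so the typedef compiles -- names from the article, bodies empty.
struct iso8859_1 {};
struct vector_of_bytes {};
template <class Enc, class BS> class rmstring {};

// The per-compiler default encoding would normally be set in a header
// or on the command line; iso8859_1 stands in for it here.
#ifndef RMS_COMPILER_SPECIFIC_ENCODING
#define RMS_COMPILER_SPECIFIC_ENCODING iso8859_1
#endif

typedef rmstring<RMS_COMPILER_SPECIFIC_ENCODING, vector_of_bytes> rstring;
```

A build that wants a different default just passes -DRMS_COMPILER_SPECIFIC_ENCODING=some_other_encoding and every rstring in the program follows along.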

   But before I get to that, I'll have a nastier problem to tackle, and that's combining characters.  Not only do we have code points that can take up variable amounts of space (thanks to encoding), but we also have symbols that can take up variable amounts of code points!  (See Part 1 and search for "diaeresis" if you're not sure why this is.)  Unicode, luckily, comes to the rescue again with a standard that determines when and how a symbol should or should not be broken down into combining characters.  These are called normalization forms, and we'll tackle those on Monday.

   Next episode: Normalization forms and chain of command (which does not involve rmstring covering its ass if things go FUBAR).

Takeaways from Part 4:

  • We're specifically designing rmstring to force the programmer into awareness of encodings -- we don't want to hide that with a basic_rmstring being typedefed.  (We couldn't anyway, because we can't template typedefs.)  So, for now, we'll leave it as-is.

  • Not only are all encodings unequal -- not all trans-coding schemes are equal either!  Be aware of this, and think about how you want to handle errors!

  • Even if we think std::string is evil, we can still gain good will from our potential users by making ourselves as close to std::string as possible.  This, unfortunately, means lots of work.  But not as much as if we were actually implementing std::string, due to our luck in choosing to template our backing store.

  • However, all our methods need to be defined in terms of symbols, not code points (and certainly not bytes of encoded data!).  This makes our life difficult again.