Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided
Windows and the .NET Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often it's not a good idea. In WideCharToMultiByte() this behavior is controlled by the WC_NO_BEST_FIT_CHARS flag. In .NET you can use the EncoderFallback to control whether or not you get best fit behavior. Unfortunately, in both cases best fit is the default behavior. In Microsoft .NET 2.0 best fit is also slower.
The underlying problem that best fit behavior tries to solve is "Gee, Unicode has about a gajillion more characters than 1252, how do we get them all in?" Unfortunately that's exactly the problem: they won't all fit. Code page 1252 has 256 characters, and the nearly 100,000 Unicode characters just won't fit. So what best fit tries to do is cram as many characters as possible into the limited set of the code page by mapping them to things they might look like. So c with a dot above, ċ, U+010B, is mapped to a plain old c with no dot. Japanese fullwidth forms are mapped to their half-width forms, etc. There are lots of problems with this solution, and I'll mention some of them here:
- The mappings are somewhat random and sometimes bizarre. The infinity symbol, ∞, U+221E, is mapped to 8. Sure, it looks like a sideways 8, but it's sideways, and its meaning is very different.
- The mappings are somewhat random and inconsistent between code pages. In some code pages Japanese fullwidth forms are "best fit" to the non full-width form, in others they are not.
- The best fit behavior has not been updated in years, so newer code points aren't covered. For example, ć U+0107 c with acute, ĉ U+0109 c with circumflex, ċ U+010B c with dot above, č U+010D c with caron and ｃ U+FF43 fullwidth c are all mapped to c in code page 1252. However, ƈ U+0188 c with hook, ɕ U+0255 c with curl, с U+0441 Cyrillic es, ḉ U+1E09 c with cedilla and acute, and others are not mapped and turn into ?. Also, ç U+00E7 c with cedilla doesn't change, since it has its own character in 1252.
- Many mappings lead to security holes. A common test for "." and other characters, meant to prevent ".." style attacks on paths, fails if fullwidth forms are used and not tested for. Since fullwidth forms are often mapped, any English string, like a user name or password, can also have multiple variations, leading to security holes. Even if fullwidth forms are accounted for, other mappings with diacritics, as mentioned in the previous bullet, exist for common English characters.
- Most of the best fit mappings in our tables were thought of by English speaking Americans and could be culturally inappropriate for other locales.
- ü and u aren't the same character. Düsseldorf has the alternate spelling Duesseldorf, replacing the ü with ue, not u. In languages that use diacritics, the pronunciation of the character changes. If you made mailing labels for your customers, would you really want to change their name? Best case, the spelling looks stupid and the customer thinks "gee, these guys have an old computer too". Worst case, you turned their name to crap... literally. In that case ? would probably be better; at least your customer would probably understand it was a computer limitation :)
- For typical US English spellings UTF-8 is exactly as space efficient as ASCII or 1252. So if you use UTF-8 you won't need best fit and it won't cost you anything. In .NET and Windows Vista UTF-8 is also much faster than 1252 or ASCII. Even in other languages UTF-8 is faster, and for most languages it doesn't even create significantly larger file sizes. A small price to pay to ensure that you don't corrupt your data.
- Best fit doesn't even help some of the alternate English spellings, such as the ae, those just turn into ? anyway.
- As I alluded to above, frankly our best fit mapping is pretty poor, even if that behavior is desirable. We're inconsistent with the behavior for different code pages, we haven't added new characters, and we've made some strange decisions.
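To make the security bullet above concrete, here is a minimal Python sketch. It does not call WideCharToMultiByte; `simulated_best_fit` is a hypothetical stand-in that only folds fullwidth forms onto ASCII, and `looks_safe` is an invented example of a naive path check. The point is just that a check done before conversion can miss what the string becomes after conversion.

```python
def simulated_best_fit(text: str) -> bytes:
    """Hypothetical stand-in for a best-fit conversion to code page 1252.

    Only one kind of mapping is modeled: fullwidth forms U+FF01..U+FF5E
    are folded onto their ASCII look-alikes U+0021..U+007E. Anything else
    that 1252 cannot represent becomes '?'.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            out.append(cp - 0xFEE0)  # fullwidth form -> ASCII counterpart
        else:
            out += ch.encode("cp1252", errors="replace")
    return bytes(out)


def looks_safe(path: str) -> bool:
    """Naive check (invented for this example): reject parent-directory hops."""
    return ".." not in path


# Two FULLWIDTH FULL STOP characters followed by an ordinary path.
user_path = "\uFF0E\uFF0E/secret.txt"

passed_check = looks_safe(user_path)        # the Unicode string has no ASCII ".."
converted = simulated_best_fit(user_path)   # but the converted bytes do

# The same folding gives one user name several spellings.
fullwidth_admin = "\uFF41\uFF44\uFF4D\uFF49\uFF4E"
same_name = simulated_best_fit(fullwidth_admin) == b"admin"
```

Validating after conversion, or refusing to convert lossily in the first place, avoids this class of surprise.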
It's also worth noting that there are a few rare cases where best fit can happen when decoding data with MultiByteToWideChar, Encoding.GetString or the Decoder class.
For both Windows and Microsoft .NET, the best plan is to use Unicode when possible; either UTF-8 or UTF-16 is usually a good choice. Sometimes it's not possible, usually because of a protocol limitation. Best fit behavior is often a poor choice when hitting a protocol problem, since such protocols are usually explicit and such mappings could cause security holes or protocol violations. In those cases finding extensions or newer protocols that handle Unicode is good, but some, like e-mail headers ;), we're stuck with.
In Windows you can disable the best fit behavior by using the WC_NO_BEST_FIT_CHARS flag. In the framework you can do so by changing the EncoderFallback and DecoderFallback. Encoding.GetEncoding(xxx, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback) or ExceptionFallback are good choices. Note that in .NET 2.0 there is no public "best fit" fallback, only an internal best fit fallback that is used by default, so once you change a class's EncoderFallback or DecoderFallback you cannot easily retrieve the best fit fallback.
If you are aware of the limitations of the fallbacks and want consistent behavior anyway, one option to consider is making your own fallback. I made a prototype fallback that uses Normalization to decompose a string to its component parts. This is particularly nifty because things like the kPa symbol can change to k + P + a. It still doesn't work across all languages though, since ü would still become a u instead of a ue in German. So even though this can be a fun experiment, it's still better to Use Unicode!
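The earlier claim that UTF-8 costs nothing for plain English text is easy to check. The sketch below uses Python's built-in codecs rather than the Windows or .NET ones, and the sample strings are made up for illustration; it only compares byte counts and round-trips, and says nothing about speed.

```python
# Plain English text: UTF-8, ASCII and code page 1252 produce identical bytes.
english = "Best fit should be avoided."
ascii_bytes = english.encode("ascii")
cp1252_bytes = english.encode("cp1252")
utf8_bytes = english.encode("utf-8")
same_bytes = ascii_bytes == cp1252_bytes == utf8_bytes

# Text with a diacritic: UTF-8 spends one extra byte on the u-umlaut.
city = "D\u00fcsseldorf"
cp1252_len = len(city.encode("cp1252"))  # one byte per character
utf8_len = len(city.encode("utf-8"))     # u-umlaut takes two bytes

# Characters that 1252 cannot hold survive a UTF-8 round trip unchanged.
tricky = "\u010b \u221e \uff43"  # c with dot above, infinity, fullwidth c
round_trip_ok = tricky.encode("utf-8").decode("utf-8") == tricky
```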
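For readers who want to see what the two non-best-fit behaviors look like, here is a rough analogy in Python. This is not the .NET API: Python's `errors="replace"` and `errors="strict"` are used as stand-ins for the replacement and exception fallbacks, and the helper names are invented for the example. Python's cp1252 codec does not do best fit at all, so unmappable characters either become ? or raise.

```python
def encode_with_replacement(text: str, codepage: str = "cp1252") -> bytes:
    """Analogous to a replacement fallback: unmappable characters become '?'."""
    return text.encode(codepage, errors="replace")


def encode_or_fail(text: str, codepage: str = "cp1252") -> bytes:
    """Analogous to an exception fallback: unmappable characters raise."""
    return text.encode(codepage, errors="strict")


# c with dot above and infinity are not in 1252, so each turns into '?'
# instead of silently becoming 'c' and '8'.
replaced = encode_with_replacement("\u010b\u221e")

# c with cedilla has its own slot in 1252 and is left alone.
kept = encode_or_fail("\u00e7")

# The strict variant reports the problem instead of hiding it.
try:
    encode_or_fail("\u010b")
    raised = False
except UnicodeEncodeError:
    raised = True
```

Which one fits depends on the caller: replacement keeps the pipeline moving with visibly damaged data, while the exception forces the caller to deal with the unmappable input.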
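The decomposition idea can be sketched in a few lines of Python. This is not the prototype fallback described above, just an illustration of the same approach using `unicodedata` with compatibility decomposition (NFKD); the function name and the choice to drop combining marks are assumptions made for the example.

```python
import unicodedata


def encode_with_decomposition(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text, decomposing characters the code page cannot represent.

    Characters that encode directly are left alone. Others are decomposed
    with NFKD, combining marks are dropped, and any piece that still
    cannot be encoded becomes '?'.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codepage)
            continue
        except UnicodeEncodeError:
            pass
        for part in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(part):
                continue  # drop the accent, keep the base letter
            out += part.encode(codepage, errors="replace")
    return bytes(out)


kpa = encode_with_decomposition("\u33aa")            # SQUARE KPA -> k + P + a
dotted_c = encode_with_decomposition("\u010b")       # c with dot above -> c
infinity = encode_with_decomposition("\u221e")       # no decomposition -> ?
umlaut = encode_with_decomposition("\u00fc", "ascii")  # u-umlaut -> u, not ue
```

The last line shows the limitation already mentioned: with an ASCII target, ü comes out as u rather than ue, so the result is consistent but still not linguistically right.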