Pitfalls of Chinese Conversion (Part 2)

We talked about Kernel32.dll and its LCMapString API in the previous entry, where I showed how to use the API to convert Simplified Chinese characters to Traditional Chinese characters and vice versa, with sample code provided.

A simple test of the LCMapString API reveals one of its limitations. Let’s illustrate it by running the Chinese Converter application:

We type the Traditional Chinese phrase “頭髮” (“hair on the head” in English) in the left text field of the WinForms application and convert it to the corresponding Simplified Chinese characters.

The characters “头发” are displayed in the right text field. It does convert the characters correctly!

Let’s clear the text fields and run another test: we type the Simplified Chinese characters “头发” in the right text field and convert them to Traditional Chinese characters.

It shows “頭發” in the left text field now. It does not convert back to the original Traditional Chinese string “頭髮” as expected!

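The conversion mistake occurs because the mapping between Traditional and Simplified Chinese is not exactly one-to-one (although it is for most characters). Multiple Traditional Chinese characters may map to a single Simplified Chinese character! In this case both “髮” (hair) and “發” (to emit, to develop) are simplified to “发”, so a character-by-character conversion back to Traditional Chinese has to pick one of them without knowing which was meant.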
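To make the many-to-one problem concrete, here is a minimal Python sketch. It does not call LCMapString; the three-entry tables and the function names are hypothetical and only mimic the behaviour observed in the test above.

```python
from collections import defaultdict

# Toy Traditional -> Simplified table (hypothetical, three entries only).
T2S = {"頭": "头", "髮": "发", "發": "发"}

# A character-by-character reverse table can keep only ONE answer per
# Simplified character. We assume it keeps "發", matching the result
# seen in the test above.
S2T_NAIVE = {"头": "頭", "发": "發"}


def to_simplified(text):
    """Map each character through T2S; unknown characters pass through."""
    return "".join(T2S.get(ch, ch) for ch in text)


def to_traditional_naive(text):
    """Map each character through S2T_NAIVE, ignoring context."""
    return "".join(S2T_NAIVE.get(ch, ch) for ch in text)


def reverse_candidates(t2s):
    """List every Traditional character that maps to each Simplified one."""
    candidates = defaultdict(list)
    for trad, simp in t2s.items():
        candidates[simp].append(trad)
    return dict(candidates)


if __name__ == "__main__":
    simplified = to_simplified("頭髮")
    print(simplified)                        # Simplified form of the phrase
    print(to_traditional_naive(simplified))  # round trip loses "髮"
    print(reverse_candidates(T2S))           # "发" has two candidates
```

The round trip fails because the reverse table has nowhere to store the second candidate for “发”.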

In the 1950s, Mainland China began using Simplified Chinese characters to help increase literacy. Simplified character forms were created by decreasing the number of strokes of Traditional Chinese characters. Most of the simplifications are based on popular cursive forms embodying graphic or phonetic simplifications of the traditional forms. However, some of them were simplified irregularly. Of course, a large portion of the characters were not simplified at all, and are thus identical in the Traditional and Simplified Chinese orthographies.

Japan also simplified a number of the Kanji (Chinese characters) used in the Japanese language half a century ago. The old forms are called Kyujitai Kanji (corresponding to Traditional Chinese), and the new forms are called Shinjitai Kanji. In general, the Kanji simplification in Japan was less extensive than the simplification of Chinese in Mainland China. As the simplifications were carried out separately in Mainland China and Japan, some of the Kanji used in Japan now are neither “traditional” nor “simplified”.

All of those characters (Traditional Chinese / Kyujitai Kanji and Simplified Chinese / Shinjitai Kanji) have their own code points in the Unicode standard. The Han Unification process did not merge simplified forms with their traditional counterparts. This was necessary because the linkage between simplified characters and traditional characters is not exactly one-to-one.
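As a quick check, the short Python sketch below prints the code points of the characters from the earlier test. The helper name code_points is made up for this illustration.

```python
def code_points(text):
    """Return the Unicode code point of each character as a 'U+XXXX' string."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # Two Traditional characters and the one Simplified character they share.
    for ch, cp in zip("髮發发", code_points("髮發发")):
        print(ch, cp)
```

Each of the three characters is reported with a different code point, which is why a converter has to map between them explicitly.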

This also means that existing methods of machine conversion between Simplified Chinese and Traditional Chinese may make some mistakes (although fewer when converting from Traditional Chinese to Simplified Chinese). If the system were intelligent enough to translate sentences using the context, the number of mistakes would be reduced!
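One simple way to use context is to look up whole phrases before falling back to single characters. The Python sketch below illustrates that idea with a greedy longest-match loop. It is only an assumption about how such a converter could work, not how LCMapString or any particular product does it, and the tiny PHRASES and CHARS tables are hypothetical.

```python
# Hypothetical phrase table: Simplified phrase -> Traditional phrase.
PHRASES = {"头发": "頭髮", "理发": "理髮", "发展": "發展"}

# Hypothetical single-character fallback table.
CHARS = {"头": "頭", "发": "發"}

MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASES)


def to_traditional(text):
    """Convert Simplified to Traditional, preferring the longest phrase match."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible phrase first, down to two characters.
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched here: fall back to the character table.
            ch = text[i]
            out.append(CHARS.get(ch, ch))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_traditional("头发"))  # phrase match picks the "hair" character
    print(to_traditional("发展"))  # phrase match picks the "develop" character
```

With this approach “头发” converts back to “頭髮”, while “发展” still becomes “發展”. The quality depends entirely on how complete the phrase table is, and a greedy match can still choose the wrong phrase boundary in longer sentences.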