Two Phonetic Scripts: Vietnamese and Korean

I just visited two very interesting countries, Vietnam and Korea. Being actively involved in writing software (mostly RichEdit) for editing the world’s scripts, I was naturally fascinated to see Vietnamese and Korean text displayed in profusion. The Vietnamese and Korean scripts were designed with a common purpose in mind: to enable the languages to be read and written easily by all members of their respective countries. Earlier on, people tried to write Vietnamese and Korean by customizing the Chinese script. But while the Chinese script is well suited to Chinese languages, it’s considerably less suited to Vietnamese and Korean. Accordingly, only a small percentage of the Vietnamese and Korean people were able to read and write their languages using the Chinese script.

In Vietnam in the 1500’s and 1600’s, Portuguese and French missionaries wanted to be able to read and write Vietnamese and to communicate with the Vietnamese people in writing as well as verbally. To this end, they chose a Latin alphabetic script with the letters a..z, đ, â, ă, ê, ô, ơ, ư plus the corresponding upper-case letters and five tone marks ◌̀ ◌́ ◌̃ ◌̉ ◌̣ (grave, acute, tilde, hook above, and dot below, defined in the Unicode U+0300 block). Fully composed, the modified letters and the tone-marked vowels come to a total of 134 characters. This alphabetic script represented the Vietnamese language phonetically. The traditional Chinese orthography continued to be dominant until the early 1900’s, when the alphabetic script took over. In Vietnam today you still see Chinese characters, but mostly on old buildings and manuscripts. The vast majority of Vietnamese text uses the alphabetic script. All 134 characters were encoded in Unicode 1.1 (June, 1993). Initially people used 8-bit code pages such as 1258 to encode the Vietnamese characters. But since Unicode has all the characters, it’s much more efficient to use them.
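The figure of 134 can be checked mechanically. The sketch below assumes the count refers to the fully composed characters outside ASCII: the 12 Vietnamese vowels carrying each of the five tone marks in both cases (120), plus the seven modified letters in both cases (14). The names VOWELS, MODIFIED_LETTERS and vietnamese_precomposed are made up for this illustration; only Python's standard unicodedata module is used.

```python
import unicodedata

# Combining tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = ["\u0300", "\u0301", "\u0303", "\u0309", "\u0323"]

# Normalize the literals so they are precomposed however this file was saved.
VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
MODIFIED_LETTERS = unicodedata.normalize("NFC", "đâăêôơư")


def vietnamese_precomposed():
    """Return the set of fully composed Vietnamese characters outside ASCII."""
    chars = set()
    for base in VOWELS + VOWELS.upper():
        for mark in TONE_MARKS:
            # NFC folds vowel + tone mark into one code point when one exists.
            composed = unicodedata.normalize("NFC", base + mark)
            if len(composed) != 1:
                raise ValueError(f"no precomposed form for {base + mark!r}")
            chars.add(composed)
    # The modified letters without any tone mark, in both cases.
    chars.update(MODIFIED_LETTERS + MODIFIED_LETTERS.upper())
    return chars


print(len(vietnamese_precomposed()))  # 134 under the assumption above
```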

At the moment, Windows uses a vestige of the 1258 approach in that the tone marks are encoded as combining marks in the U+0300 block instead of using the fully composed characters. This requires complex-script shaping, which slows down the display. Admittedly, shaping engines can perform other useful tasks such as kerning and ligature formation, yielding finer typography. And a Vietnamese tone mark applies to a whole syllable, so it doesn’t have to be placed where a fully composed vowel has it. But web sites such as Wikipedia use the fully composed Unicode characters and I suspect that Microsoft will do so too eventually. Back in 1998 I recommended using the fully composed characters when we enhanced RichEdit 3.0 to handle Vietnamese. But the folks back then wanted to stick with Microsoft’s Vietnamese keyboard, which required entering the tone marks explicitly, and that input method didn’t convert the combining-mark sequences to the fully composed characters. Other input methods, e.g., Telex and VNI, have slicker ways to enter Vietnamese characters; they have become popular and they insert fully composed characters. VNI’s option of automagically inserting the accents is particularly intriguing.
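For readers who want to see the difference between the two encodings, here is a minimal sketch of converting combining-mark sequences to fully composed characters. It uses Unicode normalization form NFC from Python's standard unicodedata module; this is one generic way to do the conversion, not a description of what RichEdit or any keyboard does. The function name to_fully_composed is hypothetical.

```python
import unicodedata


def to_fully_composed(text):
    """Fold base letter + combining mark sequences into precomposed characters."""
    return unicodedata.normalize("NFC", text)


# "Việt" in the code page 1258 style: ê followed by a combining dot below.
combining = "Vi\u00ea\u0323t"
composed = to_fully_composed(combining)

print(len(combining), len(composed))          # 5 4
print([f"U+{ord(c):04X}" for c in composed])  # ['U+0056', 'U+0069', 'U+1EC7', 'U+0074']
```

Going the other way with "NFD" decomposes further than code page 1258 does (ê itself becomes e plus a combining circumflex), so it is not an exact inverse of the 1258 style.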

While foreign missionaries were responsible for the Vietnamese script, King Sejong of Korea was responsible for the Korean script. His motivation was essentially the same as the European missionaries’: make it easy for all Koreans to read and write their language. His original script, published in 1446, had only 28 letters, called jamo, of which 24 are still in use, as shown in the following picture taken of an interactive display in the National Palace Museum of Korea in Seoul.


Modern Korean requires more: 19 initial consonants (C), 21 vowels (V) and 27 final consonants (T). The final consonants include most of the initial consonants and add some others. The jamo are displayed in boxes called Hangul syllables. There are 19×21 CV combinations and 19×21×27 CVT combinations for a total of 11172 possible Hangul syllables in modern Korean. The jamo are encoded in the Unicode U+1100 block (C—U+1100..U+1112, V—U+1161..U+1175, T—U+11A8..U+11C2) and the 11172 Hangul syllables are encoded from U+AC00..U+D7A3 in CVT sort order (T varies fastest, C varies slowest).
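Because the syllables are laid out in CVT order, converting between a jamo sequence and a Hangul syllable is plain arithmetic. The sketch below follows the ranges given above and the arithmetic described in the Unicode Standard's section on conjoining jamo; the constant and function names are my own, chosen for illustration.

```python
S_BASE = 0xAC00  # first Hangul syllable
L_BASE = 0x1100  # first initial consonant (C)
V_BASE = 0x1161  # first vowel (V)
T_BASE = 0x11A7  # one before the first final consonant (T); index 0 = no final
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # 27 finals plus "no final"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # 11172


def compose_syllable(initial, vowel, final=None):
    """Combine modern jamo (C, V, optional T) into one Hangul syllable."""
    l_index = ord(initial) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = 0 if final is None else ord(final) - T_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        raise ValueError("initial or vowel is not a modern jamo")
    if final is not None and not 1 <= t_index < T_COUNT:
        raise ValueError("final is not a modern jamo")
    # T varies fastest, C varies slowest.
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


def decompose_syllable(syllable):
    """Split one Hangul syllable into its jamo, returned as a string."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:
        jamo += chr(T_BASE + t_index)
    return jamo


print(S_COUNT)                                             # 11172
print(hex(ord(compose_syllable("\u1112", "\u1161", "\u11ab"))))  # 0xd55c
```

Treating "no final consonant" as a 28th value of T is what lets the CV and CVT syllables share one formula: 19×21×28 = 19×21 + 19×21×27 = 11172.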

If you look at the Unicode U+1100 block, you’ll notice it’s full: 256 jamo! That’s more than 19 + 21 + 27. The major difference is the inclusion of many old Hangul jamo that are not used in Modern Korean. Modern Korean can be handled as a simple script: just use the precomposed Hangul syllables, for which no glyph shaping is needed. In contrast, Old Hangul has many more combinations and needs a shaping engine to place the jamo correctly. The Unicode Standard explains how to do this in Chapter 3, Section 3.12 Conjoining Jamo Behavior.
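One rough way to tell the two cases apart in code is to normalize to NFC and see whether any conjoining jamo are left over. Modern C+V and C+V+T sequences collapse into precomposed syllables, while Old Hangul sequences do not. This is only a heuristic sketch: the function name needs_jamo_shaping is hypothetical, it looks only at the U+1100 block and ignores the later Hangul Jamo Extended blocks, and it also reports an isolated modern jamo as needing jamo rendering.

```python
import unicodedata


def needs_jamo_shaping(text):
    """Return True if conjoining jamo remain after NFC composition."""
    composed = unicodedata.normalize("NFC", text)
    # Anything still in U+1100..U+11FF could not be folded into a syllable.
    return any(0x1100 <= ord(ch) <= 0x11FF for ch in composed)


modern = "\u1112\u1161\u11ab"  # modern jamo sequence for U+D55C
old = "\u1140\u1161"           # starts with an Old Hangul initial consonant

print(needs_jamo_shaping(modern))  # False
print(needs_jamo_shaping(old))     # True
```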

Here is some interesting Unicode Hangul history. Noncombining jamo (U+3130..U+318F) and 2350 Hangul syllables (U+3400..U+3D2D) were part of Unicode 1.0 (October, 1991). Unicode 1.1 (June, 1993) added the modern combining jamo (U+1100 block) and 4306 more Hangul syllables. The Korean government wanted the remaining 11172 – 4306 – 2350 = 4516 syllables of Modern Korean to be added as well and preferably to collect all the syllables in a single block. I had just joined the Unicode Technical Committee (over 20 years ago!) and it seemed to us to be a shame to have the Hangul syllables split up into three blocks. Furthermore, Unicode wasn’t yet used for Korean anywhere as far as we could tell. Windows NT had support for Unicode, but nothing special for Hangul. Other operating systems didn’t even support Unicode at that time. Word processing programs that supported Korean used a Korean code page, not Unicode. S.G. Hong of the Microsoft Korean subsidiary pleaded for us to use a single block and after considerable deliberation the UTC and WG2 (the ISO 10646 working group on character sets) elected to do so. Hence in Unicode 2.0 (July, 1996) the two earlier Hangul blocks were deprecated and the Hangul syllables were assigned U+AC00..U+D7A3 in the ideal alphabetic order. To this day, no one has come up with a Korean document that was compromised by these changes. But you should have heard the outcries of folks who were upset that the old codes were deprecated.

Ever since then, Unicode code points have been completely stable, and such stability is a basic requirement. Shortly after the release of Unicode 2.0, Word 97 was released. Based on Unicode, it supported the modern Hangul syllables. At that point it would have been unthinkable to change the code points, since documents that used them actually existed. Fortunately we were able to make the changes early enough in Unicode’s history that Korea enjoys excellent Unicode support. I couldn’t help but think of that a bit while walking through the streets and palaces of beautiful downtown Seoul.