Han Unification in the Unicode Standard


  • Han unification: The process of assigning the same code point to characters historically perceived as being the same character but represented as unique in more than one East Asian ideographic character standard. This results in a group of ideographs shared by several cultures and significantly reduces the number of code points needed to encode them.
  • Round-trip conversion: Mapping a character from one character encoding to another and back. Of particular interest is how well information is preserved during round-trip conversion.

The "Han" of "Han unification" refers to the Han Dynasty, when Chinese characters were first systematized. It is also the first character in the name for ideographs, 漢字, which is pronounced hanzi in Chinese, kanji in Japanese, and hanja in Korean. Because Japanese and Korean borrowed Chinese characters for their writing systems centuries ago, the three languages share many ideographs, though in each language the same character can have different meanings and several different pronunciations. (See Figure 3-11.)

Character: 生 (Kangxi radical 100; basic meaning: life, to live)

Language  Word/Form  Meaning
Japanese  SEI, SYOO  In kanji compounds (on'yomi readings): life, to live
          [...]      To be born
          U(mare)    Birth, origin
          I(kiru)    To live, to exist, to survive
          I(kasu)    To revive, to bring to life
          NAMA       Raw, uncooked, crude
          KI(no)     Pure, neat, genuine
Chinese   SHENG      To live, life, livelihood, alive; to be born; to bear a child; to cause; uncooked, raw; unfamiliar, strange; untamed, barbarian; a student; surname

Figure 3-11 An ideographic character used in both Chinese and Japanese.

The Unicode Consortium chose to represent shared ideographs only once because the goal of the Unicode Standard was to encode characters independently of the languages that use them. Unicode makes no distinctions based on pronunciation or meaning; higher-level software, such as operating systems and applications, must take on that responsibility. Through Han unification, Unicode assigned about 21,000 code points to ideographic characters instead of the 120,000 that would be required if the Asian languages were treated separately. (See Figure 3-12.) Even though the character in Figure 3-11 has a number of pronunciations and meanings, each Asian national standard encodes it as a single character. The same character might look slightly different in Chinese than in Japanese, but that difference in appearance is a font issue, not a "uniqueness" issue.

Figure 3-12 The national standards for these languages each encode the Han character at their own distinct code point. Unicode considers it to be one and only one character, so in the process of Han unification only one code point was allocated.

In undertaking Han unification, the Unicode Consortium worked closely with experts from various countries, including China, whose GB 13000 standard encompasses all the characters of several Asian standards, according to The Unicode Standard, Volume 2.
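The situation that Figure 3-12 describes can be sketched in a few lines of Python. This is only an illustration: it assumes the character in Figure 3-11 is U+751F, and it uses four of Python's built-in legacy codecs as stand-ins for the national standards, which the text does not name. The helper `legacy_bytes` is a hypothetical name introduced for this example.

```python
import unicodedata

CHAR = "\u751f"  # 生, assumed here to be the character shown in Figure 3-11

# Legacy codecs shipped with Python, used as stand-ins for the national
# standards (an assumption; the text does not name specific standards).
LEGACY_CODECS = ["shift_jis", "gb2312", "big5", "euc_kr"]


def legacy_bytes(char, codecs=LEGACY_CODECS):
    """Return {codec name: hex string of the bytes that encode char}."""
    return {name: char.encode(name).hex() for name in codecs}


encoded = legacy_bytes(CHAR)
for name, hex_bytes in encoded.items():
    # Each legacy byte sequence is different, yet every one of them
    # decodes back to the same single Unicode code point.
    decoded = bytes.fromhex(hex_bytes).decode(name)
    print(f"{name:10} 0x{hex_bytes.upper()} -> U+{ord(decoded):04X}")

# The formal character name carries no language information at all.
print(unicodedata.name(CHAR))
```

Each legacy encoding stores the character under its own byte sequence, while the Unicode side of every mapping is the same code point. Which glyph shape is drawn for that code point is left to the font, as the text notes.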

Some ideographs look very similar but actually have unique meanings and might be drawn with a different stroke order. Such characters generally have separate codes in the Asian national standards. Other characters can be variants of one another, having a slightly different appearance but the same meaning. In either case, if characters are assigned separate codes in the Asian national standards, Unicode assigns them separate codes as well. Preserving these distinctions provides a framework for simple round-trip mappings between Unicode and the various national standards. Unicode also separately encodes about 2,000 Simplified Chinese characters used in the People's Republic of China.
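A rough sketch of why these separate codes matter for round-trip conversion follows. It assumes Python's built-in `shift_jis` and `gb2312` codecs as stand-ins for the Japanese and Chinese national standards, and it uses two variant forms of the character for "sword" that the Japanese standard encodes separately. The helper `round_trips` is a hypothetical name introduced for this example.

```python
def round_trips(text, codec):
    """Return True if text survives Unicode -> codec -> Unicode unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # The target encoding has no mapping for at least one character.
        return False


# Two variant forms of "sword" with separate codes in the Japanese
# standard; Unicode keeps them as separate code points too.
VARIANTS = ["\u5263", "\u528d"]  # 剣, 劍

jis_bytes = [v.encode("shift_jis") for v in VARIANTS]
back = [b.decode("shift_jis") for b in jis_bytes]

print(jis_bytes[0] != jis_bytes[1])  # the legacy codes are distinct
print(back == VARIANTS)              # and the distinction survives the trip

# A Simplified Chinese form and its traditional counterpart are also
# separate code points, so each maps cleanly to its own standard.
SIMPLIFIED, TRADITIONAL = "\u6c49", "\u6f22"  # 汉, 漢
print(round_trips(SIMPLIFIED, "gb2312"))
print(round_trips(SIMPLIFIED, "shift_jis"))
```

If Unicode had merged the two variants into one code point, converting Japanese text to Unicode and back could not tell which of the two legacy codes to restore. Keeping them separate makes the mapping one-to-one in both directions.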