Developing OpenType Fonts for Thai Script

This document presents information that will help font developers create or support OpenType fonts for all Thai script languages covered by the Unicode Standard. Thai script is used to write Thai, as well as other languages such as Kuy, Pali, and Sanskrit.

Introduction

Font developers will learn how to encode complex script features in their fonts, choose character sets, organize font information, and use existing tools to produce Thai fonts. Registered features of the Thai script are defined and illustrated, encodings are listed, and templates are included for compiling Thai layout tables for OpenType fonts.

This document also presents information about the Thai OpenType shaping engine of Uniscribe, the Windows component responsible for text layout.

In addition to being a primer and specification for the creation and support of Thai fonts, this document is intended to more broadly illustrate the OpenType Layout architecture, feature schemes, and operating system support for shaping and positioning text.

Glossary

The following terms are useful for understanding the layout features and script rules discussed in this document.

Base Glyph - Any glyph that can have a diacritic mark above or below it. Layout operations are defined in terms of a base glyph, not a base character, as a ligature may act as the base.

Character - Each character represents a Unicode character code point. For example, the 'ko kai' character is U+0E01.

Combining Mark - A vowel sign or tone mark, positioned above or below a character to provide pronunciation guidance.

Cluster - The effective "unit" of Thai writing systems, consisting of a consonant, vowel signs, combining tone marks, and independent vowel letters.

Glyph - A glyph represents a form of one or more characters.

Shaping Engine

The Uniscribe Thai shaping engine processes text in stages. The stages are:

  1. Analyze characters for valid diacritic combinations
  2. Shape (substitute) glyphs with OTLS (OpenType Layout Services)
  3. Position glyphs with OTLS

The descriptions which follow will help font developers understand the rationale for the Thai feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.

Analyze Characters

The shaping engine receives a sequence of Unicode characters as its unit of shaping. The contextual analysis engine then verifies that the diacritic combinations in that sequence are valid. For additional information, see Invalid Combining Marks.

Shape Glyphs with OTLS

The first step Uniscribe takes in shaping the character string is to map all characters to their nominal form glyphs.

Next, Uniscribe calls OTLS to apply the features. All OTL processing is divided into a set of predefined features (described and illustrated in the Features section). Uniscribe applies the features one at a time to the appropriate glyphs in the syllable, and OTLS processes them. Uniscribe makes as many calls to OTLS as there are features, which ensures that the features are executed in the desired order.

The steps of the shaping process are outlined below. Not all of the features listed apply to all Thai script languages.

Shaping features:

  1. Language forms
    1. Apply feature 'ccmp' to preprocess any glyphs that require composition or decomposition.

Position Glyphs with OTLS

Uniscribe next applies features concerned with positioning, calling functions of OTLS to position glyphs.

Positioning features:

  1. Kerning
    1. Apply feature 'kern' to provide pair kerning between base glyphs requiring adjustment for better typographical quality.
  2. Mark to base
    1. Apply feature 'mark' to position diacritic glyphs to the base glyph.
  3. Mark to mark
    1. Apply feature 'mkmk' to position diacritic glyphs to other diacritic glyphs.
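
The staged process described above can be sketched as a loop that makes one call per feature in a fixed order. The following Python sketch is illustrative only: `apply_feature` is a hypothetical stand-in for a call into OTLS (it is not a real Uniscribe or OTLS function) and simply records the order of calls, and the glyph IDs in the toy cmap are made up.

```python
# Fixed feature order, taken from the shaping and positioning lists above.
SHAPING_FEATURES = ["ccmp"]
POSITIONING_FEATURES = ["kern", "mark", "mkmk"]


def apply_feature(tag, glyphs, log):
    """Hypothetical stand-in for one OTLS call: logs the tag, returns glyphs unchanged."""
    log.append(tag)
    return glyphs


def shape_run(text, cmap):
    """Map characters to nominal glyphs, then apply each feature in order."""
    # Nominal glyph mapping; unmapped characters get glyph 0 (the missing glyph).
    glyphs = [cmap.get(ord(ch), 0) for ch in text]
    log = []
    # One call per feature, substitution features before positioning features.
    for tag in SHAPING_FEATURES + POSITIONING_FEATURES:
        glyphs = apply_feature(tag, glyphs, log)
    return glyphs, log


# Toy cmap with made-up glyph IDs for KO KAI (U+0E01) and MAI EK (U+0E48).
toy_cmap = {0x0E01: 10, 0x0E48: 20}
glyphs, order = shape_run("\u0e01\u0e48", toy_cmap)
print(glyphs, order)
```

Keeping the order in two plain lists makes it obvious that substitution always finishes before positioning starts.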

Invalid Combining Marks

Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle.

Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of Unicode Standard 3.1). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign.

For the fallback mechanism to work properly, a Thai OTL font should contain a glyph for the dotted circle (U+25CC). In case this glyph is missing from the font, the invalid signs will be displayed on the missing glyph shape (white box).

Illustration that shows the dotted circle character.

In addition to the dotted circle, the ZWSP (zero width space; U+200B) is recommended for inclusion in any Thai font. Thai words are not separated by spaces, so the ZWSP can be used to mark word boundaries; its width can 'grow' when text is justified.
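
A font build script could check for these two recommended code points before release. The sketch below assumes the font's character map is available as a plain Python dict from code point to glyph ID; the dict and the function name `missing_recommended` are hypothetical and not part of any font tool.

```python
# Code points this document recommends for every Thai font.
RECOMMENDED = {
    0x25CC: "DOTTED CIRCLE",
    0x200B: "ZERO WIDTH SPACE",
}


def missing_recommended(cmap):
    """Return the recommended code points that the cmap does not cover."""
    return sorted(cp for cp in RECOMMENDED if cp not in cmap)


# Hypothetical cmap that has the dotted circle but lacks the ZWSP.
toy_cmap = {0x0E01: 10, 0x25CC: 5}
for cp in missing_recommended(toy_cmap):
    print(f"missing U+{cp:04X} {RECOMMENDED[cp]}")
```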

If an invalid combination is found, the diacritic that causes the invalid state is placed on a dotted circle to indicate to the user the invalid combination. The shaping engine for non-OpenType fonts will cause invalid mark combinations to overstrike. This is the problem that inserting the dotted circle for the invalid base solves. It should also be noted that the dotted circle is not inserted into the application's backing store. This is a run-time insertion into the glyph array that is returned from the ScriptShape function.

The invalid diacritic logic for Thai is based on the classes listed below. The engine checks that no more than one mark of a given class is placed on the same base.

Class    Description                   Code points
ABOVE1   Above mark closest to base    U+0E31, U+0E34, U+0E35, U+0E36, U+0E37
ABOVE2   Second level above mark       U+0E47, U+0E4D
ABOVE3   Third level above mark        U+0E48, U+0E49, U+0E4A, U+0E4B
ABOVE4   Fourth level above mark       U+0E4C, U+0E4E
BELOW1   Below mark closest to base    U+0E38, U+0E39
BELOW2   Second level below mark       U+0E3A
AM       See note below                U+0E33

Note: The AM character (Sara Am) needs to be broken into two glyphs, and some reordering might be required so that the ring is the base glyph.
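
The class check can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not Uniscribe's actual logic: it treats U+0E01 through U+0E2E as the valid consonant bases, flags only marks that lack a base or that repeat a class on the same base, and ignores the AM class. The function name `insert_dotted_circles` is hypothetical. As in the description above, the input string (the backing store) is left untouched and a new sequence is returned.

```python
DOTTED_CIRCLE = "\u25cc"

# Mark classes from the table above (AM is not handled in this sketch).
MARK_CLASSES = {
    "ABOVE1": "\u0e31\u0e34\u0e35\u0e36\u0e37",
    "ABOVE2": "\u0e47\u0e4d",
    "ABOVE3": "\u0e48\u0e49\u0e4a\u0e4b",
    "ABOVE4": "\u0e4c\u0e4e",
    "BELOW1": "\u0e38\u0e39",
    "BELOW2": "\u0e3a",
}
CLASS_OF = {ch: name for name, chars in MARK_CLASSES.items() for ch in chars}


def is_consonant(ch):
    # Assumption: U+0E01..U+0E2E are the valid consonant bases.
    return "\u0e01" <= ch <= "\u0e2e"


def insert_dotted_circles(text):
    """Return a copy of text with U+25CC placed before each invalid mark."""
    out = []
    has_base = False
    seen = set()  # mark classes already used on the current base
    for ch in text:
        cls = CLASS_OF.get(ch)
        if cls is None:
            # Not a mark: start a new cluster; only consonants count as a base.
            has_base = is_consonant(ch)
            seen = set()
        elif not has_base or cls in seen:
            # Invalid mark: the dotted circle becomes its base.
            out.append(DOTTED_CIRCLE)
            has_base = True
            seen = {cls}
        else:
            seen.add(cls)
        out.append(ch)
    return "".join(out)


# A lone MAI EK gets a dotted circle; KO KAI + MAI EK is left alone.
print(insert_dotted_circles("\u0e48") == "\u25cc\u0e48")
print(insert_dotted_circles("\u0e01\u0e48") == "\u0e01\u0e48")
```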

Features

The features listed below have been defined to create the basic forms for the languages that are supported on Thai systems. Regardless of the model an application chooses for supporting layout of complex scripts, Uniscribe requires a fixed order for executing features within a run of text to consistently obtain the proper basic form. This is achieved by calling features one-by-one in the standard order listed below.

The order of the lookups within each feature is also very important. For more information on lookups and defining features in OpenType fonts, see the Encoding section of the OpenType Font Development document.

The standard order for applying Thai features encoded in OpenType fonts is shown below. Not all of the features listed apply to all Thai script languages.

Feature  Feature function                                   Layout operation  Required
Language based forms:
ccmp     Character composition/decomposition substitution   GSUB
Positioning features:
kern     Pair kerning                                       GPOS
mark     Mark to base positioning                           GPOS              X
mkmk     Mark to mark positioning                           GPOS              X
[GSUB = glyph substitution, GPOS = glyph positioning]

Feature examples

Character composition (and decomposition)

Feature Tag: "ccmp"

The 'ccmp' feature is used to compose a number of glyphs into one glyph, or decompose one glyph into a number of glyphs. This feature is applied before any other feature because a font vendor may want to control the shaping of certain glyphs before later features run. The examples below show how it is used. The 'ccmp' feature can map default alphabetic forms to a composed form (essentially a ligature, GSUB lookup type 4) or to decomposed forms (GSUB lookup type 2).

Example: Use the 'ccmp' feature to decompose the Sara Am for correct mark positioning.
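
The effect of this decomposition can be sketched at the character level in Python. In a font the substitution operates on glyphs through a GSUB lookup, so this is only an analogy. The decomposition of U+0E33 into U+0E4D (the ring, NIKHAHIT) and U+0E32 (SARA AA) matches the Unicode compatibility decomposition. Moving the ring in front of a preceding tone mark is this sketch's reading of the reordering mentioned earlier, and the function name is hypothetical.

```python
import unicodedata

SARA_AM = "\u0e33"
NIKHAHIT = "\u0e4d"  # the ring
SARA_AA = "\u0e32"
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"  # the ABOVE3 class

# Unicode gives SARA AM a compatibility decomposition into ring + SARA AA.
assert unicodedata.normalize("NFKD", SARA_AM) == NIKHAHIT + SARA_AA


def decompose_sara_am(text):
    """Split each SARA AM and move the ring before any preceding tone marks."""
    out = []
    for ch in text:
        if ch != SARA_AM:
            out.append(ch)
            continue
        # Walk back over the run of tone marks directly before SARA AM.
        i = len(out)
        while i > 0 and out[i - 1] in TONE_MARKS:
            i -= 1
        out.insert(i, NIKHAHIT)  # ring now sits next to the base
        out.append(SARA_AA)
    return "".join(out)


# NO NU + MAI THO + SARA AM becomes NO NU + ring + MAI THO + SARA AA.
result = decompose_sara_am("\u0e19\u0e49\u0e33")
print([f"U+{ord(c):04X}" for c in result])
```

With the ring placed directly after the consonant, the 'mark' feature can attach it to the base and 'mkmk' can attach the tone mark to the ring.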

Example: Use the 'ccmp' feature to decompose (or remove) the below mark on the Yo Ying character when the Yo Ying is followed by a below combining vowel mark such as the Sara U or Sara UU. The 'mark' feature can then be used to position the below vowel mark correctly.

Kerning

Feature Tag: "kern"

The 'kern' feature is used to adjust the amount of space between glyphs, generally to provide optically consistent spacing between glyphs. Although a well-designed typeface has consistent inter-glyph spacing overall, some glyph combinations require adjustment for improved legibility. Besides standard adjustment in either horizontal or vertical direction, this feature can supply size-dependent kerning data via device tables, "cross-stream" kerning in the Y text direction, and adjustment of glyph placement independent of the advance adjustment. Note that this feature would not be used in monospaced fonts.

The font stores a set of adjustments for pairs of glyphs (GPOS lookup type 2 or 8). These may be stored as one or more tables matching left and right classes, and/or as individual pairs. If both forms are used, the classes should be listed last, so as to provide a means to replace any non-ideal values that may result from the class tables. Additional adjustments may be provided for larger sets of glyphs (e.g., triplets, quadruplets, etc.) to overwrite the results of pair kerns in particular combinations. These should precede the pairs.
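
The ordering rule for pairs and classes can be modeled as a first-match search. The Python sketch below assumes that the first table containing a match supplies the value; the glyph names, class names, and kerning values are all made up for illustration, and `kern_adjustment` is a hypothetical helper rather than an API of any font library.

```python
def kern_adjustment(left, right, pair_kerns, left_class, right_class, class_kerns):
    """Return the kerning value for a glyph pair, checking pairs before classes."""
    # Individual pairs are listed first, so they override class-based values.
    if (left, right) in pair_kerns:
        return pair_kerns[(left, right)]
    # Fall back to the class table; unknown glyphs or class pairs get no adjustment.
    key = (left_class.get(left), right_class.get(right))
    return class_kerns.get(key, 0)


# Made-up data: one exception pair and one class pair.
pair_kerns = {("glyphA", "glyphB"): -10}
left_class = {"glyphA": "L1", "glyphC": "L1"}
right_class = {"glyphB": "R1"}
class_kerns = {("L1", "R1"): -40}

# The exception pair wins over the class value; glyphC falls through to the class.
print(kern_adjustment("glyphA", "glyphB", pair_kerns, left_class, right_class, class_kerns))
print(kern_adjustment("glyphC", "glyphB", pair_kerns, left_class, right_class, class_kerns))
```

Listing the class table last is what lets a small set of individual pairs replace class values that are not ideal for particular combinations.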

Mark to base positioning

Feature Tag: "mark"

The 'mark' feature positions mark glyphs in relation to a base glyph, or a ligature glyph. This feature may be implemented as a MarkToBase Attachment lookup (GPOS LookupType = 4) or a MarkToLigature Attachment lookup (GPOS LookupType = 5).

Screenshot of a dialog in Microsoft VOLT for specifying positioning adjustments. Anchor attachment is selected as the lookup type. A mark glyph is shown positioned above a base glyph using an anchor point.
Positioning mark to base using Microsoft VOLT
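
Anchor attachment reduces to simple arithmetic: the mark is moved so that its anchor point coincides with the anchor point on the base. The Python sketch below shows that calculation in font units. The coordinates are invented, the function name is hypothetical, and advance widths are deliberately ignored.

```python
def attach_mark(base_origin, base_anchor, mark_anchor):
    """Return the mark origin that makes the mark anchor land on the base anchor.

    base_anchor is given relative to the base glyph origin,
    mark_anchor relative to the mark glyph origin.
    """
    bx, by = base_origin
    return (
        bx + base_anchor[0] - mark_anchor[0],
        by + base_anchor[1] - mark_anchor[1],
    )


# Invented values: base anchor above the consonant, mark anchor under the mark.
origin = attach_mark(base_origin=(0, 0), base_anchor=(300, 600), mark_anchor=(-150, 580))
print(origin)
```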

Mark to mark positioning

Feature Tag: "mkmk"

The 'mkmk' feature positions mark glyphs in relation to another mark glyph. This feature may be implemented as a MarkToMark Attachment lookup (GPOS LookupType = 6).

Screenshot of a dialog in Microsoft VOLT for specifying positioning adjustments. Anchor attachment is selected as the lookup type. A mark glyph is shown positioned above another mark glyph using an anchor point.
Positioning mark to mark using Microsoft VOLT
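
Stacking several marks can be viewed as a chain of attachments: the first mark attaches to the base, and each later mark attaches to the mark beneath it. The following Python sketch models that chain with invented anchor coordinates. The two-anchor representation of each mark and the name `stack_marks` are assumptions made for illustration.

```python
def stack_marks(base_anchor, marks):
    """Return the origin of each mark in the base glyph's coordinate space.

    marks is a list of (attach_anchor, top_anchor) pairs, each expressed
    relative to that mark's own origin. attach_anchor is where the mark
    attaches to what lies below it; top_anchor is where the next mark attaches.
    """
    origins = []
    target = base_anchor  # attachment point for the next mark
    for attach_anchor, top_anchor in marks:
        origin = (target[0] - attach_anchor[0], target[1] - attach_anchor[1])
        origins.append(origin)
        # The next mark attaches to this mark's top anchor.
        target = (origin[0] + top_anchor[0], origin[1] + top_anchor[1])
    return origins


# Two identical marks with invented anchors, stacked over one base.
mark = ((0, 580), (0, 800))
origins = stack_marks((300, 600), [mark, mark])
print(origins)
```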

Appendix

Appendix A: Writing System Tags

Features are encoded according to both a designated script and language system. The language system tag specifies a typographic convention associated with a language or linguistic subgroup. For example, there are different language systems defined for the Thai script, such as Thai, Kuy, Pali, and Sanskrit.

Currently, the Uniscribe engine supports only the "default" language for each script. However, font developers may want to build language-specific features, which are supported in other applications and will be supported in future Microsoft OpenType implementations.

  • NOTE: It is strongly recommended to include the "dflt" language tag in all OpenType fonts because it defines the basic script handling for a font. The "dflt" language system is used as the default if no other language specific features are defined or if the application does not support that particular language. If the "dflt" tag is not present for the script being used, the font may not work in some applications.
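
The fallback behavior in this note can be expressed in a few lines of Python. This is a sketch of the rule as stated here, not of any particular application's implementation; the function name and the sets of tags are hypothetical.

```python
def select_language_system(available, requested):
    """Pick the requested language system, else "dflt", else None."""
    if requested in available:
        return requested
    if "dflt" in available:
        return "dflt"
    # No default: the font may not work for this script in some applications.
    return None


print(select_language_system({"dflt", "PAL "}, "PAL "))
print(select_language_system({"dflt", "PAL "}, "KUY "))
print(select_language_system({"PAL "}, "KUY "))
```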

The following tables list the registered tag names for scripts and language systems.

Registered tags for the Thai script

Script tag  Script
"thai"      Thai

Registered tags for Thai language systems

Language system tag  Language
"dflt"               *default script handling
"KUY "               Kuy
"PAL "               Pali
"SAN "               Sanskrit
"THA "               Thai

Note: Both the script and language tags are case sensitive (script tags should be lowercase, language tags are all caps) and must contain four characters (i.e., you must add a space to the three-character language tags).
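
A build script can normalize tags to this form before writing them into a font. The Python sketch below pads short tags with trailing spaces and applies the case convention described in the note; the helper names are hypothetical, and "dflt" is passed through unchanged because it is the one lowercase language system tag listed above.

```python
def _pad(tag):
    """Pad a tag to four characters with trailing spaces."""
    if not 1 <= len(tag) <= 4:
        raise ValueError(f"tag must have 1 to 4 characters: {tag!r}")
    return tag.ljust(4)


def script_tag(name):
    """Script tags are lowercase, e.g. "thai"."""
    return _pad(name.lower())


def language_tag(name):
    """Language system tags are uppercase, except the default tag "dflt"."""
    if name == "dflt":
        return name
    return _pad(name.upper())


print(repr(script_tag("Thai")))
print(repr(language_tag("kuy")))
print(repr(language_tag("dflt")))
```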