Legacy Arabic Font Encodings

There is a class of legacy Arabic fonts that rely on poorly documented mechanisms and on legacy shaping implementations in Windows. This topic provides some details about this class of fonts.

These fonts date to the 1990s. They are TrueType fonts in the sense that they are .ttf files that generally follow the TrueType specification. However, they rely on details that are not defined in the TrueType or OpenType specifications, and they do not conform to current specifications for how to implement Unicode fonts for Arabic.

Font encoding and character set declarations

These fonts use Windows Symbol encoding: that is, they have a cmap subtable for platform ID 3, encoding ID 0. By itself, the Windows Symbol encoding implies no interoperable character semantics; the semantics are font-specific. As a result, the cmap subtable does not establish Arabic character semantics, as it would in a Unicode-encoded font. Instead, the semantics are declared using special values within the OS/2 table that are not documented in the TrueType or OpenType specification.

Note that some of these fonts may also include cmap subtables for other platforms and encodings. Those should not be used, however. The fonts were created in an era in which single-byte encodings were used, and these alternate cmap subtables are likely to be redefinitions of the declared encodings. For example, the Royal Arabic font includes a Mac Roman cmap subtable, but the glyphs that are mapped do not match the Mac Roman character set.
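As an illustration of selecting only the Windows Symbol subtable, the following is a minimal sketch that scans the encoding records of a raw cmap table for platform ID 3, encoding ID 0. It assumes the caller has already loaded the cmap table into memory; the function name find_symbol_cmap_subtable and the convention of returning 0 when no such subtable exists are hypothetical choices made for this example, not part of any existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Font tables store integers in big-endian byte order. */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: returns the offset (from the start of the cmap table)
 * of the platform 3 / encoding 0 subtable, or 0 if there is none.
 * Subtables for any other platform or encoding are deliberately ignored. */
uint32_t find_symbol_cmap_subtable(const uint8_t *cmap, size_t length)
{
    assert(cmap != NULL);
    if (length < 4)
        return 0;

    /* cmap header: uint16 version, uint16 numTables, then 8-byte records. */
    uint16_t num_tables = read_u16(cmap + 2);
    for (uint16_t i = 0; i < num_tables; i++) {
        size_t rec = 4 + (size_t)i * 8;
        if (rec + 8 > length)
            break;  /* truncated table */

        uint16_t platform_id = read_u16(cmap + rec);
        uint16_t encoding_id = read_u16(cmap + rec + 2);
        if (platform_id == 3 && encoding_id == 0) {
            uint32_t offset = read_u32(cmap + rec + 4);
            return offset < length ? offset : 0;
        }
    }
    return 0;
}
```

A font that has only, say, a Mac Roman subtable would yield 0 here, which a caller could treat as "not a member of this class of fonts".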

These fonts include a version 0 OS/2 table. Version 0 includes the fsSelection field, which is a uint16 value with various bits defined as flags. Only a limited set of bits (bits 0 to 6) is defined; the remaining bits are documented as reserved and to be set to 0. In this class of fonts, however, values are set in the reserved upper byte of the fsSelection field. These non-standardized values are the way in which the character semantics are declared.

The values set in the upper byte of the fsSelection field are not bit flags, unlike the defined bits of the field. Instead, the upper byte is set to one of the following constant values:

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

The first of these corresponds to the ARABIC_CHARSET constant defined in wingdi.h for use in the lfCharSet member of the LOGFONT structure. The second is used in some fonts, but is not defined in wingdi.h.
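To make the declaration mechanism concrete, here is a minimal sketch of how the upper byte of fsSelection might be inspected. It assumes the caller has already read fsSelection from the OS/2 table as a host-order uint16 value; the LegacyArabicEncoding type and the function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

/* Hypothetical result type for this sketch. */
typedef enum {
    LEGACY_ARABIC_NONE,
    LEGACY_ARABIC_SIMPLIFIED,
    LEGACY_ARABIC_TRADITIONAL
} LegacyArabicEncoding;

/* Classify a font by the value stored in the reserved upper byte of
 * fsSelection. The lower byte holds the ordinary style flags and is
 * ignored here. */
LegacyArabicEncoding classify_fs_selection(uint16_t fs_selection)
{
    uint8_t upper = (uint8_t)(fs_selection >> 8);

    if (upper == ARABIC_CHARSET_SIMPLIFIED)
        return LEGACY_ARABIC_SIMPLIFIED;
    if (upper == ARABIC_CHARSET_TRADITIONAL)
        return LEGACY_ARABIC_TRADITIONAL;
    return LEGACY_ARABIC_NONE;
}

/* Small self-check illustrating the expected results. */
void classify_fs_selection_check(void)
{
    /* 0xB2 == 178; bit 6 (REGULAR) is also set in the lower byte. */
    assert(classify_fs_selection(0xB240) == LEGACY_ARABIC_SIMPLIFIED);
    /* 0xB3 == 179 */
    assert(classify_fs_selection(0xB300) == LEGACY_ARABIC_TRADITIONAL);
    /* A conforming font leaves the reserved bits at 0. */
    assert(classify_fs_selection(0x0040) == LEGACY_ARABIC_NONE);
}
```

Because the upper byte is compared as a whole value rather than tested bit by bit, a font that happens to set some other reserved bit is not misclassified.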

Note: There is a correspondence between CHARSET values in Windows GDI and code pages, and code pages are referenced in the ulCodePageRange fields of the OS/2 table (version 1 and later). However, the ulCodePageRange fields are used to indicate logical character sets that are supported in the font, but say nothing about actual character encodings used in the font. For this class of fonts, the ulCodePageRange fields are not relevant.

Note: A similar use of the upper byte of the fsSelection field is known to have been made in legacy Hebrew and Thai fonts as well.

The nature of the font encodings

In terms of Unicode characters, the Arabic characters supported by this class of fonts are a subset of the range U+0620 to U+065F.

Each of these legacy font encodings is a presentation-form encoding. That is, all presentation-form glyphs are mapped directly by some character code in the cmap table. The presentation forms will include the basic contextual shapes of Arabic letters: isolate, initial, medial, final. In addition, there are other presentation forms for certain ligatures; these are not documented here in detail.
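Because the encodings are presentation-form encodings, software using them has to choose the contextual shape itself. The following is a simplified sketch of that choice, not the algorithm used by any particular Windows implementation. It assumes text in logical order and a caller-supplied joining class for each letter; the JoinClass and ArabicShape types and the assign_shapes function are hypothetical, and refinements such as transparent combining marks are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical joining classes for this sketch. */
typedef enum {
    JOIN_NONE,   /* does not join (e.g. hamza, space, non-Arabic) */
    JOIN_RIGHT,  /* joins only to the preceding letter (e.g. alef, dal) */
    JOIN_DUAL    /* joins to both neighbours (e.g. beh, seen) */
} JoinClass;

/* The four basic contextual shapes. */
typedef enum {
    SHAPE_ISOLATE = 0,
    SHAPE_INITIAL = 1,
    SHAPE_MEDIAL  = 2,
    SHAPE_FINAL   = 3
} ArabicShape;

/* Assign a contextual shape to each letter of a logical-order run. */
void assign_shapes(const JoinClass *classes, size_t count, ArabicShape *shapes)
{
    assert(count == 0 || (classes != NULL && shapes != NULL));

    for (size_t i = 0; i < count; i++) {
        /* Joins backward if the previous letter can join forward. */
        bool joins_prev = i > 0 &&
                          classes[i] != JOIN_NONE &&
                          classes[i - 1] == JOIN_DUAL;
        /* Joins forward if this letter is dual-joining and the next joins. */
        bool joins_next = i + 1 < count &&
                          classes[i] == JOIN_DUAL &&
                          classes[i + 1] != JOIN_NONE;

        if (joins_prev && joins_next)
            shapes[i] = SHAPE_MEDIAL;
        else if (joins_prev)
            shapes[i] = SHAPE_FINAL;
        else if (joins_next)
            shapes[i] = SHAPE_INITIAL;
        else
            shapes[i] = SHAPE_ISOLATE;
    }
}
```

For a word such as beh, alef, beh (classes dual, right, dual), this yields initial, final, isolate: the alef joins to the beh before it but cannot join to the letter after it.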

Note: There is nothing in the fonts themselves that determines the encoding order of text data that might be displayed with these fonts. It is certainly possible that these fonts have been used in combination with some legacy applications that support Arabic content encoded in visual rather than logical order.

As noted above, the fonts use Windows Symbol cmap subtables. In most fonts that use Windows Symbol encoding, the character index values are in the range 0xF020 to 0xF0FF. (In legacy, single-byte applications, these would be mapped to a code point range 0x20 to 0xFF.) In this class of fonts, however, different ranges are assumed for the two encoding declarations:

  • ARABIC_CHARSET_SIMPLIFIED: 0xF100 to 0xF1FF
  • ARABIC_CHARSET_TRADITIONAL: 0xF200 to 0xF2FF

These ranges are reflected in the usFirstCharIndex and usLastCharIndex fields of the fonts’ OS/2 tables.
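The relationship between a single-byte legacy code and the character index in the Windows Symbol cmap can be sketched as follows. The base values come from the ranges listed above; the function names, and the choice to fall back to the ordinary 0xF000 symbol base for any other charset value, are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

/* Start of the 256-entry block of cmap character indices used for the
 * given charset value (the upper byte of fsSelection). */
uint16_t symbol_index_base(uint8_t charset)
{
    switch (charset) {
    case ARABIC_CHARSET_SIMPLIFIED:  return 0xF100;
    case ARABIC_CHARSET_TRADITIONAL: return 0xF200;
    default:                         return 0xF000;  /* ordinary symbol font */
    }
}

/* Map a single-byte legacy code to a Windows Symbol cmap character index. */
uint16_t legacy_code_to_cmap_index(uint8_t charset, uint8_t code)
{
    return (uint16_t)(symbol_index_base(charset) + code);
}

/* Check an index against the usFirstCharIndex and usLastCharIndex values
 * read from the font's OS/2 table. */
bool index_in_os2_range(uint16_t index, uint16_t first, uint16_t last)
{
    assert(first <= last);
    return index >= first && index <= last;
}
```

For example, legacy code 0x41 corresponds to character index 0xF141 in a font declaring ARABIC_CHARSET_SIMPLIFIED and to 0xF241 in a font declaring ARABIC_CHARSET_TRADITIONAL.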

Unicode to legacy font mappings

Code snippets are provided below that give mappings from Unicode characters to code points in the legacy font encodings. As mentioned, the font encodings are presentation-form encodings. Hence, these are mappings for contextual forms of the Unicode characters. The mapping logically assumes four different contextual shapes for each Arabic letter, though in some cases the same presentation-form code point is mapped for more than one context — for example, one legacy presentation form code point for both isolate and initial contexts.

The following constants are assumed in these code snippets:

#define NUM_ARABIC_LETTER_TABLES  0x0004
#define U_ARABIC_SCRIPT_COUNT     0x40

Legacy Simplified Arabic mapping

The following provides the mapping for ARABIC_CHARSET_SIMPLIFIED:

/************************************************************************************
                  S I M P L I F I E D    A R A B I C
*************************************************************************************/

// index for invalid glyph is 0x00
// These are based on starting index 0xF100
const USHORT cpOldTTFSimpArabicShapes[NUM_ARABIC_LETTER_TABLES][U_ARABIC_SCRIPT_COUNT] =
{
  // Isolate shapes
  {
    0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xba , 0x41 ,  // 0x620
    0x4a , 0xa9 , 0x4c , 0x4e , 0x51 , 0x54 , 0x57 , 0x58,

    0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 ,  // 0x630
    0x6a , 0x6e , 0x72 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa4 ,  // 0x640
    0xa5 , 0xac , 0xa8 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Initial shapes
  {
    0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xae , 0x41 ,  // 0x620
    0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,

    0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 ,  // 0x630
    0x6a , 0x6b , 0x6f , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x73 , 0x76 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa1 ,  // 0x640
    0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Medial shapes
  {
    0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xae , 0x42 ,  // 0x620
    0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,

    0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 ,  // 0x630
    0x6a , 0x6c , 0x70 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x74 , 0x77 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa2 ,  // 0x640
    0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Final shapes
  {
    0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xaf , 0x42 ,  // 0x620
    0x4a , 0xaa , 0x4c , 0x4e , 0x50 , 0x53 , 0x56 , 0x58,

    0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 ,  // 0x630
    0x6a , 0x6d , 0x71 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa3 ,  // 0x640
    0xa5 , 0xab , 0xa7 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },
};

Legacy Traditional Arabic mapping

The following provides the mapping for ARABIC_CHARSET_TRADITIONAL:

/************************************************************************************
                   T R A D I T I O N A L    A R A B I C
 *************************************************************************************/

// index for invalid glyph is 0x00
// These are based on starting index 0xF200
const USHORT cpOldTTFTradArabicShapes[NUM_ARABIC_LETTER_TABLES][U_ARABIC_SCRIPT_COUNT] =
{
  // Isolate shapes
  {
    0x00 , 0xd5 , 0x45 , 0x43 , 0xda , 0x47 , 0xd9 , 0x41 ,  // 0x620
    0x4c , 0xd1 , 0x50 , 0x54 , 0x58 , 0x60 , 0x64 , 0x65,

    0x67 , 0x69 , 0x6b , 0x70 , 0x74 , 0x78 , 0x7e , 0xa2 ,  // 0x630
    0xa3 , 0xaa , 0xae , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0xb2 , 0xb6 , 0xba , 0xbe , 0xc2 , 0xc6 , 0xca ,  // 0x640
    0xcb , 0xd4 , 0xd0 , 0xe7 , 0xe8 , 0xeb , 0xe4 , 0xe5,

    0xea , 0xe9 , 0xe6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Initial shapes
  {
    0x00 , 0xd5 , 0x45 , 0x43 , 0xda , 0x47 , 0xd6 , 0x41 ,  // 0x620
    0x49 , 0xd1 , 0x4d , 0x51 , 0x55 , 0x59 , 0x61 , 0x65,

    0x67 , 0x69 , 0x6b , 0x6d , 0x71 , 0x75 , 0x79 , 0x7f ,  // 0x630
    0xa3 , 0xa7 , 0xab , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0xaf , 0xb3 , 0xb7 , 0xbb , 0xbf , 0xc3 , 0xc7 ,  // 0x640
    0xcb , 0xd4 , 0xcd , 0xe7 , 0xe8 , 0xeb , 0xe4 , 0xe5,

    0xea , 0xe9 , 0xe6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

  },

  // Medial shapes
  {
    0x00 , 0xd5 , 0x46 , 0x44 , 0xdb , 0x48 , 0xd7 , 0x42 ,  // 0x620
    0x4a , 0xd1 , 0x4e , 0x52 , 0x56 , 0x5a , 0x62 , 0x66,

    0x68 , 0x6a , 0x6c , 0x6e , 0x72 , 0x76 , 0x7a , 0xf1 ,  // 0x630
    0xa4 , 0xa8 , 0xac , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0xb0 , 0xb4 , 0xb8 , 0xbc , 0xc0 , 0xc4 , 0xc8 ,  // 0x640
    0xcc , 0xd4 , 0xce , 0xe7 , 0xe8 , 0xeb , 0xe4 , 0xe5,

    0xea , 0xe9 , 0xe6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Final shapes
  {
    0x00 , 0xd5 , 0x46 , 0x44 , 0xdb , 0x48 , 0xd8 , 0x42 ,  // 0x620
    0x4b , 0xd2 , 0x4f , 0x53 , 0x57 , 0x5c , 0x63 , 0x66,

    0x68 , 0x6a , 0x6c , 0x6f , 0x73 , 0x77 , 0x7c , 0xa1 ,  // 0x630
    0xa5 , 0xa9 , 0xad , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0xb1 , 0xb5 , 0xb9 , 0xbd , 0xc1 , 0xc5 , 0xc9 ,  // 0x640
    0xcc , 0xd3 , 0xcf , 0xe7 , 0xe8 , 0xeb , 0xe4 , 0xe5,

    0xea , 0xe9 , 0xe6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },
};