Legacy Arabic Font Encodings

There is a class of legacy Arabic fonts that rely on poorly documented mechanisms and on legacy shaping implementations in Windows. This topic provides some details about this class of fonts.

These fonts date to the 1990s. These are TrueType fonts, in the sense that they are .ttf files that, in general, follow the TrueType spec. However, they make use of undocumented details not defined in the TrueType or OpenType specs, and do not conform to current specifications for how to implement Unicode fonts for Arabic.

Font encoding and character set declarations

These fonts use Windows Symbol encoding: that is, they have a cmap subtable for platform ID 3, encoding ID 0. By itself, the Windows Symbol encoding implies no interoperable character semantics; the semantics are font-specific. Contrary to what would normally be expected, Arabic character semantics are not established in the cmap subtable. Instead, they are declared using special values within the OS/2 table that are not documented in the TrueType or OpenType specification.

Note that some of these fonts may also include cmap subtables for other platforms and encodings. Those should not be used, however. The fonts were created in an era in which single-byte encodings were used, and these alternate cmap subtables are likely to be redefinitions of the declared encodings. For example, the Royal Arabic font includes a Mac Roman cmap subtable, but the glyphs that are mapped do not match the Mac Roman character set.
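The subtable selection described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the raw bytes of the cmap table are already in memory, and the function name find_symbol_cmap_subtable and the helper names are hypothetical. It relies only on the standard cmap header layout (version, numTables, then 8-byte encoding records holding platformID, encodingID and a 32-bit offset, all big-endian).

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit value (OpenType data is big-endian). */
static uint16_t read_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a big-endian 32-bit value. */
static uint32_t read_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: returns the offset, from the start of the cmap
 * table, of the subtable for platform ID 3 / encoding ID 0 (Windows
 * Symbol). Returns 0 if no such subtable is found or the data is
 * truncated. Subtables for other platforms are deliberately ignored. */
uint32_t find_symbol_cmap_subtable(const uint8_t *cmap, size_t length)
{
    if (cmap == NULL || length < 4)
        return 0;

    uint16_t num_tables = read_u16(cmap + 2);
    for (uint16_t i = 0; i < num_tables; i++) {
        size_t rec = 4 + (size_t)i * 8;          /* encoding record i */
        if (rec + 8 > length)
            return 0;
        uint16_t platform_id = read_u16(cmap + rec);
        uint16_t encoding_id = read_u16(cmap + rec + 2);
        uint32_t offset      = read_u32(cmap + rec + 4);
        if (platform_id == 3 && encoding_id == 0)
            return offset < length ? offset : 0;
    }
    return 0;
}
```

A caller would then parse the subtable at the returned offset and skip any Mac or other subtables, since those are not reliable for this class of fonts.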

These fonts include a version 0 OS/2 table. Version 0 includes the fsSelection field, which is a uint16 value with various bits defined as flags. Only a limited set of bits (bits 0 to 6) is defined; the remaining bits are documented as reserved and are to be set to 0. In this class of fonts, however, values are set in the reserved upper byte of the fsSelection field. These non-standardized values are how the character semantics are declared.

The values set in the upper byte of the fsSelection field are not bit flags, unlike the rest of the fsSelection field. Instead, the upper byte is set to one of the following constant values:

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

The first of these corresponds to the ARABIC_CHARSET constant defined in wingdi.h for use in the lfCharSet member of the LOGFONT structure. The second is used in some fonts, but is not defined in wingdi.h.
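As an illustration of how the declaration might be read, the following sketch extracts the upper byte of an fsSelection value and compares it against the two constants. The enum and the function name classify_fs_selection are hypothetical names introduced for this example; the sketch assumes the caller has already read fsSelection from the OS/2 table as a host-order 16-bit value.

```c
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

/* Hypothetical result type for this sketch. */
typedef enum {
    LEGACY_ARABIC_NONE,         /* upper byte holds neither constant */
    LEGACY_ARABIC_SIMPLIFIED,
    LEGACY_ARABIC_TRADITIONAL
} LegacyArabicKind;

/* The upper byte is compared as a whole value, not tested bit by bit.
 * The lower byte (the documented flag bits) is ignored here. */
LegacyArabicKind classify_fs_selection(uint16_t fs_selection)
{
    uint8_t upper = (uint8_t)(fs_selection >> 8);

    switch (upper) {
    case ARABIC_CHARSET_SIMPLIFIED:  return LEGACY_ARABIC_SIMPLIFIED;
    case ARABIC_CHARSET_TRADITIONAL: return LEGACY_ARABIC_TRADITIONAL;
    default:                         return LEGACY_ARABIC_NONE;
    }
}
```

For example, an fsSelection value of 0xB240 (upper byte 178, with the REGULAR flag in bit 6) would be classified as the Simplified encoding under this sketch.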

Note: There is a correspondence between CHARSET values in Windows GDI and code pages, and code pages are referenced in the ulCodePageRange fields of the OS/2 table (version 1 and later). However, the ulCodePageRange fields are used to indicate logical character sets that are supported in the font, but say nothing about actual character encodings used in the font. For this class of fonts, the ulCodePageRange fields are not relevant.

Note: The upper byte of the fsSelection field is known to have been used in a similar way for legacy Hebrew and Thai fonts as well.

The nature of the font encodings

In terms of Unicode characters, the Arabic characters supported by this class of fonts are a subset of the range U+0620 to U+065F.

Each of these legacy font encodings is a presentation-form encoding. That is, all presentation-form glyphs are mapped directly by some character code in the cmap table. The presentation forms will include the basic contextual shapes of Arabic letters: isolate, initial, medial, final. In addition, there are other presentation forms for certain ligatures; these are not documented here in detail.

Note: There is nothing in the fonts themselves that determines the encoded nature of text data that might be displayed with these fonts. Some legacy applications might generate documents using these fonts with contextual forms of Arabic letters represented directly in the documents. Some legacy applications might generate documents with Arabic text encoded in visual rather than logical order. Other legacy applications might encode Arabic text in logical order using one character code per Arabic letter, with the software resolving visual order and selecting contextual forms and ligatures from the font using character codes of the presentation-form encoding.

As noted above, the fonts use Windows Symbol cmap subtables. In most fonts that use Windows Symbol encoding, the character index values are in the range 0xF020 to 0xF0FF. In this class of fonts, however, different ranges are assumed for the two encoding declarations:

  • ARABIC_CHARSET_SIMPLIFIED: 0xF100 to 0xF1FF
  • ARABIC_CHARSET_TRADITIONAL: 0xF200 to 0xF2FF

These ranges are reflected in the usFirstCharIndex and usLastCharIndex fields of the fonts’ OS/2 tables.

Note: In legacy, single-byte applications, these 16-bit ranges would be mapped to a single-byte code point range 0x20 to 0xFF in documents. For example, when ARABIC_CHARSET_SIMPLIFIED is set in the font’s OS/2 table, the code 0x45 in a document would be mapped to 0xF145 when searching in the font’s cmap table.
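The mapping from a single-byte document code to a cmap character code can be sketched as a simple base-plus-offset computation. The function name legacy_byte_to_cmap_code is hypothetical, and the sketch assumes the charset value has already been taken from the upper byte of fsSelection.

```c
#include <stdint.h>

#define ARABIC_CHARSET_SIMPLIFIED   178
#define ARABIC_CHARSET_TRADITIONAL  179

/* Hypothetical helper: maps a single-byte code to the 16-bit code used
 * to search the Windows Symbol cmap subtable. Returns 0 when the
 * charset value is not one of the two legacy Arabic declarations. */
uint16_t legacy_byte_to_cmap_code(uint8_t charset, uint8_t byte)
{
    uint16_t base;

    if (charset == ARABIC_CHARSET_SIMPLIFIED)
        base = 0xF100;
    else if (charset == ARABIC_CHARSET_TRADITIONAL)
        base = 0xF200;
    else
        return 0;

    return (uint16_t)(base + byte);
}
```

With ARABIC_CHARSET_SIMPLIFIED, the byte 0x45 yields 0xF145, matching the example in the note; with ARABIC_CHARSET_TRADITIONAL, the same byte yields 0xF245.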

Legacy encoding details

This section documents the semantic interpretation of code points in these legacy encodings. This is expressed as a mapping from Unicode characters to corresponding legacy codes for different contextual forms (if relevant).

Legacy Traditional Arabic encoding

The mapping from Unicode to the legacy Traditional Arabic encoding is provided in two tables:

  • The first table covers Arabic letters, which have joining presentation form variants. The legacy encoding includes ligature forms for certain Arabic letter sequences as separate characters; separate rows are included in the table for each letter sequence. The column headings indicate whether there are left or right connecting strokes: "initial" implies a left connection, "medial" implies left and right connections, "final" implies a right connection, and "isolate" implies no connections.

  • The second table covers other Arabic-script characters and joining controls. This includes Arabic combining marks. The legacy encoding includes ligature forms for certain mark combinations as separate characters; separate rows are included for each mark combination.

For most marks or mark combinations, the legacy encoding has high and low positional variants encoded as separate characters. (This is similar to use of the 'mset' OpenType feature to substitute positional-variant glyphs, except in the legacy encoding the substitution is done for pre-determined cases and handled by character codes.) In the second table, these positional-variant characters are listed in the same row.

Arabic letters and letter sequences (ligatures)

Unicode   Character or sequence   Legacy encoding
(legacy codes are listed in column order: Initial, Medial, Final, Isolate)
0621   hamza F2D5
0622   alef with madda above F246 F245
0623   alef with hamza above F244 F243
0624   waw with hamza above F2DB F2DA
0625   alef with hamza below F248 F247
0626   yeh with hamza above F2D6 F2D7 F2D8 F2D9
0627   alef F242 F241
0628   beh F249 F24A F24B F24C
0628 + 062C   beh jeem F280
0628 + 062D   beh hah F281
0628 + 062E   beh khah F282
0628 + 0631   beh reh F215
0628 + 0645   beh meem F296 F202
0628 + 0646   beh noon F292
0628 + 064A   beh yeh F21D
0629   teh marbuta F2D2 F2D1
062A   teh F24D F24E F24F F250
062A + 062C   teh jeem F283
062A + 062D   teh hah F284
062A + 062E   teh khah F285
062A + 0631   teh reh F216
062A + 0645   teh meem F297 F203
062A + 0646   teh noon F293
062A + 064A   teh yeh F21E
062B   theh F251 F252 F253 F254
062B + 0645   theh meem F204
062C   jeem F255 F256 F257 F258
062C + 0645   jeem meem F29A
062D   hah F259 F25A F25C F260
062D + 0645   hah meem F29B
062E   khah F261 F262 F263 F264
062E + 0645   khah meem F29C
062F   dal F266 F265
0630   thal F268 F267
0631   reh F26A F269
0632   zain F26C F26B
0633   seen F26D F26E F26F F270
0633 + 0645   seen meem F218
0634   sheen F271 F272 F273 F274
0634 + 0645   sheen meem F219
0635   sad F275 F276 F277 F278
0636   dad F279 F27A F27C F27E
0637   tah F27F F2F1 F2A1 F2A2
0638   zah F2A3 F2A4 F2A5 F2A6
0639   ain F2A7 F2A8 F2A9 F2AA
063A   ghain F2AB F2AC F2AD F2AE
0641   feh F2AF F2B0 F2B1 F2B2
0641 + 064A   feh yeh F29F
0642   qaf F2B3 F2B4 F2B5 F2B6
0643   kaf F2B7 F2B8 F2B9 F2BA
0644   lam F2BB F2BC F2BD F2BE
0644 + 0622   lam alef madda above F2E1 F2E0
0644 + 0623   lam alef hamza above F2DF F2DE
0644 + 0625   lam alef hamza below F2E3 F2
0644 + 0627   lam alef F2DD F2DC
0644 + 062C   lam jeem F286 F212
0644 + 062D   lam hah F287 F213
0644 + 062E   lam khah F288 F214
0644 + 0644 + 0647   lam lam heh F201
0644 + 0645   lam meem F29D F205
0644 + 0645 + 062C   lam meem jeem F211
0644 + 0645 + 062D   lam meem hah F210
0644 + 0647   lam heh F21A
0644 + 0649   lam alef maksura F295
0644 + 064A   lam yeh F21C
0645   meem F2BF F2C0 F2C1 F2C2
0645 + 062C   meem jeem F289
0645 + 062D   meem hah F28A
0645 + 062E   meem khah F28B
0645 + 0645   meem meem F29E
0646   noon F2C3 F2C4 F2C5 F2C6
0646 + 062C   noon jeem F28C
0646 + 062D   noon hah F28D
0646 + 062E   noon khah F28E
0646 + 0645   noon meem F298 F206
0646 + 064A   noon yeh F21F
0647   heh F2C7 F2C8 F2C9 F2CA
0648   waw F2CC F2CB
0649   alef maksura F2D3 F2D4
064A   yeh F2CD F2CE F2CF F2D0
064A + 062C   yeh jeem F28F
064A + 062D   yeh hah F290
064A + 062E   yeh khah F291
064A + 0631   yeh reh F217
064A + 0645   yeh meem F299
064A + 0646   yeh noon F294
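One way to hold rows of this table in code is a small array of records keyed by Unicode code point. The sketch below is illustrative only: the type and function names are hypothetical, and it includes just four sample rows (beh, seen, lam, meem), chosen because they list all four forms, so the column assignment is unambiguous.

```c
#include <stddef.h>
#include <stdint.h>

/* Index order follows the table's column order. */
typedef enum { FORM_INITIAL, FORM_MEDIAL, FORM_FINAL, FORM_ISOLATE } JoinForm;

typedef struct {
    uint16_t unicode;    /* Unicode code point of the letter */
    uint16_t forms[4];   /* legacy codes, indexed by JoinForm */
} TradArabicRow;

/* Sample rows copied from the table above; not the full encoding. */
static const TradArabicRow kTradSample[] = {
    { 0x0628, { 0xF249, 0xF24A, 0xF24B, 0xF24C } },  /* beh  */
    { 0x0633, { 0xF26D, 0xF26E, 0xF26F, 0xF270 } },  /* seen */
    { 0x0644, { 0xF2BB, 0xF2BC, 0xF2BD, 0xF2BE } },  /* lam  */
    { 0x0645, { 0xF2BF, 0xF2C0, 0xF2C1, 0xF2C2 } },  /* meem */
};

/* Returns the legacy code for the given letter and form, or 0 if the
 * letter is not in the sample table. */
uint16_t trad_lookup(uint16_t unicode, JoinForm form)
{
    size_t count = sizeof kTradSample / sizeof kTradSample[0];

    for (size_t i = 0; i < count; i++) {
        if (kTradSample[i].unicode == unicode)
            return kTradSample[i].forms[form];
    }
    return 0;
}
```

For instance, trad_lookup(0x0628, FORM_MEDIAL) returns 0xF24A, the medial beh in the table. Ligature rows would need a separate structure keyed by a sequence of code points.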

Other characters and character sequences (ligatures)

Unicode   Character or sequence   Legacy encoding
060C   Arabic comma F22C
061B   Arabic semicolon F23B
061F   Arabic question mark F23F
0640   tatweel F25F
064B   fathatan F2E7 (high) or F2F5 (extra high)
064C   dammatan F2E8 (high) or F2F6 (extra high)
064D   kasratan F2EB (low) or F2F9 (extra low)
064E   fatha F2E4 (high) or F2F2 (extra high)
064F   damma F2E5 (high) or F2F3 (extra high)
0650   kasra F2EA (low) or F2F8 (extra low)
0651   shadda F2E9 (high) or F2F7 (extra high)
0651 + 064B   shadda fathatan F2EE (high) or F2FC (extra high)
0651 + 064C   shadda dammatan F2EF (high) or F2FD (extra high)
0651 + 064D   shadda kasratan F2FF
0651 + 064E   shadda fatha F2EC (high) or F2FA (extra high)
0651 + 064F   shadda damma F2ED (high) or F2FB (extra high)
0651 + 0650   shadda kasra F2F0 (high) or F2FE (extra high)
0652   sukun F2E6 (high) or F2F4 (extra high)
0660   Arabic-Indic digit zero F230
0661   Arabic-Indic digit one F231
0662   Arabic-Indic digit two F232
0663   Arabic-Indic digit three F233
0664   Arabic-Indic digit four F234
0665   Arabic-Indic digit five F235
0666   Arabic-Indic digit six F236
0667   Arabic-Indic digit seven F237
0668   Arabic-Indic digit eight F238
0669   Arabic-Indic digit nine F239
066B   Arabic decimal separator F25E
200C   zero width non-joiner F20C
200D   zero width joiner F20D
200E   left-to-right mark F20E
200F   right-to-left mark F20F
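The positional variants of marks could be represented as pairs of codes, with the choice between them left to the caller. The sketch below uses a few rows from the table above; the names are hypothetical, and how a shaping implementation decided between the two variants is not documented here, so that decision is simply passed in as a flag.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t unicode;   /* Unicode code point of the mark */
    uint16_t normal;    /* "high" or "low" variant */
    uint16_t extra;     /* "extra high" or "extra low" variant */
} TradMarkRow;

/* Sample rows copied from the table above; not the full encoding. */
static const TradMarkRow kTradMarks[] = {
    { 0x064B, 0xF2E7, 0xF2F5 },  /* fathatan */
    { 0x064E, 0xF2E4, 0xF2F2 },  /* fatha    */
    { 0x0650, 0xF2EA, 0xF2F8 },  /* kasra    */
    { 0x0651, 0xF2E9, 0xF2F7 },  /* shadda   */
    { 0x0652, 0xF2E6, 0xF2F4 },  /* sukun    */
};

/* Returns the legacy code for a mark, or 0 if it is not in the sample
 * table. use_extra is an assumption of this sketch: nonzero selects
 * the extra high / extra low variant. */
uint16_t trad_mark_code(uint16_t unicode, int use_extra)
{
    size_t count = sizeof kTradMarks / sizeof kTradMarks[0];

    for (size_t i = 0; i < count; i++) {
        if (kTradMarks[i].unicode == unicode)
            return use_extra ? kTradMarks[i].extra : kTradMarks[i].normal;
    }
    return 0;
}
```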

Legacy Simplified Arabic encoding

The mapping from Unicode to the legacy Simplified Arabic encoding is presented by means of a code snippet. Four arrays are used for isolate, initial, medial and final forms. This mapping logically assumes four different contextual shapes for every Arabic character, though in some cases the same presentation-form code point is mapped for more than one context — for example, one legacy presentation form code point for both isolate and initial contexts.

Note: This mapping data does not reflect ligatures for Arabic letter combinations that are directly encoded in the legacy Simplified Arabic encoding.

#define NUM_ARABIC_LETTER_TABLES  0x0004
#define U_ARABIC_SCRIPT_COUNT     0x40

/************************************************************************************
                  S I M P L I F I E D    A R A B I C
*************************************************************************************/

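// An entry of 0x00 means no glyph is mapped (invalid glyph).
// Entries are offsets from the starting index 0xF100.
// Tables are ordered isolate, initial, medial, final; each table is
// indexed by (Unicode code point - 0x0620).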
const USHORT cpOldTTFSimpArabicShapes[NUM_ARABIC_LETTER_TABLES][U_ARABIC_SCRIPT_COUNT] =
{
  // Isolate Shapes
  {
    0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xba , 0x41 ,  // 0x620
    0x4a , 0xa9 , 0x4c , 0x4e , 0x51 , 0x54 , 0x57 , 0x58,

    0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 ,  // 0x630
    0x6a , 0x6e , 0x72 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa4 ,  // 0x640
    0xa5 , 0xac , 0xa8 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Initial shapes
  {
    0x00 , 0xad , 0x45 , 0x43 , 0xbb , 0x47 , 0xae , 0x41 ,  // 0x620
    0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,

    0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 ,  // 0x630
    0x6a , 0x6b , 0x6f , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x73 , 0x76 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa1 ,  // 0x640
    0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Medial shapes
  {
    0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xae , 0x42 ,  // 0x620
    0x49 , 0xa9 , 0x4b , 0x4d , 0x4f , 0x52 , 0x55 , 0x58,

    0x59 , 0x5a , 0x60 , 0x61 , 0x63 , 0x65 , 0x67 , 0x69 ,  // 0x630
    0x6a , 0x6c , 0x70 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x74 , 0x77 , 0x79 , 0x7b , 0x7d , 0x7f , 0xa2 ,  // 0x640
    0xa5 , 0xac , 0xa6 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },

  // Final shapes
  {
    0x00 , 0xad , 0x46 , 0x44 , 0xbb , 0x48 , 0xaf , 0x42 ,  // 0x620
    0x4a , 0xaa , 0x4c , 0x4e , 0x50 , 0x53 , 0x56 , 0x58,

    0x59 , 0x5a , 0x60 , 0x62 , 0x64 , 0x66 , 0x68 , 0x69 ,  // 0x630
    0x6a , 0x6d , 0x71 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,

    0x5f , 0x75 , 0x78 , 0x7a , 0x7c , 0x7e , 0xe1 , 0xa3 ,  // 0x640
    0xa5 , 0xab , 0xa7 , 0xc7 , 0xc8 , 0xcb , 0xc4 , 0xc5,

    0xca , 0xc9 , 0xc6 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 ,  // 0x650
    0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00,
  },
};
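A lookup against this array might be sketched as follows. The function name simp_arabic_cmap_code and the SHAPE_* constants are hypothetical; the table is passed in as a parameter so that the sketch stands alone, and the array above can be supplied for it, given that USHORT is an unsigned short on Windows.

```c
#include <stdint.h>

#define NUM_ARABIC_LETTER_TABLES  0x0004
#define U_ARABIC_SCRIPT_COUNT     0x40

/* Table order used by the array: isolate, initial, medial, final. */
enum { SHAPE_ISOLATE = 0, SHAPE_INITIAL = 1, SHAPE_MEDIAL = 2, SHAPE_FINAL = 3 };

/* Returns the cmap character code (0xF100 + table entry) for a Unicode
 * code point in the range U+0620 to U+065F and a contextual shape.
 * Returns 0 if the shape or code point is out of range, or if the
 * table entry is 0x00 (no glyph mapped). */
uint16_t simp_arabic_cmap_code(
    const unsigned short shapes[NUM_ARABIC_LETTER_TABLES][U_ARABIC_SCRIPT_COUNT],
    int shape, uint32_t unicode)
{
    if (shape < 0 || shape >= NUM_ARABIC_LETTER_TABLES)
        return 0;
    if (unicode < 0x0620 || unicode >= 0x0620 + U_ARABIC_SCRIPT_COUNT)
        return 0;

    unsigned short entry = shapes[shape][unicode - 0x0620];
    return entry ? (uint16_t)(0xF100 + entry) : 0;
}
```

With the array above, simp_arabic_cmap_code(cpOldTTFSimpArabicShapes, SHAPE_FINAL, 0x0622) would give 0xF146, the final form of alef with madda above. Choosing which shape applies to each letter in running text is a separate step that this sketch does not cover.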