The Code-Page Model

02/06/2008

Glossary

ANSI: (1) Acronym for the American National Standards Institute. (2) The Microsoft Windows ANSI character set, essentially ISO 8859/x plus additional characters, which was originally based on an ANSI draft standard.
ASCII: Acronym for American Standard Code for Information Interchange, a 7-bit code that is the United States (US) national variant of ISO 646.
Encoding: A system of assigning numeric values to characters.
Code point, or code element: (1) The minimum bit combination that can represent a unit of encoded text for processing or exchange. (2) An index into a code page.
Extended characters: (1) Characters above the ASCII range (0 through 127) in Windows-based single-byte character sets. (2) Accented characters.

Once upon a time, PCs spoke mainly English and a few Western European languages. Microsoft's operating system at that time, MS-DOS, supported 256 characters. Each character was represented by a unique, 1-byte numeric value. Take a look at the picture of the code page in Appendix H: you will see the 26 letters of the English alphabet (in both uppercase and lowercase forms), punctuation marks, Greek letters, line-drawing characters (which allow MS-DOS–based applications to draw boxes), and a few accented characters.

As computers became more widely used and new languages had to be accommodated, MS-DOS gained support for additional code pages that included different accented characters, which are an integral part of many languages. In each of these code pages, the characters numbered 32 through 127 (hex 0x20 through 0x7F) were identical and were taken from the 7-bit ASCII set. Every code page therefore supported English and any other language that uses the same Latin alphabet, such as Hawaiian and Indonesian. The characters numbered 128 through 255 (hex 0x80 through 0xFF) were called extended characters and varied from code page to code page. The set of extended characters determined which other languages the code page could support.
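
The split between the shared range and the extended range can be seen by decoding the same bytes under several code pages. The following is a minimal Python sketch, not part of the original text. It assumes that Python's built-in codecs (cp437, cp850, cp866, cp1252) are reasonable stand-ins for the code pages discussed here, although the versions that shipped at the time may differ in detail. The names CODE_PAGES and decode_in_each are hypothetical, chosen only for illustration.

```python
# Code pages to compare. The selection is illustrative only:
# two Western MS-DOS pages, one Cyrillic MS-DOS page, one Windows page.
CODE_PAGES = ["cp437", "cp850", "cp866", "cp1252"]


def decode_in_each(raw):
    """Decode the same byte string under every code page in CODE_PAGES."""
    return {cp: raw.decode(cp) for cp in CODE_PAGES}


# Bytes 0x20 through 0x7F: the range that is identical in every code page.
shared = bytes(range(0x20, 0x80))
shared_results = decode_in_each(shared)

# Byte 0x82: an extended character, so its meaning depends on the code page.
extended_results = decode_in_each(b"\x82")

# The shared range decodes to a single, common string.
print(len(set(shared_results.values())))

# The extended byte decodes to a different character in some code pages.
for cp, text in extended_results.items():
    print(cp, hex(ord(text)), text)
```

Under these codecs the byte 0x82 is an accented Latin letter in the two Western MS-DOS pages, a Cyrillic letter in cp866, and a punctuation mark in cp1252, while every byte from 0x20 through 0x7F decodes identically in all four.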

More MS-DOS code pages appeared with the Arabic, Far East, and Russian editions of MS-DOS. Every time a new language script required character support, a new code page was created. Although the MS-DOS code pages seem limited to us now, they were an improvement over the 7-bit ASCII standard, which didn't include any accented characters at all.

Windows 95 also uses code pages, but not the same ones as MS-DOS. Windows doesn't need to bother with the line-drawing characters that MS-DOS supports, so beginning with Windows 1 the code page designers replaced those characters with publishing characters. The character set that both Windows 3.1 and Windows 95 use to support Western European languages is referred to as Latin 1, or ANSI¹. The local code pages are pictured in Appendix H. As with the MS-DOS character sets, characters numbered 32 through 127 are the same for each single-byte code page.
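
The replacement of line-drawing characters with publishing characters can be illustrated by decoding a few extended bytes under an MS-DOS code page and a Windows code page. This is a hedged sketch, not part of the original text. It assumes that Python's cp437 and cp1252 codecs approximate the MS-DOS and Windows Latin 1 code pages described here, and the sample bytes are chosen only for illustration.

```python
import unicodedata

# Four extended bytes, chosen for illustration.
SAMPLE = bytes([0xC4, 0xCD, 0x93, 0x94])

# The same bytes interpreted under an MS-DOS and a Windows code page.
dos_text = SAMPLE.decode("cp437")
win_text = SAMPLE.decode("cp1252")

# Unicode character names make the difference in purpose visible.
dos_names = [unicodedata.name(ch) for ch in dos_text]
win_names = [unicodedata.name(ch) for ch in win_text]

for byte, dos_name, win_name in zip(SAMPLE, dos_names, win_names):
    print(f"0x{byte:02X}  cp437: {dos_name:<45} cp1252: {win_name}")
```

In this sample, 0xC4 and 0xCD are box-drawing characters under cp437 but accented capital letters under cp1252, while 0x93 and 0x94 are accented lowercase letters under cp437 but typographic double quotation marks under cp1252.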

1. The character sets that Windows 95 uses were introduced with Windows 3.1. This book will refer to these as Windows, or local, code pages.
