Unicode

Glossary

Unicode: A fixed-width, 16-bit worldwide character encoding developed, maintained, and promoted by the Unicode Consortium, a nonprofit computer industry organization.
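The fixed-width design can be illustrated with a short Python sketch (offered here only as an illustration): characters drawn from several different scripts each map to a single 16-bit code value, where `ord` returns a character's Unicode code value.

```python
# Each character, whatever its script, corresponds to one 16-bit value
# (0x0000 through 0xFFFF) in the fixed-width Unicode design.
for ch in "AéΩЖ字":
    code = ord(ch)
    assert code <= 0xFFFF          # fits in a single 16-bit unit
    print(f"U+{code:04X}  {ch}")
```

A Latin letter, an accented Western European letter, a Greek letter, a Cyrillic letter, and a Chinese ideograph are all handled uniformly; no code-page switching is involved.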

We've seen that writing programs that handle Windows double-byte character sets is not as convenient as writing programs that handle single-byte character sets. Part of the complexity stems from the hacked nature of the double-byte scheme. The first code pages weren't designed with expansion in mind. Therefore, every time a new writing system requires support, a new code page (some more complex than others) must be created—and a code page is only the first step toward having something that can be used to create documents.

The importance of using a universal character encoding is demonstrated in the following scenario. Suppose a professor of ancient languages wants to transcribe some Egyptian hieroglyphics electronically so that she can easily share them with colleagues all over the world. The professor convinces Microsoft to create "Windows for Mummies," complete with a new code page and customized applets, keyboard drivers, sorting routines, and a possible API extension. In the time that it would take to develop this new edition of Windows as well as to create useful applications based on it, the hieroglyphics that the professor wants to transcribe might deteriorate from exposure to pollution. Just as the professor feels an urgency about finding an appropriate Windows edition before further work becomes impossible, many users of languages not currently supported by Windows might feel that, as they wait for editions they can use, their languages continue to die out. Realize that creating a brand-new language edition takes time and is a commitment that must be justified by market demand.

The professor's main reason for using software is that she believes that "Windows for Mummies" will allow a quantum leap in the study of hieroglyphic text. Even if the professor had a hieroglyphic edition of Windows, what if some like-minded colleagues didn't use Windows but rather used an operating system that had a completely different hieroglyphic code-page design? Customized converters would have to translate documents from one code-page standard to the other. What if several colleagues used Windows but did not have the hieroglyphic edition? Windows-based hieroglyphic documents would still be unreadable to them.

In the international arena, the ability to share information from a variety of writing systems in a straightforward manner will be increasingly important, especially for applications such as large databases. Take, for instance, a hypothetical European agency based in Belgium that wants to set up a directory to communicate with its French, Greek, Hungarian, and Russian clients. The agency's only computer runs the French edition of Windows 3.1, which is based on the Latin 1, or ANSI, code page for Western European languages. Thus it does not support the Greek and Russian alphabets or certain Hungarian accented characters. Some names will have to be romanized and others will be spelled with whatever characters are available. In the past this might have been acceptable, but in the future it will not be. People want their names to be spelled correctly, and online solutions require them to be spelled consistently. It's difficult to retrieve archived information using a name that has been transliterated in a dozen different ways.
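The agency's predicament can be sketched in a few lines of Python, using hypothetical client names. Attempting to encode Greek, Russian, or double-acute Hungarian characters under the Latin 1 character set (Windows code page 1252) simply fails:

```python
# Latin 1 (Windows code page 1252) covers Western European characters,
# but not Greek, Cyrillic, or Hungarian's double-acute accented letters.
# These client names are invented for illustration.
names = ["Dubois", "Παπαδόπουλος", "Петров", "Erdős"]
for name in names:
    try:
        name.encode("cp1252")
        print(f"{name}: representable in Latin 1")
    except UnicodeEncodeError:
        print(f"{name}: NOT representable in Latin 1")
```

Only the French name survives; the others must be romanized or mangled, exactly the situation the text describes.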

The way Windows 95 handles multiple character sets (discussed in Chapter 6) begins to address this problem. It is a step in the right direction, but it is not the ultimate long-term solution. It supports display of multilingual text through the use of big fonts, but it still relies on the code page model, which creates special issues for multiuser environments such as networks. Network software must keep track of which client is using which code page, and it has to convert text and filenames as files are moved. This is a problem when files are copied back to the first client; character information can be lost during the round-trip. For this reason, Windows 95's networking software uses Unicode to communicate with systems that can understand Unicode.
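The round-trip loss described above can be sketched in Python: Greek text stored under the Greek code page (1253) and force-converted for a Western European client on code page 1252 cannot be recovered when the file is copied back.

```python
# Round-trip sketch: Greek text converted to a Western European client's
# code page loses its characters; copying the file back cannot restore them.
original = "Αθήνα"                         # Greek for "Athens"
greek_bytes = original.encode("cp1253")    # stored under the Greek code page
western = greek_bytes.decode("cp1253").encode("cp1252", errors="replace")
round_trip = western.decode("cp1252")
print(round_trip)                          # "?????" - the Greek is gone
assert round_trip != original
```

Every Greek letter is replaced with a placeholder, which is why network software that shuffles text between code pages must either track each client's code page carefully or, as Windows 95's networking software does, fall back on Unicode.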

With all of its limitations, why was the code page model chosen for Windows 95? There were several reasons, including size and memory constraints, an existing code base, and an ambitious product schedule. Another important reason was compatibility with the Central and Eastern European edition of Windows, which supported similar capabilities. Windows 95's multilingual support can be easily added to existing single-byte applications, which is particularly important for Europe.

Windows NT, on the other hand, is a high-end operating system built from the ground up, and it is setting the stage for the future. With Unicode as its character-encoding standard, Windows NT 3.5 can display any data encoded in Unicode as long as it has access to the appropriate fonts. The conversions, mappings, and other complications of multiple code-page scenarios are unnecessary. The next version of Windows NT will combine the extended keyboard and font support of Windows 95 with the Unicode support already available in Windows NT 3.5.
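The payoff of a single character encoding can be sketched in Python (again using the invented client names from the directory example): one Unicode string holds names from all of the agency's client languages at once, and a 16-bit little-endian encoding of the kind Windows NT uses internally round-trips the text without loss.

```python
# One Unicode string mixes scripts that no single code page could hold;
# encoding it as 16-bit units (UTF-16, little-endian) is lossless.
directory_entry = "Dubois Παπαδόπουλος Петров Erdős"
utf16 = directory_entry.encode("utf-16-le")
assert utf16.decode("utf-16-le") == directory_entry   # nothing lost
print(len(directory_entry), "characters,", len(utf16), "bytes")
```

No conversions or per-client code-page bookkeeping are needed; each character simply occupies one 16-bit unit.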