Compatibility Issues in Mixed Environments


  • Mixed environment: A computer environment, usually a network, in which the operating systems of different machines are based on different character encodings.

As with other new technologies, it will take some time before Unicode emerges as the predominant character-encoding standard. Even though the first operating system based on Unicode/ISO 10646—Windows NT—shipped the same year the standard was published, other operating systems are developing and adapting more slowly. There will be a lengthy transition period during which a lot of existing data and older software will be unable to take advantage of Unicode, but this new standard will ultimately revolutionize the way the software industry represents text. The impact of Unicode will be comparable in magnitude to the impact of ASCII 30 years ago.

For this reason, compatibility between Unicode and other character-encoding standards is crucial. Unicode's first 256 characters correspond one-to-one with the Western European Windows ANSI code page, the only exception being the range 0x80 through 0x9F, which Unicode reserves for the ISO C1 control codes. The first 256 Unicode characters have exactly the same layout as ISO 8859/1, which served as the basis for the Western European Windows ANSI (or Latin 1) code page. Any character encoded in the Windows code pages, including those of the Far East editions, can be represented in Unicode. Compatibility is achieved through mapping and conversion. For example, Windows NT and Windows 95 carry tables that map characters between Unicode and local code pages.
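The layout relationship described above can be checked with a short script. This is only a sketch and is not part of the original text. It assumes Python's built-in "latin-1" and "cp1252" codecs as stand-ins for ISO 8859/1 and the Western European Windows code page, and the helper name differs_from_unicode_layout is hypothetical.

```python
# ISO 8859/1 (Python codec name "latin-1") maps byte N to Unicode code point N.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in latin1_text] == list(range(256))


def differs_from_unicode_layout(byte_value: int) -> bool:
    """Return True if cp1252 does not map this byte to the same code point."""
    try:
        return ord(bytes([byte_value]).decode("cp1252")) != byte_value
    except UnicodeDecodeError:
        # A few bytes in 0x80-0x9F are unassigned in Python's cp1252 table.
        return True


# Collect every byte whose Windows Latin 1 meaning departs from the
# first 256 Unicode code points.
different = [b for b in range(256) if differs_from_unicode_layout(b)]

# Byte 0x93 is a printable left double quotation mark in cp1252,
# but a C1 control code in ISO 8859/1 and Unicode.
left_quote = bytes([0x93]).decode("cp1252")
```

With these codecs, the bytes that differ fall entirely within 0x80 through 0x9F, which is the exception the paragraph describes.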

Because Microsoft Windows NT Workstation and Microsoft Windows NT Advanced Server contain full support for Unicode, and Windows 95 contains only partial support, mixed environments with a Unicode-based server and non-Unicode clients are probable. In such a scenario, data passed between client and server must be converted. Rather than require a Unicode server to understand all possible local code pages, conversion is the responsibility of the client. Each client carries tables that map between its local code page and the corresponding subset of Unicode. There is no need for code-page information to be part of the network protocol.
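The division of labor described above can be sketched as follows. This is an illustration under stated assumptions, not the Windows NT implementation: UnicodeServer and CodePageClient are hypothetical names, and Python codec names ("cp1250" for Windows Latin 2, "cp852" for MS-DOS Latin 2) stand in for each client's local conversion tables.

```python
class UnicodeServer:
    """Hypothetical server: stores Unicode strings, knows nothing of code pages."""

    def __init__(self):
        self._store = {}

    def put(self, key, text):
        self._store[key] = text

    def get(self, key):
        return self._store[key]


class CodePageClient:
    """Hypothetical client that converts at the boundary using its own tables."""

    def __init__(self, server, code_page):
        self.server = server
        self.code_page = code_page  # codec name standing in for the local code page

    def send(self, key, local_bytes):
        # Local code page -> Unicode before the data leaves the client.
        self.server.put(key, local_bytes.decode(self.code_page))

    def receive(self, key):
        # Unicode -> local code page; unmappable characters become "?".
        return self.server.get(key).encode(self.code_page, errors="replace")


server = UnicodeServer()
windows_client = CodePageClient(server, "cp1250")  # Windows Latin 2
dos_client = CodePageClient(server, "cp852")       # MS-DOS Latin 2

name = "Dvo\u0159\u00e1k"  # the Czech name "Dvorak" with its diacritics
windows_client.send("composer", name.encode("cp1250"))

windows_bytes = windows_client.receive("composer")
dos_bytes = dos_client.receive("composer")
```

The two clients end up with different byte sequences for the same text, yet nothing identifying a code page ever passes through the server interface. Only Unicode strings cross it.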

Whereas it is always possible to convert non-Unicode data to Unicode, it is not always possible to accomplish the reverse. For example, if a client is running the Czech edition of Windows 3.1 (based on the Windows Latin 2 code page), any data it sends to a Windows NT server will be stored in Unicode format. If the Windows NT server then sends the data to a client that's running the Swedish edition of Windows 3.1 (based on the Windows Latin 1 code page), that client will convert as many Latin 2–specific characters as it can and display the remaining characters as default characters. You can call GetCPInfo to determine the default character for a particular code page. In some cases the default character is a question mark, which is not a valid character in filenames, so be careful when mapping a character that might be part of a filename. The file system in Windows uses an underscore as the default character.
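The Czech-to-Swedish scenario can be reproduced in a few lines. This is a sketch only: it assumes Python's "cp1250" and "cp1252" codecs as stand-ins for the Windows Latin 2 and Latin 1 code pages, and the helper to_code_page is a hypothetical name. On Windows, the actual default character would be the one reported by GetCPInfo. Here it is simply passed in as a parameter.

```python
def to_code_page(text, code_page, default_char="?"):
    """Convert Unicode text to a code page, substituting default_char
    for every character the code page cannot represent."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(code_page)
        except UnicodeEncodeError:
            out += default_char.encode(code_page)
    return bytes(out)


# The Czech word for "solution": R-caron, e, s-caron, e, n, i-acute.
czech_text = "\u0158e\u0161en\u00ed"

# The Czech client's bytes are converted to Unicode for storage on the server.
stored = czech_text.encode("cp1250").decode("cp1250")

# The Swedish client converts what it can. The s-caron and i-acute exist
# in Windows Latin 1, but the R-caron does not and becomes the default.
swedish_bytes = to_code_page(stored, "cp1252")
displayed = swedish_bytes.decode("cp1252")

# For a filename, "?" would be invalid, so an underscore is used instead.
filename_bytes = to_code_page(stored + ".txt", "cp1252", default_char="_")
```

The conversion to Unicode loses nothing, whereas the conversion back to a different code page silently drops the one character that code page lacks.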

Like Windows NT, your software must interact seamlessly in a mixed environment. The basic approach involves adding compatibility features, such as data conversion from old file formats and data interchange with non-Unicode programs.

Not all system services, fonts, and tools will provide a uniform level of Unicode support in the near future. Even Windows NT currently falls short in ways that are sometimes confusing. For example, it allows you to sort text according to the Russian algorithm, but Cyrillic fonts are not part of the system's standard installation; you have to install the Lucida Sans Unicode font manually in order to display Russian text. (The exception to this is the International English edition available in Central and Eastern Europe, which does include Cyrillic fonts.) On the other hand, the Lucida Sans Unicode font contains Hebrew characters, but Windows NT does not currently support the layout of Hebrew text. In addition, Windows NT supports the Win32 API entry points for Unicode, but Windows 95 does not. These inconsistencies can be aggravating, but they will gradually disappear. In the long term, non-Unicode applications will become the bottlenecks that inconvenience users, just as applications that are not DBCS-enabled inconvenience many users today.