XML Encoding, DTDs and Namespaces, Binary Data, Namespace Identifiers, and More
Aaron Skonnard
Q When I load my XML document, I get this error: "Switch from current encoding to specified encoding not supported." Why?

A This error has to do with an inconsistency between the character encoding specified in the XML declaration and the actual character encoding used to serialize the XML document. All characters in XML come from the Universal Character Set (UCS), which associates a numerical code point with each character. Many algorithms, also known as character encodings, exist for converting code points into a sequence of bytes. A specific character encoding must be used to serialize an XML document. For example, if you type an XML document into Notepad and save it, you can choose from one of several supported character encodings including ISO-8859-1, UTF-8, or UTF-16.
      According to the XML 1.0 specification, all processors are required to automatically support (and detect) the UTF-8 and UTF-16 encodings. If you use one of these two encodings when serializing your documents, you don't need an XML declaration (unless you need to specify version or standalone information):
  <?xml version="1.0" encoding="UTF-8"?> <!-- optional -->
<foo/>

      If you use an encoding other than UTF-8/UTF-16, then you must use an XML declaration to specify the actual encoding used. This presents the standard chicken-or-the-egg problem: how can the processor read the encoding information without knowing what encoding was actually used?
      It's easy for processors to auto-detect UTF-8 and UTF-16, with or without an XML declaration, by looking for the byte order mark (BOM) required in UTF-16 documents. For all other encodings, you know that the first five characters must be "<?xml". Since a given processor will only support a finite set of encodings, a brute-force algorithm can be used that simply looks at the first few bytes to determine the family of the character encoding used (there are five possible encoding families: UTF-16 big-endian, UTF-16 little-endian, UCS-4 or another 32-bit encoding, EBCDIC, and everything else). Once the processor detects the encoding family, it can read the rest of the XML declaration (since only a restricted set of characters can appear there), then switch to the specified character encoding within the detected family. If, at this point, the XML declaration tells the processor to switch to an encoding from a completely different family, that error occurs.
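      Appendix F of the XML 1.0 specification lists the byte patterns a processor can use for this detection. A condensed summary:
  00 00 00 3C   UCS-4 (or other 32-bit encoding), big-endian
  3C 00 00 00   UCS-4 (or other 32-bit encoding), little-endian
  FE FF         UTF-16, big-endian (BOM)
  FF FE         UTF-16, little-endian (BOM)
  EF BB BF      UTF-8 with a BOM
  3C 3F 78 6D   UTF-8, ISO-8859-x, ASCII, or another ASCII-compatible encoding ("<?xm")
  4C 6F A7 94   EBCDIC ("<?xm" in EBCDIC)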
      In other words, any time you save a document using one encoding and then declare an incompatible encoding in the XML declaration, you'll get this error; the declared encoding must agree with the encoding that was actually used to serialize the document.
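      For example, suppose you type the following document into Notepad and save it with the Unicode (UTF-16) option, which writes a byte order mark and two bytes per character:
  <?xml version="1.0" encoding="ISO-8859-1"?>
<foo/>

The processor detects the UTF-16 family from the BOM, reads the declaration, and is then asked to switch to ISO-8859-1, a single-byte encoding from a different family, so it reports the error. Re-saving the file as UTF-8, or changing the declaration to match the actual encoding, fixes the problem.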

Q I'm confused about when the version/encoding values are required in the XML declaration. In some cases it seems like the encoding is optional, but then in other cases I get an "Invalid xml declaration" error when I leave it off.

A When you use an XML declaration in a document entity, the version information is required while the encoding information is optional. This is codified by production 23 of the XML 1.0 specification (note: a question mark (?) following a token means the token is optional; question marks inside quotes are part of the string literal):
  XMLDecl::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

      Any of the following XML declarations is acceptable when used in a document entity (but not in external entities):
  <?xml version='1.0'?>
<?xml version='1.0' standalone='yes'?>

      You can also use a similar declaration in external entities, only it's no longer called an XML declaration but rather a text declaration (see production 77 of the XML 1.0 specification):
  TextDecl::= '<?xml' VersionInfo? EncodingDecl S? '?>'

      In this case, the version information is optional while the encoding information is now required. So the following text declaration would be acceptable when used in an external entity (but not in the document entity):
  <?xml encoding='utf-16'?>

      In other words, the following XML document and the referenced external DTD appropriately use the XML declaration and the text declaration, respectively:
  *** foo.xml ***
<?xml version='1.0'?>
<!DOCTYPE foo SYSTEM "foo.dtd">
<foo/>
*** foo.dtd ***
<?xml encoding='UTF-8'?>
<!ELEMENT foo EMPTY>

Q How do I write a DTD for a namespace-aware document?

A The DTD is a relic from the SGML world that was developed long before namespaces. Therefore, DTDs know nothing about namespace declarations, namespace scoping, or how to deal with changing namespace prefixes. So to make this work, you must hardcode all namespace information into the DTD using standard element or attribute declarations as follows:
  <!DOCTYPE f:foo [
   <!ELEMENT f:foo EMPTY>
   <!ATTLIST f:foo xmlns:f CDATA #REQUIRED>
]>
<f:foo xmlns:f='urn:foo'/>

      Obviously, this forces you to always use the same namespace prefix, which in most cases isn't desirable. To get around this constraint, you must parameterize the namespace prefix in the DTD through a parameter entity. Then, in the instance document you can specify the prefix currently being used by overriding the parameter entity declaration. Doing so can be a complicated task, however, because of restrictions on where you can legally use parameter entities. Figure 1 and Figure 2 illustrate this process using a more practical invoice document.
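      As a rough sketch of the technique (the invoice vocabulary and file names here are hypothetical; Figure 1 and Figure 2 show a fuller version), the external DTD builds each qualified name from a %prefix; parameter entity, and the instance document redeclares %prefix; in its internal subset to select the prefix it actually uses:
  *** invoice.dtd ***
<!ENTITY % prefix "inv"> <!-- default prefix; an instance can override it -->
<!ENTITY % p.invoice "%prefix;:invoice">
<!ENTITY % p.xmlns "xmlns:%prefix;">
<!ELEMENT %p.invoice; (#PCDATA)>
<!ATTLIST %p.invoice; %p.xmlns; CDATA #REQUIRED>
*** invoice.xml ***
<?xml version='1.0'?>
<!DOCTYPE my:invoice SYSTEM "invoice.dtd" [
  <!ENTITY % prefix "my"> <!-- read before the external DTD, so it wins -->
]>
<my:invoice xmlns:my='urn:invoice'>100</my:invoice>

Because the internal subset is processed before the external DTD, the instance's declaration of %prefix; is the binding one, so %p.invoice; expands to my:invoice and the whole DTD follows whatever prefix the instance declares.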

Q I need to send a binary data stream within an XML document. I figured that I could put the binary data inside of a CDATA section, but that doesn't seem reliable because the binary data could easily contain "]]>".

A The designers of XML didn't intend for users to embed binary data within the XML document itself—unparsed entities and notations were designed for this purpose. Unparsed entities allow you to include the URI to the binary resource along with a notation that serves as the resource's type identifier:
  <!DOCTYPE foo [
  <!ENTITY mypic SYSTEM "https://foo.com/aaron.jpg" NDATA jpg>
  <!NOTATION jpg SYSTEM "urn:mime:img/jpg">
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo img ENTITY #REQUIRED>
]> 
<foo img="mypic"/>

      XML processors don't do anything with unparsed entities or notations except feed the information to the consuming application. It's up to the application itself to download the resource and process the binary data stream appropriately. If you find unparsed entities or notations cumbersome or simply don't want to rely on Document Type Definitions, you could easily provide the same information with schema-specific elements or attributes. For example, consider the following document:
  <person>
  <name>Aaron</name>
  <img 
   type="urn:mime:img/jpg">https://foo.com/aaron.jpg</img>
</person>

      If the schema for person specifies that the <img> element and type attribute are both of type URI reference, you've essentially accomplished the same result. Again, it's up to the application to download the referenced resource and process it appropriately.
      If you absolutely must include the raw binary data inside the XML document, the most common solution is to base64-encode the data before serializing it within element content. Simply wrapping the raw binary stream in a CDATA section isn't reliable since not all bytes are valid XML characters. For example, bytes 0x00-0x1F (except CR, LF, and TAB) aren't legal in XML documents at all, plus you have the CDATA end-marker problem that you mentioned in your question. Base64-encoding the binary data avoids all of these problems since the encoded output contains no illegal or markup characters. So, assuming the JPG image file has been base64-encoded, you could serialize the person document as shown in the following code:
  <person>
  <name>Aaron</name>
  <img>base64encodedimagegoeshere</img>
</person>
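      To see why the encoded form is safe, consider the first three bytes of a typical JPEG file, 0xFF 0xD8 0xFF. Base64 regroups the 24 bits into four 6-bit values and maps each value to a character drawn only from A-Z, a-z, 0-9, +, and /:
  bytes          FF       D8       FF
  bits           11111111 11011000 11111111
  6-bit groups   111111 111101 100011 111111   (63, 61, 35, 63)
  base64         /9j/

None of these output characters is markup-significant, so the encoded stream can sit safely in element content; the consuming application simply base64-decodes it to recover the original bytes.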

Q Do I have to be connected to the Internet to use a namespace identifier? When are they resolved?

A Namespace identifiers are simply opaque strings for uniquely identifying a namespace—they serve the same purpose as GUIDs in COM. Namespace identifiers happen to use the URI syntax for uniquely identifying resources. Since a namespace identifier is a URI, both URLs and URNs may be used. When developers first see namespace URIs in the form of URNs
  <foo xmlns="urn:uuid:4acc3490-b729-4e93-b4da-530506a6e1cc"/>

they typically don't get confused because URNs look like opaque identifiers. But when developers see a namespace URI in the form of a URL, they immediately feel compelled to resolve it:
  <foo xmlns="https://develop.com/schemas/soap"/>

In the XML space, however, namespace identifiers are simply opaque strings that conform to the URI specification and may or may not be dereferenceable.
      Schema authors may choose to make a namespace URI resolve to schema documentation, but that has nothing to do with how a processor interprets the namespace URI. This practice hasn't been standardized by the W3C and is a rather hotly debated topic, but it can assist developers working with the schema.

Q How do I create my own namespaces? When do I use one of the namespaces at the W3C as opposed to my own?

A It is the responsibility of the schema developer to define the schema's namespace identifier (XML Schema at https://www.w3.org/XML/Schema.html finally codifies this practice). So when you're using a schema that someone else defined, you must use his or her namespace identifier. When you're developing your own schemas, it's your responsibility to define the namespace identifiers.
      Processors or applications built for a specific schema hardcode the schema's namespace identifier into their functionality. Processors determine the role and type of each element and attribute by inspecting its local name plus namespace URI. For example, if you're using an XSLT 1.0 processor, you must specify the XSLT 1.0 namespace exactly as defined by the W3C:
  <xsl:transform  
 xmlns:xsl="https://www.w3.org/1999/XSL/Transform" 
 version="1.0">
•••
</xsl:transform>

If one character is off or uses the wrong case, the XSLT 1.0 processor won't recognize this document as a valid transformation.
      When you're generating your own schemas, it's up to you to define the namespace identifier that the rest of the world should use. Whether you choose URLs or URNs is a matter of preference, but you must guarantee their uniqueness. Organizations can guarantee the uniqueness of URLs by using their DNS-registered domain name (guaranteed to be unique) followed by a path string that need only be unique within that domain; organizations typically come up with a standard scheme for doing this. As you can see, the W3C forms its namespace identifiers using a fairly standard convention:
  http://www.w3.org/1999/XSL/Transform
http://www.w3.org/1999/XML/xinclude
http://www.w3.org/2000/10/XMLSchema
•••

      You can do the same thing with URNs. The syntax for a URN is:
  <URN> ::= "urn:" <NID> ":" <NSS>

NID is the namespace identifier which, like a domain name, must be globally unique; it's followed by a namespace-specific string (NSS) that need only be unique within that NID. To guarantee that the NID is unique, you register it with IANA, the central authority for registering Internet names. Then, as with URLs, you're responsible for guaranteeing uniqueness within the NID.
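      For example, a company that had registered the hypothetical NID "example-corp" could mint namespace identifiers like the following, where everything after the second colon is the NSS:
  <inv:invoice xmlns:inv="urn:example-corp:schemas:invoice"/>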

Q I am confused about which XSLT namespace to use with MSXML 3.0. Which one do you recommend?

A MSXML 3.0 supports two different versions of XSLT. It supports the version based on the December 1998 XSL Working Draft (https://www.w3.org/TR/1998/WD-xsl-19981216.html) as well as the final version of XSLT 1.0. These are the namespace identifiers for each version, respectively:
  http://www.w3.org/TR/WD-xsl
http://www.w3.org/1999/XSL/Transform

      So to answer your question, you should use the namespace identifier for the version of XSLT that you're targeting. XSLT 1.0 is much more powerful than the original working draft version, so today you would virtually always target it, unless you're writing transformations that must work on machines that only have Microsoft® Internet Explorer 5.0 (which didn't ship with MSXML 3.0, so you can't assume it's present) or MSXML 2.0.
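      As a rough sketch, the namespace declaration on the stylesheet's root element is what selects the dialect; these two skeletons (templates elided) target the working draft and XSLT 1.0, respectively:
  <!-- December 1998 working draft dialect -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
•••
</xsl:stylesheet>

<!-- XSLT 1.0 -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
•••
</xsl:stylesheet>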

Q What is the difference between XSL and XSLT?

A According to the W3C XSL page (https://www.w3.org/Style/XSL), XSL is a language used to express stylesheets. It consists of XSL Transformations (XSLT), which is the language used for transforming XML documents, and an XML vocabulary that is used to specify formatting semantics (XSL Formatting Objects).
      In the beginning, there was just a single specification called XSL that defined both parts plus XPath's predecessor, called XSL Patterns. As it became obvious that portions of the specification were useful independently, it was factored into several distinct specifications: XPath, XSLT, and FO. Today the term XSLT refers specifically to the transformation programming language, while the term XSL is often used loosely to refer to the formatting language (FO), although strictly speaking it refers to the combination of both languages.

Q When I execute my XSLT, all of the text in the source document is output. Why does this happen?

A XSLT defines several built-in template rules that are considered part of every stylesheet even if they aren't explicitly defined. The two interesting built-in templates are shown here:
  <!-- XSLT built-in template rules-->
<xsl:template match="*|/">
  <xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

The first built-in template allows recursive processing to continue when no explicit template matches an element (or the root node) selected by xsl:apply-templates. The second built-in template simply outputs the value of matched text and attribute nodes.
      For example, consider the following transform that contains a single template:
  <xsl:transform 
 xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
 version='1.0'>
  <xsl:template match='/'>
    <output><xsl:apply-templates select='node()'/></output>
  </xsl:template>
</xsl:transform>

      When the previous transform is applied to this document
  <person>
  <name>
    <first>Aaron</first>
    <last>Skonnard</last>
  </name>
  <age>82</age>
</person>

the output will be (ignoring whitespace):
  <output>AaronSkonnard82</output>

      After the explicit call to xsl:apply-templates, the processor tries to match the selected nodes (the root's children, in this case person). If no explicit template matches, the first built-in template executes and the processor recursively processes person's child nodes, and so on. Since there are no other explicit templates in this case, the only templates that match are the built-in ones. Once the processor recursively reaches the text nodes, the second built-in template executes, placing the text node values in the output.
      If you don't like this default behavior, you can always override the built-in templates with explicit templates of your own. For example, adding these empty templates to a stylesheet overrides the default behavior, suppressing both the recursive processing and the output of text and attribute nodes:
  <xsl:template match="*|/">
</xsl:template>
<xsl:template match="text()|@*">
</xsl:template>

Or, you can simply make sure that when you call xsl:apply-templates you only select nodes for which templates exist.
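      For example, this variation on the earlier transform selects only the <first> element and supplies an explicit template for it (the <first-name> wrapper is just illustrative), so none of the other text nodes ever reach the output:
  <xsl:template match='/'>
  <output><xsl:apply-templates select='person/name/first'/></output>
</xsl:template>
<xsl:template match='first'>
  <first-name><xsl:value-of select='.'/></first-name>
</xsl:template>

Applied to the person document shown earlier, this produces <output><first-name>Aaron</first-name></output>.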

Send questions and comments for Aaron to xmlfiles@microsoft.com.
Aaron Skonnard is an instructor and researcher at DevelopMentor, where he develops the XML curriculum. Aaron coauthored Essential XML (Addison-Wesley Longman, 2000) and wrote Essential WinInet (Addison-Wesley Longman, 1998). Get in touch with Aaron at https://staff.develop.com/aarons.

From the May 2001 issue of MSDN Magazine