Describe Your Data

06/30/2006

Charlie Heinemann
Microsoft Corporation

July 20, 1999

I must admit that over the past year I've had mixed feelings concerning this whole validation thing. The well-formedness part I get. A few simple rules, when followed, allow a parser to understand where a tag begins and where it ends, what's a comment and what's text. The validation part, though, is a bit trickier. It does provide some obvious benefits, allowing you to describe your data and to define relationships within that data -- but document type definitions (DTDs) and XML schemas can be a bit foreign, and their purpose in many situations easily questioned. On top of all this, sorting out whether to use DTDs or schemas (not to mention which flavor of schemas) can be difficult.

The point of what follows is to clear up some of the questions that arise once you've moved past well-formedness and on to validation.

Why Describe My Data?

I say describe, rather than validate, because the term "validate" implies that the function of a DTD or schema is to validate the structure of your data. Validation is one function of a DTD or schema; it isn't, however, the only one. DTDs and schemas can also define data types and relationships within your data, something that can be useful even if validation seems unnecessary, so I prefer to say "describe" when it comes to explaining the function of DTDs and schemas.

Which leads me back to my original question of why describe? Because data authors across the Web need a way to understand the structure of the data that can be processed by your application. Data authors need something that tells them how you expect the data to look both when they receive it and when they send it back to you. For an example of how validation can be used within applications, check out my article on Internet Explorer 5 support for XML validation.

Why else? Because describing your data in more detail can give the consumers of that data the information that will greatly assist in processing it. For example, by providing data type and ID/IDREF information, you can relieve the data consumer of having to do type conversions (check out my data-typing article for more details), and can increase their performance when navigating to related nodes (see my article on ID/IDREF navigation).
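As a rough illustration of the ID/IDREF information mentioned above, the sketch below uses DTD declarations to mark one attribute as an ID and another as an IDREF that must point at it. The element and attribute names (orders, customer, order, id, buyer) are hypothetical and chosen only for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE orders [
  <!ELEMENT orders (customer+, order+)>
  <!ELEMENT customer (#PCDATA)>
  <!-- id must be unique within the document -->
  <!ATTLIST customer id ID #REQUIRED>
  <!ELEMENT order (#PCDATA)>
  <!-- buyer must match the id of some element in the document -->
  <!ATTLIST order buyer IDREF #REQUIRED>
]>
<orders>
  <customer id="c1">Pat</customer>
  <order buyer="c1">TG/DT Latte</order>
</orders>
```

With declarations like these, a consumer can follow the buyer value straight to the matching customer instead of searching the document for it.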

What Is a DTD?

DTDs are a method of providing "a grammar for a class of documents." This is the method of data description described within the World Wide Web Consortium (W3C) XML 1.0 Specification. Rather than go into the details (which can be found within the XML 1.0 spec), I'll give you a brief rundown of the relevant facts concerning DTDs:

DTDs describe XML documents.
DTDs can be used to validate XML documents and define ID/IDREF relationships.
DTDs employ a funky syntax where angle brackets, exclamation points, white space, parentheses, question marks, and asterisks are used to define which elements and attributes can go where and the contents they can contain.
DTDs are supported within the Internet Explorer 5 parser.

The following is an example of a DTD:

<!ELEMENT B (#PCDATA)>
<!ELEMENT A (B)>
<!ATTLIST A c CDATA #REQUIRED>

The following XML element would be a valid instance according to the above DTD:

<A c="foo">
  <B>9</B>
</A>
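One way to try this out, sketched here as an assumption rather than the only option, is to carry the DTD inside the document itself as an internal subset of the DOCTYPE declaration. A validating parser then sees the declarations and the instance together:

```xml
<?xml version="1.0"?>
<!DOCTYPE A [
  <!-- B holds text only -->
  <!ELEMENT B (#PCDATA)>
  <!-- A holds exactly one B -->
  <!ELEMENT A (B)>
  <!-- A must carry a c attribute with a string value -->
  <!ATTLIST A c CDATA #REQUIRED>
]>
<A c="foo">
  <B>9</B>
</A>
```

Dropping the c attribute, or adding a second B child, should make the same document fail validation.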

What Is an XML Schema?

An XML schema is used to describe XML elements and attributes (as opposed to XML documents). It basically consists of attribute and element type declarations that describe content models for XML elements and attributes within an instance document.

It serves much of the same purpose as a DTD. However, its functionality extends beyond that of a DTD.

The MSXML parser, released with Internet Explorer 5 and as a redistributable shortly after, supports the XML schemas described within XML-Data Reduced (a subset of the XML-Data proposal to the W3C). Such a schema can describe the same instance that the DTD above describes:

<A c="foo">
  <B>9</B>
</A>
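As a rough sketch, an XML-Data Reduced schema describing the A element above might look like the following. The two namespace URIs are the ones MSXML uses for XDR schemas and data types. Typing B as an int is an assumption drawn from the data-typing discussion later in this article, and the exact declarations are illustrative.

```xml
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <!-- B holds text only, typed as an integer -->
  <ElementType name="B" content="textOnly" dt:type="int"/>
  <!-- c is a string attribute -->
  <AttributeType name="c" dt:type="string"/>
  <!-- A holds element children only: one B, plus a required c attribute -->
  <ElementType name="A" content="eltOnly">
    <attribute type="c" required="yes"/>
    <element type="B"/>
  </ElementType>
</Schema>
```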

How Are XML Schemas Different from DTDs?

The big question becomes "When do I use XML schemas, and when do I opt for DTDs?" As when choosing any technology over another, you have to decide which gives you the most value both in the short term and over the long haul. To make such a decision, you need to know where the two technologies differ.

Instance Syntax

XML schemas are XML documents. Unlike DTDs, which have their own peculiar syntax, XML schemas are written in XML. This provides the user with three benefits.

First, you don't have to know two syntaxes to author a well-formed XML schema. Granted, you do have to learn the grammar rules for XML-Data Reduced. But you don't have to worry about two sets of well-formedness rules.

Second, tools can take advantage of the common syntax between XML documents and schemas to support both. Just as it is easier for you to know only one syntax, it's easier for you to build in support for one syntax rather than two. The support built into the parser for navigating XML documents, for instance, can also be used to navigate schemas. Unfortunately, DTDs cannot be navigated in the same fashion.

Third, being XML documents, XML schemas can be extended. You can add elements and attributes to XML schemas just as you would any XML document. As long as the elements and attributes are of a different namespace, they are legal within a schema.
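To make the extension point concrete, here is a minimal sketch of a schema that carries an extra element from another namespace. The urn:example:notes namespace, the note prefix, and the owner element are hypothetical, invented only to show the idea.

```xml
<Schema xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes"
        xmlns:note="urn:example:notes">
  <ElementType name="B" content="textOnly" dt:type="int">
    <!-- hypothetical extension element from a different namespace -->
    <note:owner>inventory team</note:owner>
  </ElementType>
</Schema>
```

Because the schema is ordinary XML, an application could read the note:owner element with the same parser calls it uses on any other document.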

Data Typing

DTDs allow you to type content only as a string. XML schemas allow you to type content as ints, floats, dates, Booleans, or a number of other simple data types. In the example above, the B element described by the schema contains an integer, while the B element described by the DTD contains a string. If you were then to write an application to process the contents of that element and required that the value of that element be an integer, you would have to convert it to an integer in the DTD case, whereas you could access that value directly as an integer in the schema case.

Open Content Model

XML schemas allow for an open content model, meaning that you can extend XML documents while not violating the validity constraints. Take the following XML:

<item>
  <name>TG/DT Latte</name>
  <quantity>1</quantity>
  <price>2.00</price>
</item>

The following DTD and schema could be used to describe this instance:

DTD:
<!ELEMENT name (#PCDATA)>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT item (name,quantity,price)>

schema:
<ElementType name="name"/>
<ElementType name="quantity" dt:type="int"/>
<ElementType name="price" dt:type="fixed.14.4"/>
<ElementType name="item" model="open">
  <element type="name"/>
  <element type="quantity"/>
  <element type="price"/>
</ElementType>

If you validate the above XML using the schema, you can add element children to the item element and it will still be valid, provided the elements added are valid within the context of their own namespace:

<item xmlns:myItem="urn:myItems">
  <name>TG/DT Latte</name>
  <quantity>1</quantity>
  <price>2.00</price>
  <myItem:time>10:21 PDT</myItem:time>
</item>

If you were to validate this element using the above DTD, you would get a validation error because "myItem:time" is not defined within the DTD.
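For contrast, XML-Data Reduced also lets the schema author rule out such additions by declaring the content model closed. A sketch, reusing the item declaration above with only the model attribute changed:

```xml
<!-- model="closed" limits item to the children declared here -->
<ElementType name="item" model="closed">
  <element type="name"/>
  <element type="quantity"/>
  <element type="price"/>
</ElementType>
```

With this declaration, the myItem:time child should be reported as a validation error, much as it is with the DTD.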

Namespace Integration

XML schemas integrate namespaces, allowing you to associate particular nodes in an instance document with type declarations within a schema. The only way to associate XML nodes with a DTD is through the DOCTYPE declaration. This is limiting, because only one DTD can be used per instance document. Multiple XML schemas may be used to describe a single XML document, because the XML schema doesn't describe the XML document itself, but XML elements within it.
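In MSXML, this association is made by declaring a namespace whose URI begins with x-schema: followed by the location of the schema file. The sketch below assumes two hypothetical schema files, itemSchema.xml and timeSchema.xml, each describing the elements in its own namespace.

```xml
<!-- default namespace: elements described by itemSchema.xml (hypothetical file) -->
<!-- myItem prefix: elements described by timeSchema.xml (hypothetical file) -->
<item xmlns="x-schema:itemSchema.xml"
      xmlns:myItem="x-schema:timeSchema.xml">
  <name>TG/DT Latte</name>
  <quantity>1</quantity>
  <price>2.00</price>
  <myItem:time>10:21 PDT</myItem:time>
</item>
```

Each element is checked against the schema tied to its namespace, which is how a single document can draw on more than one schema.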

Who's Doing What with Schemas?

BizTalk at http://www.biztalk.org/ is supporting XML schemas based on the XML-Data Reduced syntax. The eventual goal is to migrate to the W3C XML Schema standard, but in the meantime they are supporting the XML schemas that can be parsed by the MSXML parser. For more information on BizTalk, see XML: The Buzz on BizTalk by Robert Hess.

In addition to the support within the MSXML parser, tools vendors such as Extensibility are coming out with tools that support XML-Data Reduced schemas. An example is XML Authority 1.0, which allows you to start working with XML schemas.

The gist of the schema activity is that you can begin to seriously consider using XML schemas today. The tools support is here and will continue to grow, and industry support is strong. The benefits of schemas, and the ability to increase those benefits through extensibility, make them a good choice when describing your XML documents.

Charlie Heinemann is a program manager for Microsoft's XML team. Coming from Texas, he knows how to think big.