From the July 2002 issue of MSDN Magazine

A Quick Guide to XML Schema-Part 2

Download the code for this article: TheXMLFiles0207.exe (40KB)

In the April 2002 issue I covered the basics of XML Schema, including how you define elements, attributes, complex types, and how to use the built-in data types. In this month's column I'll introduce the more powerful features of the language, which are related to defining custom types and type hierarchies.
Last time, I covered the process of describing a document's structure through xsd:element, xsd:attribute, and xsd:complexType definitions. As I mentioned in that earlier column, XML Schema defines a set of built-in data types that may be used to constrain the text content of elements and attributes.
Picking up where I left off, each XML Schema data type has an explicitly defined value space as well as an explicitly defined lexical space (in other words, the set of possible string formats used in the XML document). For example, the double value 4200 can be represented lexically in a variety of ways (see Figure 1).
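Figure 1 is not reproduced here, so as a rough illustration, each of the following text values is a legal lexical form of xsd:double and denotes the same value, 4200. The values and value element names are hypothetical and not part of the column's schemas; assume each value element is declared to be of type xsd:double.

```xml
<!-- Hypothetical wrapper; each value element is assumed to be of type xsd:double -->
<values>
  <value>4200</value>
  <value>4200.0</value>
  <value>+4200.00</value>
  <value>4.2E3</value>
  <value>42e2</value>
</values>
```

The strings differ, but they all map to a single point in the value space, which is why the two spaces are defined separately.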

Figure 1 Lexical Representation

Hence, when you specify that an element/attribute is of a given built-in type, the type's lexical space effectively controls what string formats are allowed in the document. For example, consider a schema that maps several local element declarations to XML Schema built-in types, as shown in Figure 2. When a schema processor validates an instance of this schema, it will ensure that the text contained within each of the elements/attributes conforms to a legal lexical representation of its defined type. The following instance is considered valid according to the schema:

<tns:employee xmlns:tns="http://example.org/employee/"
    id="555-12-3434">
  <name>Monica</name>
  <hiredate>1997-12-02</hiredate>
  <salary>42000.00</salary>
</tns:employee>

The following instance would be considered invalid because the hiredate and salary element values aren't valid representations of their corresponding types:

<tns:employee xmlns:tns="http://example.org/employee/"
    id="555-12-3434">
  <name>Monica</name>
  <hiredate>Dec 12, 2002</hiredate>
  <salary>42,000.00</salary>
</tns:employee>

The hiredate should be in the form of CCYY-MM-DD, while the salary shouldn't contain a comma. Both the value and lexical space details for a given type are defined in Part 2 of the XML Schema specification as well as in my book Essential XML Quick Reference (Addison-Wesley, 2001).
Although it seems that the set of built-in types should be sufficient, you'll surely encounter situations in which the most appropriate built-in data type won't do the job. For example, in the previous schema definition, the id attribute is defined to be of type string, but it must actually be in the format of a Social Security number for things to work properly. Obviously, the U.S. Social Security number format isn't universal enough to make it one of XML Schema's built-in types. To deal with situations like this without having to write your own validation code at the application level, XML Schema makes it possible to define custom simple types.

Deriving New Simple Types

XML Schema makes it possible to derive new simple types from existing simple types in a variety of ways. This is accomplished by using an xsd:simpleType element in the schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/employee/"
    targetNamespace="http://example.org/employee/">
  <xsd:simpleType name="socialSecurityNumber">
    <!-- define characteristics of new simple type here -->
  </xsd:simpleType>
  <xsd:attribute name="id" type="tns:socialSecurityNumber"/>
</xsd:schema>

Just as with complex types, you can either name simple types (through the name attribute) or define them anonymously within text-only element and attribute declarations. Typically, developers choose to name their new simple types so they can refer to them from multiple element/attribute declarations.
In this case, the name of the new simple type is socialSecurityNumber and it's automatically associated with the schema's targetNamespace as I discussed in the April issue. Hence, when referring to this type, you must use a qualified name, as shown in the id attribute declaration.
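For comparison, here is a minimal sketch of the anonymous alternative mentioned above. The simple type is nested inside the attribute declaration and has no name, so no other declaration can refer to it. The xsd:length facet is only a placeholder restriction chosen for illustration (an 11-character string); it is an assumption, not the definition the column goes on to use.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/employee/"
    targetNamespace="http://example.org/employee/">
  <!-- No type attribute: the type is defined inline and has no name -->
  <xsd:attribute name="id">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <!-- Placeholder facet for illustration: exactly 11 characters -->
        <xsd:length value="11"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:attribute>
</xsd:schema>
```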
You can base new simple types on existing types using three techniques: restricting a base type, creating a list of a given type, or defining a union of types. There is an element that represents each of these different derivation techniques, which can be nested directly within the xsd:simpleType element. These elements are xsd:restriction, xsd:list, and xsd:union (see Figure 3).
Deriving a new simple type by restriction is the most common case. This makes it possible to restrict the value space of an existing simple type to better meet your needs. Any instance of a restricted type is also a valid instance of the base type since the restricted type's value space is a subset of the base type's value space. Restricting the base type's value space also restricts the allowed lexical representations because there are fewer values to represent.
You specify the base type you want to restrict through the base attribute on xsd:restriction, as shown here:

<xsd:simpleType name="age">
  <xsd:restriction base="xsd:unsignedShort">
    <!-- restrict unsignedShort's value space here -->
  </xsd:restriction>
</xsd:simpleType>

You can restrict the base type's characteristics in a variety of ways. XML Schema makes this possible through type facets.

Type Facets

XML Schema defines a set of type characteristics, or facets, which can be used to restrict certain aspects of the base type. There are a total of 12 facets, but not all of them can be used on a given built-in type. Figure 4 lists facet descriptions.
Each facet makes it possible to restrict the value space of the base type in a different way. For example, you can use a combination of xsd:minExclusive, xsd:minInclusive, xsd:maxExclusive, and xsd:maxInclusive to define the allowed value range of number-based types. You can use a combination of xsd:length, xsd:maxLength, and xsd:minLength to control the length of string-based, binary-based, and list-based types. You can also use multiple xsd:enumeration elements to specify a fixed set of valid enumerated values. And for the Perl aficionados of the world, you've got xsd:pattern, which allows you to explicitly control the lexical pattern of a value through regular expressions.
Let's look at some examples. The schema in Figure 5 defines some new simple types based on existing types (including some defined in the same schema). The new xsd:simpleType definitions extend the XML Schema type system with new custom types that are more appropriate for the application at hand (see Figure 6).
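Figure 5 is not reproduced here. As a rough sketch of what such definitions look like, the following schema derives two of the age types named in this column by using range facets. The bounds on infantAge (up to 3) and recordAge (101 through 122) come from the text; the upper bound of 150 on age and the choice of tns:age as the base of the other two are assumptions, and the actual Figure 5 may differ.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/ages/"
    targetNamespace="http://example.org/ages/">

  <!-- Assumed upper bound; chosen only for illustration -->
  <xsd:simpleType name="age">
    <xsd:restriction base="xsd:unsignedShort">
      <xsd:maxInclusive value="150"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- 0 through 3: the lower bound is inherited from unsignedShort -->
  <xsd:simpleType name="infantAge">
    <xsd:restriction base="tns:age">
      <xsd:maxInclusive value="3"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- 101 through 122, both ends included -->
  <xsd:simpleType name="recordAge">
    <xsd:restriction base="tns:age">
      <xsd:minInclusive value="101"/>
      <xsd:maxInclusive value="122"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

Each derived type can only narrow the range it inherits, so every infantAge or recordAge value in this sketch is also a valid age value.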

Figure 6 The XML Schema Type System

Now the application can choose to use a more specific value space for a given element/attribute. Figure 7 illustrates the value spaces for each of the newly defined age types that derive from xsd:unsignedByte. As you can see, 18 is a valid value of teenAge, adultAge, normalAge, age, and unsignedByte, but it's not a valid value of infantAge or recordAge.

Figure 7 Age-derived Value Spaces

You can experiment with this using the validation utility I provided earlier. Just create an XML instance document like this

<age xmlns:tns="http://example.org/ages/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="tns:infantAge">2</age>

and run it through the validation utility, as shown here:

c:> validate age.xml -s ages.xsd

Then you can experiment with the different value spaces by changing the type name specified in xsi:type along with the value embedded in the start/end tags. I've provided the validate.js utility, ages.xsd, and age.xml for you to download from the link at the top of this article.
It's also worth mentioning again that you can use these simple types within xsd:complexType definitions as with any other built-in type. For example, you can represent people who have lived longer than anyone on record through the following complex type definition that uses the recordAge type from Figure 5:

<xsd:complexType name="oldPerson">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="age" type="tns:recordAge"/>
  </xsd:sequence>
</xsd:complexType>
<xsd:element name="oldie" type="tns:oldPerson"/>

The following two elements are valid instances of the oldPerson type and happen to represent the oldest man and woman in recorded history (according to the Guinness Book of World Records):

<tns:oldie xmlns:tns="...">
  <name>Antonio Todde</name>
  <age>112</age>
</tns:oldie>

<tns:oldie xmlns:tns="...">
  <name>Jeanne-Louise Calment</name>
  <age>122</age>
</tns:oldie>

Now any instance of oldPerson that has an age value less than 101 or greater than 122 is invalid according to the schema.
It's also possible to restrict date ranges just as with numbers. The following schema fragment defines a new simple type that restricts the base type xsd:date to include only the date values of the 2002 Olympic events:

<xsd:simpleType name="olympicDates">
  <xsd:restriction base="xsd:date">
    <xsd:minInclusive value="2002-02-08"/>
    <xsd:maxInclusive value="2002-02-24"/>
  </xsd:restriction>
</xsd:simpleType>
<xsd:element name="date" type="tns:olympicDates"/>

The following element is a valid instance of the olympicDates type:

<tns:date xmlns:tns="...">2002-02-13</tns:date>

Anything outside of the specified date range (2/8/2002 through 2/24/2002) is invalid according to this type.
As a final example of restricting simple types, go back to the Social Security number problem from the first schema. In that schema, the id attribute was defined to be of type xsd:string, as shown here:

<xsd:attribute name="id" type="xsd:string"/>

But the values should really be in the format of a U.S. Social Security number, which looks like this: 555-12-3434.
You can define a new type derived from xsd:string that restricts the value space to this specific lexical pattern, which can be described by the regular expression \d{3}\-\d{2}\-\d{4}, as shown in the following xsd:simpleType definition:

<xsd:simpleType name="socialSecurityNumber">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\d{3}\-\d{2}\-\d{4}"/>
  </xsd:restriction>
</xsd:simpleType>
<xsd:attribute name="id" type="tns:socialSecurityNumber"/>

Now, since I defined the id attribute to be of this new type instead of the base type, xsd:string, the schema processor can take over the process of validating the value's lexical format.
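Two details of xsd:pattern are worth a small sketch. A pattern must match the entire value, so no start or end anchors are needed, and several xsd:pattern elements in the same restriction are treated as alternatives. The type name below is hypothetical and the second pattern (nine digits with no hyphens) is an assumption added only to show the alternation; it is not part of the column's schema.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/employee/"
    targetNamespace="http://example.org/employee/">
  <!-- Hypothetical type: accepts 555-12-3434 or 555123434 -->
  <xsd:simpleType name="flexibleSocialSecurityNumber">
    <xsd:restriction base="xsd:string">
      <!-- A value is valid if it matches either pattern in full -->
      <xsd:pattern value="\d{3}\-\d{2}\-\d{4}"/>
      <xsd:pattern value="\d{9}"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

A value such as "ID 555-12-3434" would be rejected by either pattern because the match has to cover the whole string.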
In addition to deriving simple types by restriction, you can also derive lists and unions from base types.

Lists and Unions of Simple Types

The concept of deriving lists and unions from base types is much simpler than xsd:restriction because you're not actually changing the value spaces of the base types. Instead, you're defining a new type that is either a list of values for a single type (xsd:list) or a single value from one of several types (xsd:union).
The following schema fragment defines a new simple type that is a list of recordAge values:

<xsd:simpleType name="listOfAges">
  <xsd:list itemType="tns:recordAge"/>
</xsd:simpleType>
<xsd:element name="ages" type="tns:listOfAges"/>

A valid instance of this new type must contain a whitespace-delimited list of valid recordAge values, as shown here:

<tns:ages xmlns:tns="...">112 122 119</tns:ages>

The following schema fragment defines a new simple type that is a union of two existing types, namely infantAge and recordAge:

<xsd:simpleType name="extremeAge">
  <xsd:union memberTypes="tns:infantAge tns:recordAge"/>
</xsd:simpleType>
<xsd:element name="extremeAgeRanges" type="tns:extremeAge"/>

This means that an instance of extremeAge must contain a value from within either the infantAge or recordAge value spaces. For example, the following extremeAgeRanges element is valid because it contains the value of 2, which is within the infantAge value space defined as up to three years old:

<tns:extremeAgeRanges xmlns:tns="...">2</tns:extremeAgeRanges>

It would also be valid if it had a value from 101 through 122, at the other end of the spectrum.
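A list type can itself serve as the base of a restriction, in which case the length facets count list items rather than characters. The sketch below shows this under stated assumptions: topThreeAges is a hypothetical name, and recordAge is redefined here directly on xsd:unsignedShort only to keep the schema self-contained.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/ages/"
    targetNamespace="http://example.org/ages/">

  <!-- Assumed base type; the 101-122 range comes from the text -->
  <xsd:simpleType name="recordAge">
    <xsd:restriction base="xsd:unsignedShort">
      <xsd:minInclusive value="101"/>
      <xsd:maxInclusive value="122"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name="listOfAges">
    <xsd:list itemType="tns:recordAge"/>
  </xsd:simpleType>

  <!-- Hypothetical type: a list of at most three recordAge values -->
  <xsd:simpleType name="topThreeAges">
    <xsd:restriction base="tns:listOfAges">
      <xsd:maxLength value="3"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```

With this definition, the content "112 122 119" is a valid topThreeAges value, while "112 122 119 115" is not because it holds four items.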
That wraps up most of what you can do when deriving new simple types from existing simple types.

Deriving New Complex Types

In a similar fashion, you can continue extending your type hierarchy by deriving new complex types from existing simple/complex types. This is accomplished through a new xsd:complexType element that has either an xsd:simpleContent or xsd:complexContent element child.
xsd:simpleContent indicates that the new complex type is being derived from an existing type with simple content (just text content) while xsd:complexContent indicates that it's being derived from an existing type with complex content:

<xsd:complexType name="newDerivedComplexType">
  <!-- use xsd:simpleContent to derive from simple type -->
  <!-- use xsd:complexContent to derive from complex type -->
</xsd:complexType>

Unlike simple types, you can derive complex types by either restriction or extension. You indicate the type of derivation by nesting either the xsd:restriction or xsd:extension elements within the xsd:simpleContent or xsd:complexContent elements, again depending on the base type.
When deriving by restriction, an instance of the derived type is always a valid instance of the base type. This means that each member of the derived type must have the same, or narrower, value space and occurrence constraints as that of the base type. You can restrict the facets or occurrence constraints of any complex type member using the techniques that I described earlier in this column for simple types.
Derivation by restriction isn't supported in most programming languages, although derivation by extension is. When deriving by extension, you cannot modify the facets of the base type. You can only extend its content. The content you define in the derived type logically follows the content of the base type. In this case, instances of derived types are typically not valid instances of the base type.
It's not possible to derive by both restriction and extension in a single definition. You can get this functionality, however, through two separate complex type definitions, one that derives by extension, and the other that restricts the new "extended" type.

Derivation by Extension

First I'll look at deriving complex types by extension, since that is what's typically supported in today's OO programming languages. Take a look at the complete schema definition shown in Figure 8. This schema defines the characteristics of the employee element through a factored type hierarchy.
The first example of deriving by extension can be found in the annotatedAge complex type definition. The xsd:simpleContent element within annotatedAge indicates that it's deriving from a type with simple content. The xsd:extension element within xsd:simpleContent indicates that the new type is extending the base type xsd:unsignedShort. The content of xsd:extension indicates how the base type is to be extended, in this case by adding a required unit attribute.
Since complex types can only be applied to elements, annotatedAge can only be used by elements. Hence, the annotatedAge type requires an element's content to be an unsignedShort value and, at the same time, the element must have a unit attribute with an appropriate ageUnits value.
This is really the only case in which you'd extend a simple type into a complex type, that is, when you want to associate attributes with existing simple types.
The xsd:complexType definition for Employee provides a more common example of deriving by extension from an existing complex type. In this case, I used xsd:complexContent to indicate that I'm deriving from an existing complex type, along with xsd:extension to indicate that I'm deriving by extension from Person. The content placed within xsd:extension logically follows the content of Person. The following element is a valid instance of the Employee type:

<tns:employee xmlns:tns="http://example.org/employee/">
  <name>Bob Smith</name>
  <sex>male</sex>
  <age unit="years">29</age>
  <ssnum>323-12-9897</ssnum>
  <salary>31432.50</salary>
</tns:employee>
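
Figure 8 is not reproduced here. The following is a minimal sketch of type definitions that are consistent with the description and with the instance above, not the column's actual schema. The type names annotatedAge, ageUnits, Person, and Employee come from the text; the enumerated unit values other than "years" and the use of xsd:string and xsd:double for sex, ssnum, and salary are assumptions.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/employee/"
    targetNamespace="http://example.org/employee/">

  <!-- Assumed unit values; only "years" appears in the column -->
  <xsd:simpleType name="ageUnits">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="years"/>
      <xsd:enumeration value="months"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Simple content (an unsignedShort) extended with a required attribute -->
  <xsd:complexType name="annotatedAge">
    <xsd:simpleContent>
      <xsd:extension base="xsd:unsignedShort">
        <xsd:attribute name="unit" type="tns:ageUnits" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:complexType name="Person">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="sex" type="xsd:string"/>
      <xsd:element name="age" type="tns:annotatedAge"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Employee's own content follows the content inherited from Person -->
  <xsd:complexType name="Employee">
    <xsd:complexContent>
      <xsd:extension base="tns:Person">
        <xsd:sequence>
          <xsd:element name="ssnum" type="xsd:string"/>
          <xsd:element name="salary" type="xsd:double"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:element name="employee" type="tns:Employee"/>
</xsd:schema>
```

The child elements are declared locally and the schema leaves elementFormDefault at its default, which is why name, sex, and the rest appear without a namespace prefix in the instance while the global employee element is qualified.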

Derivation by Restriction

Deriving complex types by restriction works a bit differently. Instead of extending the content model of the base type, you're restricting the value spaces and occurrence constraints of a complex type's members. When deriving by restriction from existing complex types you must list all of the base type's members, as well as any from ancestor types, and specify the same or narrower value spaces and occurrence constraints for each one.
Take a look at the sections of the complete schema definition that are shown in Figure 9.
The annotatedAgeRestricted complex type restricts the value space of the base type, annotatedAge, to a range of 0-100 inclusive. Notice that the xsd:complexType definition contains the xsd:simpleContent element to indicate that the base type contains simple content. And the content of xsd:restriction is just like what I covered for simple type restrictions.
As another example, the complex type definition for NamedPerson restricts the AnonymousPerson complex type. It does so by narrowing the occurrence constraints on the name element (now it's mandatory whereas before it was optional) and by reducing the value space of the age element through a more specific type, annotatedAgeRestricted.
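Figure 9 is not reproduced here. The sketch below shows the two restrictions just described in a self-contained form. The type names come from the text, but the exact content of AnonymousPerson is an assumption (only the name and age members are shown), and the unit attribute is typed as xsd:string purely to keep the example short.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="http://example.org/employee/"
    targetNamespace="http://example.org/employee/">

  <xsd:complexType name="annotatedAge">
    <xsd:simpleContent>
      <xsd:extension base="xsd:unsignedShort">
        <xsd:attribute name="unit" type="xsd:string" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <!-- Narrows the text content to 0-100; the unit attribute is inherited -->
  <xsd:complexType name="annotatedAgeRestricted">
    <xsd:simpleContent>
      <xsd:restriction base="tns:annotatedAge">
        <xsd:maxInclusive value="100"/>
      </xsd:restriction>
    </xsd:simpleContent>
  </xsd:complexType>

  <!-- Assumed base content: name is optional here -->
  <xsd:complexType name="AnonymousPerson">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string" minOccurs="0"/>
      <xsd:element name="age" type="tns:annotatedAge"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Every member is listed again with the same or narrower constraints -->
  <xsd:complexType name="NamedPerson">
    <xsd:complexContent>
      <xsd:restriction base="tns:AnonymousPerson">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="age" type="tns:annotatedAgeRestricted"/>
        </xsd:sequence>
      </xsd:restriction>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```

Because name changes from optional to required and age moves to a type derived from annotatedAge, any valid NamedPerson content is also valid AnonymousPerson content, which is the condition a restriction has to satisfy.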

Substitution

Substitution is one of the main benefits of defining type hierarchies. XML Schema makes it possible to perform two different types of substitution: type-based and element-based.
Type-based substitution allows you to substitute a derived type for a base type in the instance document. This technique relies on xsi:type and is actually the technique I used earlier while working with the age hierarchy. For example, consider the following schema fragment that defines a global personAge element to be of type age (and assume the age type hierarchy from earlier is also part of this schema):

<!-- definitions for age types omitted -->
<xsd:element name="personAge" type="tns:age"/>

When a schema processor encounters the personAge element, it will assume that it's of type age. Using type-based substitution, the instance can provide a more specific type that derives from age as shown here:

<tns:personAge xmlns:tns="..."
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:type="tns:infantAge">1</tns:personAge>

Now the processor treats personAge as an infantAge instance and the more specific facets apply.
Type-based substitution is similar to a type cast in most programming languages, which many programmers want to avoid, and it adds the extra burden of having to sprinkle xsi:type throughout your instances. As an alternative, you can perform element-based substitution by providing some additional information in the schema.
Element-based substitution requires you to define what are known as substitution groups in the schema. A substitution group specifies a set of elements that can be swapped wherever a given element, also known as the head of the substitution group, is referenced. An element indicates that it's a member of another element's substitution group through the substitutionGroup attribute. For example, the following schema fragment defines three additional elements, each of which joins the personAge substitution group:

<!-- personAge is head of substitution group -->
<xsd:element name="personAge" type="tns:age"/>
<!-- the following elements belong to the personAge substitution group -->
<xsd:element name="infantAge" type="tns:infantAge"
    substitutionGroup="tns:personAge"/>
<xsd:element name="teenAge" type="tns:teenAge"
    substitutionGroup="tns:personAge"/>
<xsd:element name="adultAge" type="tns:adultAge"
    substitutionGroup="tns:personAge"/>

Now, any of these elements can be legally used wherever personAge is referenced. For example, assuming the following complexType

<xsd:element name="ages">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="tns:personAge" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

the following instance is valid:

<tns:ages xmlns:tns="...">
  <tns:adultAge>55</tns:adultAge>
  <tns:teenAge>14</tns:teenAge>
  <tns:personAge>20</tns:personAge>
  <tns:infantAge>1</tns:infantAge>
  <tns:adultAge>32</tns:adultAge>
</tns:ages>

Notice you're able to use any of the elements in personAge's substitution group without resorting to xsi:type. This technique lets you develop substitution groups over time seamlessly.

Where To?

This column, along with the April installment, covered all major concepts and functionality offered by XML Schema. I encourage you to continue exploring the exciting new world of XML made possible by XML Schema. On your journey, you may want to take a look at the W3C XML Schema Recommendation, Part 0 (Primer), Part 1 (Structures), and Part 2 (Datatypes). Also of interest is the House of Web Services column by Don Box in the November 2001 issue and the February 2002 issue of this magazine.

Send questions and comments for Aaron to xmlfiles@microsoft.com.

Aaron Skonnard is an instructor/researcher at DevelopMentor, where he develops the XML and Web service-related curriculum. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000). Get in touch with Aaron at http://staff.develop.com/aarons.