W3C XML Schema Design Patterns: Avoiding Complexity

 

Dare Obasanjo
Microsoft Corporation

Originally published on https://www.xml.com.

January 2003

Applies to:
   W3C XML Schema

Summary: A counterpoint to Kohsuke Kawaguchi's article on guidelines for working with W3C XML Schema that revisits each of his original guidelines and offers agreement, disagreement, or clarification of each major point. Provides additional guidelines for what you should and should not do when working with W3C XML Schema. (25 printed pages.)

Contents

Introduction
The Guidelines
   Why you should use global and local element declarations
   Why you should use global and local attribute declarations
   Why you should understand how XML namespaces affect the W3C XML Schema
   Why you should always set elementFormDefault to "qualified"
   Why you should use attribute groups
   Why you should use model groups
   Why you should use the built-in simple types
   Why you should use complex types
   Why you should not use notation declarations
   Why you should use substitution groups carefully
   Why you should favor key/keyref/unique over ID/IDREF for identity constraints
   Why you should use chameleon schemas carefully
   Why you should not use default or fixed values especially for types of xs:QName
   Why you should use restriction and extension of simple types
   Why you should use extension of complex types
   Why you should use restriction of complex types carefully
   Why you should use abstract types carefully
   Do use wildcards to provide well-defined points of extensibility
   Do not use group or type redefinition
Conclusion
Acknowledgements

Introduction

Many new and not-so-new schema authors have struggled with various aspects of the W3C XML Schema language over the past year. Given the size and relative complexity of the W3C XML Schema recommendation (part one and part two), it seems that many schema authors would be best served by understanding and utilizing an effective subset of the features provided by W3C XML Schema instead of attempting to comprehend all of the esoterica and minutiae of the language.

There have been a few public attempts to define an effective subset of W3C XML Schema for general usage, most notably the paper, W3C XML Schema Made Simple by Kohsuke Kawaguchi, and the X12 Reference Model for XML Design by the Accredited Standards Committee (ASC) X12. However, both documents are extremely conservative and conclude by advising against useful features of the W3C XML Schema language without adequately describing the costs of doing without these features.

The Guidelines

The following are Kohsuke's original guidelines, amended with additional recommendations:

  • Do use element declarations, attribute groups, model groups, and simple types.
  • Do use XML namespaces as much as possible; learn the correct way to use them.
  • Do not try to be a master of XML schema. It would take months.
  • Do (previously not in Kohsuke's original guidelines) use complex types and attribute declarations.
  • Do not use notations.
  • Do (previously not in Kohsuke's original guidelines) use local declarations.
  • Do (previously not in Kohsuke's original guidelines) use substitution groups carefully.
  • Do (previously not in Kohsuke's original guidelines) use a schema without the targetNamespace attribute (that is, chameleon schema) carefully.

In the amendments to the guidelines, there is some disagreement with the directives advising against the usage of complex types, attribute declarations, and local declarations. Also included are clarifications to the guidelines on using chameleon schemas and substitution groups, highlighting their advantages and disadvantages.

Last, the following guidelines are proposed in addition to the amended versions of Kohsuke's original guidelines:

  • Do favor key/keyref/unique over ID/IDREF for identity constraints.
  • Do not use default or fixed values, especially for types of xs:QName.
  • Do not use type or group redefinition.
  • Do use restriction and extension of simple types.
  • Do use extension of complex types.
  • Do use restriction of complex types carefully.
  • Do use abstract types carefully.
  • Do use elementFormDefault set to qualified and attributeFormDefault set to unqualified.
  • Do use wildcards to provide well-defined points of extensibility.

If you are a novice user, it is best to avoid the features whose guidelines are qualified with the word carefully unless the problem you seek to solve requires their use. The following discussion provides a rationale for the aforementioned recommendations.

Why you should use global and local element declarations

An element declaration is used to specify the structure, type, occurrence and value constraints for an element. The element declaration is the most important and commonly used building block in a schema document.

Element declarations that appear as children of the xs:schema element are considered to be global elements. Global elements can be reused by referencing them in other parts of the schema or from other schema documents. Another important characteristic of global elements is that they can be members of substitution groups. Since the W3C XML Schema recommendation does not provide a mechanism for specifying the root element of the document being validated, any global element can be used as the root element for a valid document.

Element declarations that appear within complex type or model group definitions that are not references to a global element are considered to be local elements. Unlike global elements, there can be many local element declarations with the same name and differing types in a schema as long as the local elements are not declared at the same level. Section 3.3 of the W3C XML Schema Primer gives the following example:

You can only declare one global element called "title", and that element is bound to a single type (e.g., xs:string or PersonTitle). However, you can locally declare one element called "title" that has a string type, and is a subelement of "book". Within the same schema (target namespace) you can declare a second element also called "title" that is an enumeration of the values "Mr Mrs Ms".

Global element declarations should be used for elements that will be reused from the target schema as well as from other schema documents—when the element and its associated type are comfortably bound together for widespread use. Local elements should be favored when element declarations only make sense in the context of the declaring type and need not be exposed for reuse.

By default, global elements have a namespace name equivalent to that of the target namespace of the schema, while local elements have no namespace name. This means that by default elements in an XML document to be validated against global element declarations should have their namespace name equal to that of the global element's schema target namespace while those to be validated against local elements should have no namespace name. This is illustrated in the following example.

test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="https://www.example.com"
 xmlns="https://www.example.com">

 <!-- global element declaration validates <language> elements from 
         https://www.example.com namespace  -->
 <xs:element name="language" type="xs:string" />
 <xs:element name="Root" type="sequenceOfLanguages" />
 <xs:element name="Root2" type="sequenceOfLanguages2" />
 
 <!-- complex type with local element declaration  validates <language> 
         elements without a namespace name -->
 <xs:complexType name="sequenceOfLanguages" >  
  <xs:sequence>
   <xs:element name="language" type="xs:NMTOKEN" maxOccurs="unbounded" />
  </xs:sequence>
 </xs:complexType>

 <!-- complex type with reference to global element declaration -->
  <xs:complexType name="sequenceOfLanguages2" >  
  <xs:sequence>
   <xs:element ref="language" maxOccurs="10" />
  </xs:sequence>
 </xs:complexType>
</xs:schema>

test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="https://www.example.com">
 <language>EN</language> 
</ex:Root> 

test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="https://www.example.com">
 <ex:language>English</ex:language> 
 <ex:language>Klingon</ex:language> 
</ex:Root2> 

Why you should use global and local attribute declarations

An attribute declaration is used to specify the type, optionality and default information for an attribute.

Attribute declarations that appear as children of the xs:schema element are considered to be global attributes. Global attributes can be reused by referencing them in other parts of the schema or from other schema documents. Attribute declarations that appear within complex type definitions that do not reference global attributes are considered to be local attributes.

Global attribute declarations should be used for attributes that will be reused from the target schema as well as from other schema documents. Local attributes should be favored when attribute declarations only make sense in the context of the declaring type and need not be exposed for reuse. Because attributes are usually tightly coupled to their parent elements, schema authors typically favor local attribute declarations. However, there are cases where global attributes that can apply to many elements from multiple namespaces are useful, such as xsi:type and xsi:schemaLocation.

Note   By default, global attributes have a namespace name equivalent to that of the target namespace of the schema, while local attributes have no namespace name. This means that by default attributes in an XML document to be validated against global attribute declarations should have their namespace name equal to that of the global attribute's schema target namespace, while those to be validated against local attributes should have no namespace name. This is illustrated in the following example.

test.xsd
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="https://www.example.com" 
 xmlns="https://www.example.com">

 <!-- global attribute declaration validates language attributes from 
         https://www.example.com namespace  --> 
 <xs:attribute name="language" type="xs:string" />
 <xs:element name="Root" type="sequenceOfNotes" />
 <xs:element name="Root2" type="sequenceOfNotes2" />

 <!-- complex type with local attribute declaration validates language 
         attributes without a namespace name -->
 <xs:complexType name="sequenceOfNotes" >  
  <xs:sequence>
   <xs:element name="Note" type="xs:string" />
  </xs:sequence>
  <xs:attribute name="language" type="xs:NMTOKEN"  /> 
 </xs:complexType>

 <!-- complex type with reference to global attribute declaration -->
  <xs:complexType name="sequenceOfNotes2" >  
  <xs:sequence>
   <xs:element name="Note" type="xs:string" />
  </xs:sequence>
  <xs:attribute ref="language" />
 </xs:complexType>
</xs:schema>



test.xml
<?xml version="1.0"?>
<ex:Root xmlns:ex="https://www.example.com" language="EN" >
 <Note>Nothing to see here</Note> 
</ex:Root> 

test2.xml
<?xml version="1.0"?>
<ex:Root2 xmlns:ex="https://www.example.com" ex:language="The English Language">
 <Note>Nothing to see here</Note> 
</ex:Root2> 

Why you should understand how XML namespaces affect the W3C XML Schema

Support for XML namespaces is woven tightly into the W3C XML Schema recommendation. Namespaces are used in a number of places, such as when:

  • Referencing global elements, attributes or types
  • Using XPath expressions for the identity constraints
  • Determining what elements and attributes the declarations in the schema can validate
  • Importing and including other schema documents

For this reason, it is advisable for schema authors to be familiar with how namespaces work as well as how they specifically affect W3C XML Schema. The MSDN article XML Namespaces and How They Affect XPath and XSLT provides a detailed overview of the ins and outs of XML namespaces, while the article Working with Namespaces in XML Schema goes over critical information that explains the ramifications of namespaces in W3C XML Schema.

Why you should always set elementFormDefault to "qualified"

Previous sections mentioned that, by default, global declarations validate elements or attributes with a namespace name, while local declarations validate elements or attributes without one. Elements or attributes with a namespace name are described as namespace qualified. The default behavior of local declarations can be overridden: the xs:schema element has the elementFormDefault and attributeFormDefault attributes, which specify whether local declarations in the schema should validate namespace-qualified elements and attributes, respectively. The valid values for either attribute are qualified and unqualified, and the default value of both is unqualified.

Also, the form attribute on local element and attribute declarations can be used to override the values of the elementFormDefault and attributeFormDefault attributes specified on the xs:schema element. This allows for finer-grained control of how validation of elements and attributes in the instance document should operate in relation to global versus local declarations.
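As a minimal sketch of such an override (the element names are borrowed from the example that follows, and the mixed form setting is an illustrative assumption rather than a recommended style), the schema below leaves elementFormDefault at unqualified but uses form to require namespace qualification for a single local element:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           elementFormDefault="unqualified">

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- form overrides elementFormDefault for this declaration only:
             familyName must be namespace qualified in the instance -->
        <xs:element name="familyName" type="xs:string" form="qualified" />
        <!-- no form attribute, so the schema-wide default (unqualified) applies -->
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under these assumptions, a conforming instance has to mix qualified and unqualified children, which shows why per-declaration overrides are usually best kept to a minimum:

```xml
<foo:person xmlns:foo="https://www.example.com">
  <foo:familyName>KAWAGUCHI</foo:familyName>
  <firstName>Kohsuke</firstName>
</foo:person>
```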

The following example taken from the "Why You Should Avoid Local Declarations" section in Kohsuke's article shows exactly how these attributes can significantly affect the outcome of validation.

The schema:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="https://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

validates the following document:

<foo:person xmlns:foo="https://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</foo:person>

...and is not usually what was intended by the schema author and is rather ugly to boot. Altering the schema to:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="https://example.com"
      elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

...allows it to validate the following document:

<person xmlns="https://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</person>

—or—

<foo:person xmlns:foo="https://example.com">
  <foo:familyName> KAWAGUCHI </foo:familyName>
  <foo:firstName> Kohsuke </foo:firstName>
</foo:person>

Leaving the attributeFormDefault attribute with its unqualified default value is desirable because most schema authors do not want to have to explicitly namespace qualify all attributes by prefixing them.

Why you should use attribute groups

An attribute group definition is a way to create a named collection of attribute declarations and attribute wildcards. Attribute groups aid modularity of schemas by allowing one to declare commonly used sets of attributes in a single location and then to reference them from one or more schemas.

Note   When Kohsuke's article describes attribute groups as an alternative to global attribute declarations, it may give the incorrect impression that the two are mutually exclusive. A globally declared attribute is an individual, reusable attribute declaration. An attribute group, on the other hand, is a modularly clustered set of attributes; the attribute declarations in an attribute group can be either local attribute declarations or references to global declarations. Thus, Kohsuke's article is not entirely accurate when it describes attribute groups as an alternative to global attribute declarations.
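The following sketch illustrates the point made in the note. The names commonAttributes, id, and lastModified are hypothetical and chosen only for illustration. The attribute group combines a reference to a global attribute with two local attribute declarations, and an element then pulls in the whole set with a single reference:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           xmlns:ex="https://www.example.com">

  <!-- a global attribute declaration, reusable on its own -->
  <xs:attribute name="language" type="xs:string" />

  <!-- a named collection mixing a reference to the global attribute
       with two local attribute declarations -->
  <xs:attributeGroup name="commonAttributes">
    <xs:attribute ref="ex:language" />
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="lastModified" type="xs:date" />
  </xs:attributeGroup>

  <!-- the element's type picks up all three attributes at once -->
  <xs:element name="Note">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attributeGroup ref="ex:commonAttributes" />
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

As described earlier for global and local attributes, the referenced global attribute appears namespace qualified in an instance (for example, ex:language), while id and lastModified appear without a prefix.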

Why you should use model groups

A model group definition is a mechanism for creating named groups of elements using the all, choice, or sequence compositors. Model groups are useful for reusing groups of elements in situations where one wants to avoid the baggage of complex types (specifically type derivation). However, model groups are not a replacement for complex types because they cannot contain attribute declarations nor can they be specified as the type of an element declaration. Additionally, derivation of model groups is much more limited than is the derivation of complex types.
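A short sketch of these trade-offs, assuming a hypothetical nameParts group and AuthorType type: the group is reused inside both a named complex type and an anonymous one, but the attribute has to be declared on the complex type, and an element's type attribute can refer only to a type, never to a group.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           xmlns:ex="https://www.example.com"
           elementFormDefault="qualified">

  <!-- a named model group: elements only, no attributes -->
  <xs:group name="nameParts">
    <xs:sequence>
      <xs:element name="firstName" type="xs:string" />
      <xs:element name="familyName" type="xs:string" />
    </xs:sequence>
  </xs:group>

  <!-- a named complex type that reuses the group and adds to it -->
  <xs:complexType name="AuthorType">
    <xs:sequence>
      <xs:group ref="ex:nameParts" />
      <xs:element name="biography" type="xs:string" minOccurs="0" />
    </xs:sequence>
    <!-- attributes belong to the complex type, not to the model group -->
    <xs:attribute name="id" type="xs:string" />
  </xs:complexType>

  <xs:element name="author" type="ex:AuthorType" />

  <!-- an anonymous complex type reusing the same group -->
  <xs:element name="editor">
    <xs:complexType>
      <xs:group ref="ex:nameParts" />
    </xs:complexType>
  </xs:element>
</xs:schema>
```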

Why you should use the built-in simple types

A major benefit of W3C XML Schema over DTDs in XML 1.0 is the existence of data types. The ability to specify that the values of elements or attributes are strings, dates or numeric data enables schema authors to specify and validate the contents of XML data in an interoperable and platform-independent manner. Given the large number of built-in datatypes (about 44), it may be wise for schema authors to standardize on a subset of the built-in types to avoid information overload.

In most cases users can do without the subtypes of xs:string (such as xs:ENTITY or xs:language), the subtypes of xs:integer (such as xs:short or xs:unsignedByte), or the Gregorian date types (such as xs:gMonthDay or xs:gYearMonth). Eliminating these types reduces the number of types to keep track of to a manageable level.

Why you should use complex types

A complex type definition is used for specifying a content model consisting of elements and attributes. An element declaration can specify its content model by referring to a named or anonymous complex type. Named complex types can be referenced by name from the schema they are defined in, or external schema documents, while anonymous complex types must be defined within the declaration for the element that uses the type. Additionally, the content models of named complex types can be extended or restricted using W3C XML Schema inheritance mechanisms.

Complex types are similar to model group definitions with two main differences: complex type definitions can include attributes in the content models they define, and it is possible to use type derivation with complex types, which isn't the case with named model groups. In Kohsuke's article, he advocates using a combination of anonymous complex types, model group definitions, and attribute groups to specify the content model of an element instead of named complex types in an attempt to avoid dealing with the "complexity" of named complex types. However, using three mechanisms instead of one to specify the content model of an element is arguably more prone to confusion. Thus, besides the fact that named complex types allow for reuse of content models, they are also the most straightforward way of specifying the content model of an element.

Anonymous complex types should only be used if references to the type will not be needed outside the element declaration and there is no need for type derivation. It is important to note that it is not possible to derive a new type from an anonymous complex type. In general, schemas that make heavy use of anonymous types are likely to have problems with uniformity and consistency.

Why you should not use notation declarations

Kohsuke's admonition to avoid notation declarations is exactly right. They exist only to provide backwards compatibility with DTDs, except they are not backwards compatible with DTD notations. It is easier to pretend they do not exist.

Why you should use substitution groups carefully

Substitution groups provide a mechanism for XML elements that is similar to subtype polymorphism in programming languages. One or more elements can be marked as being substitutable for a global element (also called the head element), which means that members of this substitution group are interchangeable with the head element in a content model. For example, for an Address substitution group with members USAddress and UKAddress, the generic element Address can be used in the content model, or it can be substituted by a USAddress or a UKAddress. The only requirement is that the members of the substitution group must be of the same type or be in the same type hierarchy as the head element.

The following is an example schema and the instance that it validates:

example.xsd:
 <xs:schema 
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="https://www.example.com"
 xmlns:ex="https://www.example.com"
 elementFormDefault="qualified">

  <xs:element name="book" type="xs:string" />

  <xs:element name="magazine" type="xs:string" substitutionGroup="ex:book" />

 <xs:element name="library">
 <xs:complexType>
  <xs:sequence>
   <xs:element ref="ex:book" maxOccurs="unbounded"/>
  </xs:sequence>
 </xs:complexType>
 </xs:element>


</xs:schema>

example.xml:
<library xmlns="https://www.example.com">
 <magazine>MSDN Magazine</magazine>
 <book>Professional XML Databases</book>
</library>

In the previous example, the content model of the library element permits it to hold one or more book elements. Since magazine elements are in the book substitution group, it is valid for magazine elements to appear in the instance XML where book elements are expected.

Substitution groups make content models more flexible and allow extensibility in directions the schema author may not have anticipated. This flexibility is a double-edged sword because although it allows greater extensibility, it makes it harder to process documents based on such schemas. For instance, in the previous example the code that processes the library element must not only handle its child book elements but magazine elements as well. If the instance document specified additional schemas via the xsi:schemaLocation attribute, then the processing application could have to deal with even more members of the book substitution group as children of the library element.

Further complicating matters is that members of a substitution group can be of a type derived from the substitution group's head. Writing code to properly handle any derived type generically is difficult, especially since there are two opposite notions of derivation: restriction, which restricts the range of values in the content model; and extension, which adds elements or attributes to the content model. Certain attributes on element declarations can be used to give schema authors more control over element substitutions in instance documents and reduce the likelihood of facing unexpected substitutions in XML instance documents. The block attribute is used to specify whether elements whose types use a certain derivation method can substitute for the element in an instance document, while the final attribute is used to specify whether elements whose types use a certain derivation method can declare themselves to be part of the target element's substitution group. The default values of the block and final attributes for all element declarations in a schema can be specified via the blockDefault and finalDefault attributes of the root xs:schema element. By default, all substitutions are allowed without limitation.
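The following sketch shows how block and final might be applied. The PublicationType, MagazineType, and pamphlet names are hypothetical and not part of the earlier example. The book element blocks substitution by elements whose type is derived by extension, and the pamphlet element refuses such elements as members of its substitution group in the first place:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           xmlns:ex="https://www.example.com"
           elementFormDefault="qualified">

  <xs:complexType name="PublicationType">
    <xs:sequence>
      <xs:element name="title" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <!-- derived from PublicationType by extension -->
  <xs:complexType name="MagazineType">
    <xs:complexContent>
      <xs:extension base="ex:PublicationType">
        <xs:sequence>
          <xs:element name="issue" type="xs:positiveInteger" />
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- block="extension": in an instance, an element whose type is derived
       by extension cannot appear where book is expected -->
  <xs:element name="book" type="ex:PublicationType" block="extension" />

  <!-- the declaration itself is legal, but because of the block setting
       on book, magazine cannot stand in for book in instance documents -->
  <xs:element name="magazine" type="ex:MagazineType"
              substitutionGroup="ex:book" />

  <!-- final="extension": an element of type MagazineType could not even
       name pamphlet as its substitutionGroup -->
  <xs:element name="pamphlet" type="ex:PublicationType" final="extension" />

  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="ex:book" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The difference in timing is worth noting: a final violation is reported when the schema is compiled, whereas a block violation only surfaces when an instance document is validated.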

Why you should favor key/keyref/unique over ID/IDREF for identity constraints

DTDs provide a mechanism for specifying that an attribute has a type ID, meaning that its value will be unique within the document and that it matches the Name production in XML 1.0. IDs in XML 1.0 can also be referenced by attributes of type IDREF or IDREFS. For compatibility with DTDs, W3C XML Schema has the xs:ID, xs:IDREF and xs:IDREFS types.

W3C XML Schema identity constraints are used for specifying unique values, keys or references to keys using XPath expressions defined within the scope of an element declaration. Comparing feature for feature, the identity constraint mechanisms offer more than their ID/IDREF counterparts do. For one, there is no limitation on the values or types that can be used as part of an identity constraint, whereas IDs can only be one of a specific range of values (for example, 7 is not a valid ID). A more important benefit of the schema identity constraints over ID/IDREF is that, while the latter have to be unique within the document, the former do not. In other words, the symbol space for unique IDs is the entire document, while for unique keys it is the target scope of the XPath. This is particularly useful if uniqueness is needed in two overlapping value spaces with different scopes in the same XML document. An example of this would be an XML document that contained room numbers and table numbers for a hotel. It is likely that some of the numbers overlap (i.e. there is a room 18 and a table 18), but they should not overlap within either value space.
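The hotel scenario can be sketched as follows. The element and constraint names (hotel, room, table, booking, RoomKey, TableKey, BookingRoomRef) are assumptions made for illustration. Each key is scoped by its own selector, so room numbers and table numbers are checked for uniqueness independently, and a keyref ties bookings to existing rooms:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           xmlns:ex="https://www.example.com"
           elementFormDefault="qualified">

  <xs:element name="hotel">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="room" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="table" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="booking" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="room" type="xs:positiveInteger" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- room numbers must be unique among rooms only -->
    <xs:key name="RoomKey">
      <xs:selector xpath="ex:room" />
      <xs:field xpath="@number" />
    </xs:key>

    <!-- table numbers must be unique among tables only -->
    <xs:key name="TableKey">
      <xs:selector xpath="ex:table" />
      <xs:field xpath="@number" />
    </xs:key>

    <!-- every booking must point at an existing room number -->
    <xs:keyref name="BookingRoomRef" refer="ex:RoomKey">
      <xs:selector xpath="ex:booking" />
      <xs:field xpath="@room" />
    </xs:keyref>
  </xs:element>
</xs:schema>
```

Under these assumptions, the instance below should be valid even though the value 18 appears in both value spaces. The same numbers could not be typed as xs:ID, both because 18 is not a legal ID value and because ID values must be unique across the whole document:

```xml
<hotel xmlns="https://www.example.com">
  <room number="18" />
  <room number="19" />
  <table number="18" />
  <booking room="19" />
</hotel>
```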

Note   The W3C XML Schema family of ID types are not exactly compatible with the DTD ID types. For one, the xs:ID, xs:IDREF and xs:IDREFS types can be applied to both elements and attributes in the W3C XML Schema, although they can only apply to attributes in their DTD equivalents. Secondly, there is no restriction on how many attributes of type xs:ID can appear on an element, although such a restriction exists for ID attributes in the DTD equivalents.

Why you should use chameleon schemas carefully

The target namespace of a schema document identifies the namespace name of the elements and attributes which can be validated against the schema. A schema without a target namespace can typically only validate elements and attributes without a namespace name. However, if a schema without a target namespace is included into a schema with a target namespace, then the target namespace-less schema assumes the target namespace of the including schema. This feature is typically called the chameleon schema design pattern.
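A minimal sketch of the pattern, assuming two hypothetical files named common-types.xsd and address.xsd. The first has no target namespace and defines a reusable simple type:

```xml
<!-- common-types.xsd: no targetNamespace attribute -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="PostalCodeType">
    <xs:restriction base="xs:string">
      <xs:maxLength value="10" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

The second includes it, at which point PostalCodeType is treated as belonging to the including schema's target namespace and is referenced with that namespace's prefix:

```xml
<!-- address.xsd: includes the chameleon schema -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="https://www.example.com"
           xmlns:ex="https://www.example.com"
           elementFormDefault="qualified">

  <xs:include schemaLocation="common-types.xsd" />

  <!-- the included type is now addressed in the target namespace -->
  <xs:element name="postalCode" type="ex:PostalCodeType" />
</xs:schema>
```

Another schema with a different target namespace could include the same common-types.xsd and obtain its own copy of PostalCodeType in that namespace.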

In Kohsuke's article, he claimed that the chameleon schema pattern does not work, which is incorrect. A full rebuttal of Kohsuke's claim was made by Michael Leditschke on XML-DEV which shows that the design pattern does work and is useful for creating a module of type definitions and declarations which can be reused as needed.

There is a problem with combining chameleon schemas with identity constraints. The problem is that although QName references to type definitions and declarations in the chameleon schema are coerced into the namespace of the including schema, the same is not done for XPath expressions used by xs:key, xs:keyref and xs:unique identity constraints. Consider the following schema:

<xs:schema
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 elementFormDefault="qualified">

 <xs:element name="Root">

  <xs:complexType>
    <xs:sequence>
     <xs:element name="person" type="PersonType" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <xs:key name="PersonKey">
   <xs:selector xpath="person"/>
   <xs:field xpath="@name"/>
  </xs:key>

  <xs:keyref name="BestFriendKey" refer="PersonKey">
   <xs:selector xpath="person"/>
   <xs:field xpath="@best-friend"/>
  </xs:keyref>

 </xs:element>

 <xs:complexType name="PersonType">
  <xs:simpleContent>
   <xs:extension base="xs:string">
    <xs:attribute name="best-friend" type="xs:string" />
    <xs:attribute name="name" type="xs:string" />
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>

</xs:schema>

If the previous schema is included into another with a target namespace, then the XPath expressions in both the key and keyref will fail. In this specific example, the person element is in no namespace in the chameleon schema, but once included in another schema, it picks up that schema's target namespace. The XPath expressions, which match a person element in no namespace, will then silently stop matching anything, since processors are not obligated to ensure that path expressions in identity constraints actually return results.

The major point is that it is not advisable to use identity constraints in chameleon schemas.

Why you should not use default or fixed values especially for types of xs:QName

The primary complaint against default and fixed values is that they cause new data to be inserted into the source XML after validation, thus changing the data. This means that a document whose schema specifies default values is incomplete until it has been validated. Tying the actual content of the XML document to the validation process is unwise, since a schema may not always be available, nor is it wise to assume that consumers of the document will always perform validation.

The xs:QName type has additional problems when it comes to validation caused by the fact that it has no canonical form. Consider the following schema and XML instance:

 <xs:schema
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="https://www.example.com"
 xmlns:ex="https://www.example.com"
 xmlns:ex2="ftp://ftp.example.com"
 elementFormDefault="qualified">

 <xs:element name="Root">
  <xs:complexType>
    <xs:sequence>
     <xs:element name="Node" type="xs:QName" default="ex2:FtpSite" />
    </xs:sequence>
  </xs:complexType>
 </xs:element>

</xs:schema>

<Root xmlns="https://www.example.com" 
  xmlns:ex2="smtp://smtp.example.org" 
  xmlns:foo="ftp://ftp.example.com">
 <Node />
</Root>

In the previous scenario, what value should be inserted into the Node element upon validation? Should it be ex2:FtpSite even though the ex2 prefix is mapped to a different namespace in the instance document than in the schema? Maybe it should be "foo:FtpSite," because the "foo" prefix is mapped to the same namespace that ex2 was mapped to in the schema? But then, what would have happened if no XML namespace declaration existed for the ftp://ftp.example.com namespace? Would a namespace declaration have to be inserted into the XML? None of these questions can be answered in a satisfactory manner without violating some opinions as to what the correct behavior should be. It is best to avoid using xs:QName default values because it is unlikely that different implementations agree on the semantics in the previous case.

Why you should use restriction and extension of simple types

As previously mentioned, there are two forms of type derivation in W3C XML Schema: restriction and extension. Restriction of a simple type involves constraining the facets of the type and thus reducing the permitted values of the type. Such restrictions involve specifying a maximum length for a string value, specifying a date range, or enumerating the list of permitted values. Schema authors commonly use types constrained in this manner, and these types account for most uses of type derivation in W3C XML Schema. Such types can be used by both elements and attributes as their type definitions.
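The three kinds of restriction mentioned above can be sketched as follows. The type names ShortName, Year2003Date, and Salutation, and the declarations that use them, are hypothetical and chosen only for illustration:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- a maximum length for a string value -->
  <xs:simpleType name="ShortName">
    <xs:restriction base="xs:string">
      <xs:maxLength value="30" />
    </xs:restriction>
  </xs:simpleType>

  <!-- a date range -->
  <xs:simpleType name="Year2003Date">
    <xs:restriction base="xs:date">
      <xs:minInclusive value="2003-01-01" />
      <xs:maxInclusive value="2003-12-31" />
    </xs:restriction>
  </xs:simpleType>

  <!-- an enumerated list of permitted values -->
  <xs:simpleType name="Salutation">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Mr" />
      <xs:enumeration value="Mrs" />
      <xs:enumeration value="Ms" />
    </xs:restriction>
  </xs:simpleType>

  <!-- restricted simple types can type both elements and attributes -->
  <xs:element name="title" type="Salutation" />
  <xs:element name="nickname" type="ShortName" />
  <xs:attribute name="shipDate" type="Year2003Date" />
</xs:schema>
```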

Extension of simple types allows one to create a complex type (that is, an element content model) with simple content that has attributes. A typical extension scenario is any situation where an element declaration has a simple type as its content and one or more attributes. Since such element content models occur commonly in XML documents, derivation by extension is another feature commonly used by schema authors.
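As a brief sketch of this scenario (the PriceType and currency names are assumptions for illustration), the type below extends xs:decimal with a single attribute, so that an element such as price can carry a number as its content and a currency code as an attribute:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- simple content (a decimal) extended with one attribute -->
  <xs:complexType name="PriceType">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string" use="required" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:element name="price" type="PriceType" />
</xs:schema>
```

With this schema, an instance such as the following should validate:

```xml
<price currency="USD">19.95</price>
```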

It should be noted that just like with complex types, there are named and anonymous simple types. Named simple types can be referenced by name from the schema they are defined in, or external schema documents, while anonymous simple types must be defined within the declaration for the element or attribute which uses the type. Also, as with complex types, type derivation can only be performed on named types.

A common misconception is that anonymous types with the same structure are the same type:

<!-- fragment A -->

<xs:element name="quantity">
 <xs:simpleType>
  <xs:restriction base="xs:positiveInteger">
   <xs:maxExclusive value="100"/>
  </xs:restriction>
 </xs:simpleType>
</xs:element>

<xs:element name="size">
 <xs:simpleType>
  <xs:restriction base="xs:positiveInteger">
   <xs:maxExclusive value="100"/>
  </xs:restriction>
 </xs:simpleType>
</xs:element>

is equivalent to:

<!-- fragment B -->

<xs:simpleType name="underHundred">
 <xs:restriction base="xs:positiveInteger">
  <xs:maxExclusive value="100"/>
 </xs:restriction>
</xs:simpleType>

<xs:element name="size" type="underHundred"/> 

<xs:element name="quantity" type="underHundred"/>

In short, it is incorrect to assume that both element declarations in fragment A have the same type. Even though two types can have the same structure, they are not considered identical unless they are named, have the same name, and are from the same target namespace. Various aspects of W3C XML Schema may require element declarations to have the same type, such as substitution groups, specifying key/keyref pairs, and type derivation. For instance, a keyref must be of the same type as a key. Most features of W3C XML Schema would consider the element declarations in fragment A to have different types, and those in fragment B to have the same type.
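One practical consequence is that only the named type from fragment B can serve as the base of a further derivation. The following sketch derives a narrower type from underHundred; the name underTen is hypothetical.

```xml
<!-- Illustrative only: "underTen" is a hypothetical name. -->
<xs:simpleType name="underTen">
  <!-- possible only because underHundred is a named type -->
  <xs:restriction base="underHundred">
    <xs:maxExclusive value="10"/>
  </xs:restriction>
</xs:simpleType>
```

No equivalent derivation can be written against the anonymous types in fragment A, because there is no name to place in the base attribute.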

Why you should use extension of complex types

Extension of a complex type involves adding extra attributes or elements to the content model in the derived type. Elements added via extension are treated as being appended to the content model of the base type in sequence. This technique is useful for extracting the common aspects of a set of complex types and then reusing these commonalities by extending the base type definition. The following schema fragment, taken from the discussion of complex type extension in the W3C XML Schema Primer, shows how extensions enable the reuse of the common aspects of a mailing address.

<xs:complexType name="Address">
  <xs:sequence>
   <xs:element name="name"   type="xs:string"/>
   <xs:element name="street" type="xs:string"/>
   <xs:element name="city"   type="xs:string"/>
  </xs:sequence>
 </xs:complexType>

 <xs:complexType name="USAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <xs:element name="state" type="USState"/>
     <xs:element name="zip"   type="xs:positiveInteger"/>
    </xs:sequence>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

 <xs:complexType name="UKAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <xs:element name="postcode" type="UKPostcode"/>
    </xs:sequence>
    <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

In the previous schema, the Address type defines the information common to addresses in general, while its derived types add information specific to addresses from the United States and United Kingdom, respectively. The ability to reuse and build upon content models using extension is a powerful and useful feature of W3C XML Schema that promotes modularity and uniformity of like content.
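To show how such derived types can surface in an instance document, suppose the schema also declared a hypothetical element shipTo of type Address (this declaration is not part of the fragment above), and that the USState type accepts the value CA. Under those assumptions, an instance could select the derived type with xsi:type, roughly as follows. The address values are placeholders.

```xml
<!-- Assumes a hypothetical declaration:
     <xs:element name="shipTo" type="Address"/> -->
<shipTo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:type="USAddress">
  <!-- elements inherited from Address come first -->
  <name>Alice Smith</name>
  <street>123 Maple Street</street>
  <city>Mill Valley</city>
  <!-- elements added by the extension are appended in sequence -->
  <state>CA</state>
  <zip>90952</zip>
</shipTo>
```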

There is a caveat for type-aware processors that deal with types derived by extension, and it concerns the elements added to a content model via extension. In the future, it is possible that type-aware languages like XQuery or XSLT 2.0 will be able to process XML elements and attributes polymorphically. For instance, an application can decide to process all elements of type Address or that have Address as their base type, choosing to process the information that is common to all types. However, a query such as the following:

//*[. instance of Address]/city 

could return unexpected results when dealing with a derived type that extends the content model in the following way:

 <xs:complexType name="BadAddress">
  <xs:complexContent>
   <xs:extension base="Address">
    <xs:sequence>
     <!-- address format has two city entries, one for neighborhood and
            another for the actual city -->
     <xs:element name="city" type="xs:string"/>
     <xs:element name="state" type="xs:string"/>
     <xs:element name="country" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

Although the previous example is contrived and the scenario seems unlikely, it demonstrates a real risk. A more detailed exposition on this potential problem has been documented by Paul Prescod on XML-Dev.

Why you should use restriction of complex types carefully

Restriction of a complex type involves creating a derived complex type whose content model is a subset of the content model of the base type.

First, some words of warning. The parts of the W3C XML Schema recommendation that describe derivation by restriction in complex types (Section 3.4.6 and Section 3.9.6) are widely considered the most complex part of a document that is already difficult to understand. Most implementation bugs cluster around support for this feature, and it is quite common to see implementers express exasperation when discussing the various nuances of derivation by restriction in complex types. Another issue is that derivation by restriction of complex types does not map neatly to concepts in either object-oriented programming or relational database theory, even though object-oriented programs and relational databases are primary sources and consumers of XML data. This is the exact opposite of the situation with derivation by extension of complex types.

Another challenge in using derivation by restriction of complex types arises from the way in which restrictions are declared: when a complex type is derived by restriction from another complex type, the base type's content model must be duplicated and then refined. This duplication repeats definitions, possibly down a long derivation chain, so any modification to an ancestor type must be manually propagated down the derivation tree. Furthermore, such replication cannot cross namespace boundaries: deriving ns2:SlowCar from ns1:Car may not work if ns2:SlowCar has a child element, ns2:MaxSpeed, because it cannot be correctly derived from ns1:Car's child element, ns1:MaxSpeed.

The following schema uses derivation by restriction to restrict a complex type that describes a subscriber to the XML-DEV mailing list to a type that describes Dare Obasanjo. Any element that conforms to the DareObasanjo type can also be validated as an instance of the XML-Deviant type.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- base type -->
 <xs:complexType name="XML-Deviant">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <xs:element name="signature" type="xs:string" nillable="true" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="mailReader" type="xs:string"/>
 </xs:complexType>

 <!-- derived type -->
 <xs:complexType name="DareObasanjo">
  <xs:complexContent>
   <xs:restriction base="XML-Deviant">
    <xs:sequence>
     <xs:element name="numPosts" type="xs:integer" minOccurs="1" />
     <xs:element name="signature" type="xs:string" nillable="false" />
    </xs:sequence>
    <xs:attribute name="firstSubscribed" type="xs:date" use="required" />
    <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
   </xs:restriction>
  </xs:complexContent>
 </xs:complexType>

</xs:schema>

Derivation by restriction of complex types is a multifaceted feature that is useful in situations where secondary types need to conform to a generic primary type, yet add their own constraints that go beyond those of the primary type. However, its extreme complexity requires that only those who have a thorough understanding of the W3C XML Schema recommendation use it.
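As a rough illustration of what the restricted type accepts, assume the schema above also declared a hypothetical element subscriber of type DareObasanjo (no such declaration appears in the schema as given). An instance along the following lines would then satisfy both the derived type and the XML-Deviant base type. The attribute and element values are placeholders.

```xml
<!-- Assumes a hypothetical declaration:
     <xs:element name="subscriber" type="DareObasanjo"/> -->
<subscriber firstSubscribed="2001-01-15" mailReader="Microsoft Outlook">
  <!-- numPosts is now mandatory (minOccurs="1" in the derived type) -->
  <numPosts>10</numPosts>
  <!-- signature may no longer be nilled (nillable="false") -->
  <signature>placeholder signature text</signature>
</subscriber>
```

Omitting numPosts or firstSubscribed, or using a different mailReader value, would still be valid against XML-Deviant but not against DareObasanjo.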

Why you should use abstract types carefully

Borrowing a concept from object-oriented programming languages like C# and Java, both element declarations and complex type definitions can be made abstract. An abstract element declaration means that the element declaration cannot be used to validate an element in an XML instance document and can only appear in content models via substitution. Similarly, an abstract complex type definition cannot be used to validate an element in an XML instance document, but can be used as the abstract parent of an element's derived type, or in cases where the element's type is overridden in the instance using xsi:type.

Abstract complex types and element declarations are useful for creating generic base types that contain information common to a set of types (such as Shape versus Circle or Square) yet whose definitions are not deemed "complete" until further derivation (extension or restriction) has been applied. While this feature is not complicated to use, some implications of its use are subtle and complex, and thus abstract types should be used with care.
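The Shape and Circle case mentioned above could be sketched as follows. The element names (shape, label, radius) are hypothetical, and the fragment assumes a schema with no target namespace.

```xml
<!-- Illustrative only: all names below are hypothetical. -->

<!-- generic base type; cannot itself be used to validate an element -->
<xs:complexType name="Shape" abstract="true">
  <xs:sequence>
    <xs:element name="label" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

<!-- concrete type that completes the definition by extension -->
<xs:complexType name="Circle">
  <xs:complexContent>
    <xs:extension base="Shape">
      <xs:sequence>
        <xs:element name="radius" type="xs:decimal"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

<xs:element name="shape" type="Shape"/>

<!-- An instance must name a concrete derived type using xsi:type, e.g.
     <shape xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:type="Circle"><label>wheel</label><radius>2.5</radius></shape> -->
```

A shape element without xsi:type would be rejected, because its declared type is abstract.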

Do use wildcards to provide well-defined points of extensibility

W3C XML Schema provides the wildcards xs:any and xs:anyAttribute, which can be used to allow elements and attributes from specified namespaces to occur in a content model. Wildcards allow schema authors to enable extensibility of the content model while maintaining a degree of control over the occurrence of elements and attributes. A good discussion of the benefits of using wildcards is available in W3C XML Schema Design Patterns: Dealing with Change.

Cautious schema authors concerned with the problems posed by type derivation may choose to block attempts at type derivation using the final attribute on complex type definitions and element declarations (similar to sealed in C# and final in Java), then allow extensibility at specific parts of the content model by using wildcards. This gives schema authors more control over the content models they define and may reduce some of the problems with various aspects of complex type derivation (specifically, derivation by extension).
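One possible shape for this approach is sketched below, reusing the Address type from the earlier discussion of extension. The choice of namespace="##other" and processContents="lax" is an assumption made for illustration, not a recommendation from the original text.

```xml
<!-- final="#all" blocks derivation by both extension and restriction -->
<xs:complexType name="Address" final="#all">
  <xs:sequence>
    <xs:element name="name"   type="xs:string"/>
    <xs:element name="street" type="xs:string"/>
    <xs:element name="city"   type="xs:string"/>
    <!-- the single, well-defined extension point: any number of
         elements from namespaces other than the target namespace -->
    <xs:any namespace="##other" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <!-- attributes from other namespaces are likewise permitted -->
  <xs:anyAttribute namespace="##other" processContents="lax"/>
</xs:complexType>
```

Because the wildcard sits at the end of the sequence and excludes the schema's own namespace, it does not compete with the declared elements.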

Note that wildcards, if used improperly, can introduce non-determinism that violates the Unique Particle Attribution rule. The following schema causes such a problem because:

<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
 targetNamespace="https://www.example.com/fruit/"
 elementFormDefault="qualified">

<xs:complexType name="myKitchen">
        <xs:choice maxOccurs="unbounded">
              <xs:any processContents="skip" />
              <xs:element name="apple" type="xs:string"/>
              <xs:element name="cherry" type="xs:string"/>            
        </xs:choice>
</xs:complexType>

</xs:schema>

...the content model of the myKitchen type allows one or more apple, cherry, or any other elements. However, during validation, if an <apple> element is seen, the validator cannot tell whether it should be validated against the wildcard or against the apple element declaration, which leads to ambiguity.
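One way to remove the ambiguity, sketched here as an assumption rather than as the only fix, is to restrict the wildcard to namespaces other than the target namespace. Because the schema sets elementFormDefault to "qualified", apple and cherry belong to the target namespace and can no longer match the wildcard.

```xml
<xs:complexType name="myKitchen">
  <xs:choice maxOccurs="unbounded">
    <!-- ##other excludes the target namespace, so an <apple> or <cherry>
         element can only match its own declaration -->
    <xs:any namespace="##other" processContents="skip" />
    <xs:element name="apple" type="xs:string"/>
    <xs:element name="cherry" type="xs:string"/>
  </xs:choice>
</xs:complexType>
```

The trade-off is that undeclared elements from the target namespace are no longer accepted by the wildcard.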

There are subtle but potentially profound ramifications to the selection of both the namespace attribute and the processContents attribute. Overly restrictive values can impede extensibility, while overly open values can expose the schema to misuse. Controlling the supported namespaces for a wildcard can also be bewildering, especially when the set of allowable namespaces is subject to change.

Do not use group or type redefinition

Redefinition is a feature of W3C XML Schema that allows one to change the meaning of an included type or group definition. Using xs:redefine, schema authors can include type or group definitions from schema documents and then alter these definitions in a pervasive manner. Redefinition is pervasive because it affects type or group definitions not only in the including schema but also in the included schema. Thus, all references to the original type or group in both schemas refer to the redefined type, while the original definition is overshadowed. This leads to the problems pointed out in W3C XML Schema Design Patterns: Dealing with Change, where it is stated:

This causes a certain degree of fragility because redefined types can adversely interact with derived types and generate conflicts. A common conflict is when a derived type uses extension to add an element or attribute to a type's content model, and a redefinition also adds a similarly named element or attribute to the content model.

A major problem with type redefinition is that, unlike type derivation, it cannot be prevented with the use of the block or final attributes. Any schema can therefore have its types redefined in a pervasive manner, altering their semantics completely. It is advisable to avoid this feature due to the potential conflicts it can cause.
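For readers who need to recognize the feature in existing schemas, a redefinition looks roughly like the following sketch. The file name address.xsd is hypothetical and is assumed to contain the Address type and its derived types from the earlier section, with no target namespace.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- address.xsd is a hypothetical file assumed to define Address -->
  <xs:redefine schemaLocation="address.xsd">
    <!-- the redefined type must derive from the type it replaces -->
    <xs:complexType name="Address">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <xs:element name="country" type="xs:string"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>

</xs:schema>
```

Every type in address.xsd that derives from Address now inherits the added country element, which is exactly where a clash arises if a derived type already adds an element of that name.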

Many schema authors attempt to use type redefinition to increase the value space of an enumeration, but this does not work. The only way to increase the number of values accepted by an enumeration used as a base type is to create a union. However, those additional values are available only to uses of the resulting union type, not to uses of the original base type. Also note that chained redefinitions (redefining a redefine) can be problematic, resulting in unexpected definition clashes.
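The union approach can be sketched as follows. All three type names (Fruit, TropicalFruit, AnyFruit) and their enumeration values are invented for illustration.

```xml
<!-- Illustrative only: type names and values are hypothetical. -->

<!-- the original enumeration -->
<xs:simpleType name="Fruit">
  <xs:restriction base="xs:string">
    <xs:enumeration value="apple"/>
    <xs:enumeration value="cherry"/>
  </xs:restriction>
</xs:simpleType>

<!-- the additional values, defined as a separate enumeration -->
<xs:simpleType name="TropicalFruit">
  <xs:restriction base="xs:string">
    <xs:enumeration value="mango"/>
    <xs:enumeration value="papaya"/>
  </xs:restriction>
</xs:simpleType>

<!-- the union accepts any value of either member type -->
<xs:simpleType name="AnyFruit">
  <xs:union memberTypes="Fruit TropicalFruit"/>
</xs:simpleType>
```

Elements or attributes declared with type Fruit still accept only apple and cherry; only those declared with type AnyFruit accept the wider set.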

Conclusion

The W3C XML Schema recommendation is a complex specification because it attempts to solve complex problems. One can reduce this burden of complexity by utilizing the simpler aspects of the schema language and forgoing the rest. Either way, schema authors should ensure that their schemas validate in multiple schema processors to avoid the all too common problem of "it works on my machine with validator X but not on the receiving end using validator Y," due to differences in the way various implementations handle the language. Remember, schemas are an important facilitator of interoperability and it would be folly for a schema author to depend on the nuances of a specific implementation and inadvertently give up this interoperability.

Acknowledgments

Thank you to Priya Lakshminarayanan and Mark Feblowitz for their help with this article.