XML Indexing Part 1: XML IDs, XSLT Keys and IndexingXPathNavigator

Article
09/04/2007

Oleg Tkachenko
Multiconn Technologies

March 2005

Applies to:
XML indexing
Microsoft .NET Framework

Summary: This two-part article discusses various aspects of indexing XML data. The first part examines indexing of standalone XML documents, a native XML concept of IDs, XSLT keys, and introduces the IndexingXPathNavigator class, which enables indexing for any IXPathNavigable in-memory XML store, such as XmlDocument, XpathDocument, or XmlDataDocument classes. The next part will be focused on new XML indexing features introduced in the Microsoft SQL Server 2005. (20 printed pages)

Download the article's sample code.

Contents

Introduction
XML IDs
XSLT Keys
Indexing of In-Memory XML Stores
Using Keys with XSLT and IndexingXPathNavigator
Conclusion

Introduction

Anybody working with data knows how important proper data indexing is. When used wisely, database indexing can provide an amazing performance boost for data querying, all for a price of building and maintaining indexes.

While XML was born to serve the needs of the document-centric world, it appeared to be so successful that nowadays XML is used everywhere, particularly in data-centric applications. And as XML conquers the data world, the importance of effective XML querying is growing rapidly. Databases have spoken XML for years now. Furthermore, major databases such as Microsoft SQL Server adopt XML as a native database type and hence implement smart facilities for the effective storing and querying of XML data.

XML, with its unique ability to express semi structured content, schema optionality, and implicit parent-child and siblings relationships, doesn't really fit into a tabular relational data model. So it shouldn't be a surprise that indexing XML and indexing a database are completely different.

In this two-part article I'm going to discuss various XML indexing aspects. This first part covers the indexing of stand-alone XML documents—native XML concept of IDs and XSLT keys—and introduces the IndexingXPathNavigator class, which enables indexing for any IXPathNavigable in-memory XML store, such as XmlDocument, XpathDocument, or XmlDataDocument classes. In the next part I'll focus on new XML indexing features introduced in the SQL Server 2005.

What's an index? Well, I bet you know what it is. Just take the "Writing Secure Code" book from your book shelf (I'm sure it's there) and skim page by page looking for the "buffer overrun" term. How long would it take? Hours, or even days. Now, how long would it take when using the book's index instead? Seconds! And you even don't have to search through the whole index; just skim alphabetically sorted lines in the "B" section. Nice performance boost! But of course building a book index doesn't take seconds. It takes weeks of work. And the index itself increases the book's volume, too. But now any reader can search in the book in seconds.

That's a good example of how effective indexing can be, and of the unavoidable trade-offs involved.

So, basically, an index is a set of pointers, which makes it easy to find data. And XML indexing is any facility that allows the finding of data in an XML document without full traversal of the XML tree. Let's start with native XML indexing facilities.

XML IDs

The most basic and rudimentary form of XML indexing is the concept of XML unique identifiers (IDs). As per XML specifications, any element in an XML document can be assigned a unique identifier with an ID typed attribute (an attribute, whose type is defined in DTD as ID):

<!DOCTYPE orders [
  <!ELEMENT order EMPTY>
  <!ATTLIST order
    orderno ID #REQUIRED
    date CDATA #REQUIRED
    shipAddress CDATA #REQUIRED
  >
  <!ELEMENT orders (order*)>
]>
<orders>
  <order orderno="id10952" date="4/15/96" shipAddress="Obere Str. 57"/>
  <order orderno="id10535" date="6/13/95" shipAddress="Mataderos  2312"/>   
</orders>

Here you can see that every order element has an ID assigned by the orderno attribute value. Conceptually it's assumed that an XML parser would leverage this information to build an index table consisting of order elements and their ID values.

Figure 1. Conceptual index table for IDs in an XML document

This index table then can be accessible through various APIs such as the XmlDocument.getElementbyId() and XPathNavigator.MoveToId() methods, or the XPath id() function:

XmlDocument doc = new XmlDocument();
doc.Load("d:/orders.xml");
//Fast select by ID using DOM API
XmlElement order = doc.GetElementById("id10535");
if (order != null)
  Console.WriteLine(order.GetAttribute("shipAddress"));
//Fast select by ID using XPathNavigator API
XPathNavigator nav = doc.CreateNavigator();
if (nav.MoveToId("id10535"))
  Console.WriteLine(nav.GetAttribute("shipAddress", ""));
//Fast select by ID using XPath
XPathNodeIterator ni = nav.Select("id('id10535')");
if (ni.MoveNext())
  Console.WriteLine(ni.Current.GetAttribute("shipAddress", ""))

Sounds good, but the following limitations apply:

DTD is absolutely required to make it work. No DTD—no IDs. If you want an element to have an ID, you have to define the element and its ID typed attribute in DTD. W3C XML Schema also supports the concept of IDs, but unfortunately due to historical reasons (XML Schema 1.0 Recommendation was published too late) W3C DOM2 (XmlDocument class) and W3C XPath 1.0 (XPathNavigator class) don't support schema-defined IDs. It's going to be fixed in the future, and current working drafts of the W3C XPath 2.0 and XQuery 1.0 do support DTD-defined IDs, schema-defined IDs, and even IDs that are defined by the soon to be approved xml:id standard predefined attribute, which would allow you to define IDs without any DTD or XML schema. Currently you can't work with schema-defined IDs when using XmlDocument or XPathNavigator classes, however. There is a hack to make it work, but it's just a hack.
An XML ID can be specified through an attribute value only. That means if you want to index <customer> elements in your XML document by the value of the child <name> element, you are out of luck with XML IDs. XML Schema allows such IDs, but again, schema-defined IDs aren't supported yet.
Moreover, ID value must conform to the XML Name grammar rule: no spaces, no slashes, starts with a letter, underscore or colon only. That means neither 12345 nor Maria Anders are valid XML ID values. The usual workaround is, of course, to make that data follow XML Name syntax—all those weird looking id12345 and Maria_Anders forms.
Each ID must obviously be unique across the whole XML document. That sounds quite natural, but when using IDs for just indexing and effective querying purposes, it's usually useful to have non-unique index values, for example when indexing by OrderDate value.
Each element can have at most one ID typed attribute, that is, a single ID only, so you cannot index orders both by OrderID and OrderDate values with XML IDs.
There is only one set of IDs for the whole XML document, so you can't separately index orders and parts with XML IDs.

Note The good thing is that you don't have to define complete a content model for your document in DTD—for the sake of defining IDs you only need to sketchily declare an element and its ID attribute. For example, here is a minimal form of the DTD for the document above:

<?xml version="1.0"?>
<!DOCTYPE orders [
  <!ELEMENT order ANY>  
  <!ATTLIST order
    orderno ID #REQUIRED   
  >  
]>
<orders>
  <order orderno="id10952" date="4/15/96" shipAddress="Obere Str. 57"/>
  <order orderno="id10535" date="6/13/95" shipAddress="Mataderos  2312"/>
</orders>

Note Many believe that an attribute named id or ID by itself has some special identification meaning in XML. That's not true. An attribute name on its own isn't important at all; what makes an attribute ID typed is obviously only an appropriate declaration in DTD, while its name can be any, such as ssn, id, OrderID, and so on.

Forthcoming new versions of W3C Recommendations should rectify points 1 and 2, but the rest of the limitations are basically inherit features of XML IDs and are not going to be changed. That's all we can get from XML IDs for the sake of indexing. Other facilities such as XSLT keys provide much more.

XSLT Keys

XSLT provides an extremely powerful and flexible indexing facility—XSLT keys, which generalize XML IDs and alleviate all previously mentioned limitations.

The idea behind XSLT keys is simple. They allow the stylesheet designer to identify "keyed" nodes in source XML document—nodes, which should be assigned a key. As well as with XML IDs, it's assumed that the XSLT processor will build some sort of index table (obviously with a one-time cost of building the index for each XML document) to allow fast access to such nodes not by traversing the XML tree, but directly through the index table.

Figure 2. Conceptual index table for a named key

Conceptually, a key is a triple, consisting of:

The key name.
The node to which the key is assigned.
The key value (arbitrary string).

Keys are declared using the <xsl:key> element:

<xsl:key name="key-name" match="pattern" use="expression"/>

All three attributes are required and the <xsl:key> element is always empty.

The semantics behind those attributes is:

name—defines the key name. The name should follow the XML QName (optionally prefixed XML Name) grammar rule.

match—defines which nodes should be assigned the key. It contains an XSLT Pattern used to match nodes.

use—any XPath expression used to define the value of the key for all matched nodes. This expression is evaluated once for each node matching the pattern. The key value is always a string, so the evaluation result is always converted to a string with one exception: if the result is a nodeset (for example, when use="keywords/keyword"), each node contributes separate key value—yes, each node can be assigned any number of key values and then retrieved by any of them.

key() Function

Given key declaration, you can efficiently query nodes by key value using the key() function. This function basically does for keys what the id() function does for XML IDs. The key() function takes two arguments:

key(key-name, key-value)

where key-name is a name of a key declared with <xsl:key> element(s) and key-value specifies the required value of the key.

The function returns all nodes to the same XML document as the context node, which has a value that is specified by key-name equal to the key-value (which is converted to a string if necessary).

Note, that if key-value is a nodeset, the function treats each node in the nodeset as a separate key value and returns a union of nodes, selected by each one.

XSLT keys have some great advantages over XML IDs:

Keys are declared in an XSLT stylesheet and not in an XML document. There is no need to modify an XML document to change the way it's indexed, and so each XML document can be indexed in any way—it's the stylesheet designer's freedom to declare keys that fit the transformation logic.
Keys allow indexing of any kind of nodes (except for namespace nodes)—not only elements.
Key value may be defined not only through an attribute, but also through a child element, through content, through any other node in any XML document, or it can even be generated on the fly (for example, with a generate-id() function). The only restriction (alleviated in XSLT 2.0) is that you cannot use variable references in both match and use attributes.
Key value is an arbitrary string, whether unique or not, so no limitations apply. For example, 12345 or Maria Anders are both valid key values.
Keys are named and so each key provides a separate independent index space. That means you can index orders and products and distinguish between them, limiting search scope and retrieving only nodes you are interested in.
Keys can greatly simplify stylesheet code by eliminating complex path expressions.
The possibilities are overwhelming and hardly limited. A node can be assigned any number of keys (the same node, same key name, different key values). A key value can refer to any number of nodes (the same key name, same key value, different nodes). And there may even be more than one key declaration with the same name.
Keys can be used to index nodes in any XML document, not only a principal source document.

There is an important difference between database indexing and indexing with XSLT keys. The XSLT processor builds indexes each time the transformation is performed—either at the beginning or at the first access, but for each transformation over again.

Conceptually, building an index involves traversing the whole XML document, matching every node against patterns in key definitions, and evaluating key values for any matched node. With a big enough XML document and complex key definitions, that might be quite a job to build such an index.

That's a cost you should consider. As always, there is a trade-off here between fast nodes querying and CPU cycles/memory needed to build an index. In short, using keys isn't always the best solution and it doesn't provide any immediate performance gains by itself—it depends, so always measure.

There used to be some problems (slow performance in particular) with XSLT keys when transforming big XML documents in Microsoft .NET. Microsoft rectified those issues in 2004 with the latest versions of the .NET Framework service packs. Also take a look at Knowledge Base article 324478 for additional info.

Indexing of In-Memory XML Stores

So XSLT keys are great and powerful facility, but what if you don't use XSLT in your application? Many applications store/edit/query XML in memory using XML stores provided by .NET—XmlDocument, XPathDocument, and XmlDataDocument classes. How would you index them?

Searching for a particular <order> element in an XmlDocument by an XPath expression like //order[@OrderID='12345'] might take quite a while because at worst case such search requires XPath engine to check every <order> element in the whole document. And what if you need repetitive searches in multiple threads simultaneously? That can be deadly slow and it won't scale. Such querying is a standard use case for indexing, but unfortunately .NET doesn't provide any means for dealing with this. That's basically left to extensions.

One way to implement such extensions was shown by Mark Fussel in a blog post called Indexing XML, XML ids and a better GetElementByID method on the XmlDocument class. It's a way to extend the XmlDocument class so that it recognizes any attribute named id as the XML ID (with no DTD or XML Schema involved).

Another, more generic and powerful approach I'm going to talk about is the IndexingXPathNavigator class, which enables indexing for any IXPathNavigable XML store by extending XPathNavigator with XSLT-like keys functionality.

IndexingXPathNavigator

The IndexingXPathNavigator class is a part of the Common module of the recently released freeware Mvp.Xml library, developed by Microsoft MVPs in XML and related technologies to supplement .NET functionality provided by the System.Xml and related namespaces. The IndexingXPathNavigator class can be found in the Mvp.Xml.Common.XPath namespace.

What it adds to the XPathNavigator is full-fledged XSLT keys functionality—it allows the declaring of keys, lazy or eager indexing, and querying nodes through the key() function—all with no XSLT involved, but following familiar and proved XSLT keys semantics.

IndexingXPathNavigator works on top of another XPathNavigator, augmenting it with the following indexing functionality:

The AddKey() method, which substitutes <xsl:key> declaration.
The BuildIndexes() method, which runs the indexing process. You can call it yourself (at initialization time, for example); otherwise, it will be called upon the first call to the key() function.
Overridden versions of the XPathNavigator.Compile() and XPathNavigator.Select() methods augment the XPath function library with key() function implementation.

As well as in XSLT, the first step is to declare keys according to which indexing should occur. This is done with the AddKey(string keyname, string match, string use) method, all three arguments of which literally follow the <xsl:key> element:

name—the name of the key,

match—an XSLT pattern defining the nodes this key should be assigned to, and

use—an XPath expression used to determine the value(s) of the key for each matching node.

For instance, the following code snippet shows how to create a key for <order> elements:

XmlDocument doc = new XmlDocument();
doc.Load("d:/orders.xml");
IndexingXPathNavigator inav = new 
    IndexingXPathNavigator(doc.CreateNavigator());
inav.AddKey("orderKey", "order", "@OrderID");

Obviously you can define as many keys as you need, including keys with the same name.

After all keys are declared, IndexingXPathNavigator is ready for indexing. The indexing process is performed as follows: each node in the XML document is matched against all key definitions. For each matching node, key value is evaluated and this [key value, node] pair is added to a Hashtable, associated with the appropriate named key.

As you can see, indexing isn't a cheap operation; it involves walking through the whole XML tree, multiple node matching, and XPath expression evaluations. That's the usual indexing trade-off.

The good thing is that, unlike the XSLT keys processing model, where indexes are scoped for a single transformation, you can build indexes with IndexingXPathNavigator only once and then query it endlessly. Or you can just leave indexing up to IndexingXPathNavigator and then indexing will be done (again, only once) the first time the key() function is called.

Note IndexingXPathNavigator currently doesn't support any index update strategy—when your underlying XML document is changed, you have to rebuild indexes calling the BuildIndexes() method.

And for fast node querying, IndexingXPathNavigator augments the XPath function library with an implementation of the standard XSLT key() function, which retrieves nodes directly from built indexes by the key value, in wonderful O(1) running time, obviously. The function is implemented fully as per XSLT specifications.

To show the performance and scalability improvements IndexingXPathNavigator can provide, I created a simple C# application, which selects 1000 randomly pre-selected orders out of XML document orders.xml, which is just XML version of the Orders table from the Northwind database. You can find the application in the article's download.

The orders.xml document has quite a flat structure and contains 415 orders:

<root>

<orders OrderID="10249" CustomerID="TOMSP" 
    EmployeeID="6"  OrderDate="1996-07-05T00:00:00" 
    RequiredDate="1996-08-16T00:00:00" ShippedDate="1996-07-10T00:00:00" 
    ShipVia="1" Freight="11.61" ShipName="Toms Spezialitäten" 
    ShipAddress="Luisenstr. 48" ShipCity="Münster" 
    ShipPostalCode="44087" ShipCountry="Germany" />
<!-- 414 more orders skipped -->
</root>

The sample application first counts orders, then selects 1000 random ones and populates an array of random order IDs. Then it measures the time required to select 1000 orders by these random order IDs using regular XPath selection from an XmlDocument, XpathDocument, and using IndexingXPathNavigator over XmlDocument and IndexingXPathNavigator over XPathDocument. Finally, to reveal scalability issues, I run the same application on XML documents containing 830 (x2), 1660 (x4), 3320 (x8) and 6640 (x16) orders.

Here is the application code:

using System;
using System.Xml;
using System.Xml.XPath;
using Mvp.Xml.Common.XPath;

namespace XML_Indexing_1
{
  /// <summary>
  /// Testing IndexingXPathNavigator speed.
  /// </summary>
  public class IndexingXPathNavigatorTest
  {
    public static int Main(string[] args) 
    {            
      if (args.Length < 1) 
      {
        Console.WriteLine("Source XML file name is required.");
        return -1;
      }
      //Load document
      XmlDocument xmlDoc = new XmlDocument();
      xmlDoc.Load(args[0]);   
      XPathDocument xpathDoc = new XPathDocument(args[0]);            

      //Count orders
      XPathNavigator nav = xpathDoc.CreateNavigator();
      int orderCount = (int)(double)nav.Evaluate("count(/root/orders)");
      Console.WriteLine("Number of orders in the document: {0}", orderCount);
      //Select 1000 random orders
      string[] randomOrderIDs = new string[1000];
      Random random = new Random();            
      for (int i=0; i<randomOrderIDs.Length; i++) 
      {
        int randomIndex = random.Next(1, orderCount);
        //Select this order's OrderID
        XPathNodeIterator order = nav.Select("/root/orders[" + randomIndex + "]/@OrderID");
        order.MoveNext();                
        randomOrderIDs[i] = order.Current.Value;
      }

      //Select from XmlDocument            
      SelectFromXmlDocument(randomOrderIDs, xmlDoc);
      //Select from XPathDocument                        
      SelectFromXPathDocument(randomOrderIDs, xpathDoc);
      //Select from XmlDocument using IndexingXPathNavigator            
      SelectFromIndexingXPathNavigator(randomOrderIDs, xmlDoc.CreateNavigator());
      //Select using IndexingXPathNavigator            
      SelectFromIndexingXPathNavigator(randomOrderIDs, xpathDoc.CreateNavigator());
      return 0;
    }

    /// <summary>
    /// Selects random orders from an XmlDocument.
    /// </summary>        
    public static void SelectFromXmlDocument(string[] randomOrderIDs, XmlDocument doc) 
    {
      PerfCounter perfCounter = new PerfCounter();
      perfCounter.Start();            
      for (int i=0; i<randomOrderIDs.Length; i++) 
      {                
        XmlElement order = (XmlElement)doc.SelectSingleNode("/root/orders[@OrderID='" + 
          randomOrderIDs[i] + "']");
        //Check if the order was selected
        if (order == null) 
          throw new InvalidOperationException("Order not found.");
      }
      perfCounter.Stop();
      Console.WriteLine("Selection from XmlDocument took {0:f6} ms in average", 
        perfCounter.Duration(randomOrderIDs.Length)/1000000);
    }

    /// <summary>
    /// Selects random order from an XPathDocument.
    /// </summary>        
    public static void SelectFromXPathDocument(string[] randomOrderIDs, 
      XPathDocument doc) 
    {   
      XPathNavigator nav = doc.CreateNavigator();
      PerfCounter perfCounter = new PerfCounter();
      perfCounter.Start();                        
      for (int i=0; i<randomOrderIDs.Length; i++) 
      {                
        XPathNodeIterator order = nav.Select("/root/orders[@OrderID='" + 
          randomOrderIDs[i] + "']");
        //Check if the order was selected
        if (!order.MoveNext()) 
          throw new InvalidOperationException("Order not found.");
      }
      perfCounter.Stop();
      Console.WriteLine("Selection from XPathDocument took {0:f6} ms in average", 
        perfCounter.Duration(randomOrderIDs.Length)/1000000);
    }

    /// <summary>
    /// Selects random order from an IndexingXPathNavigator.
    /// </summary>        
    public static void SelectFromIndexingXPathNavigator(string[] randomOrderIDs, 
      XPathNavigator nav) 
    {                        
      PerfCounter perfCounter = new PerfCounter();
      perfCounter.Start();                        
      IndexingXPathNavigator inav = new IndexingXPathNavigator(nav);
      inav.AddKey("orderKey", "orders", "@OrderID");            
      inav.BuildIndexes();            
      for (int i=0; i<randomOrderIDs.Length; i++) 
      {                
        XPathNodeIterator order = inav.Select(
          "key('orderKey', '" + randomOrderIDs[i] + "')");
        //Check if the order was selected
        if (!order.MoveNext()) 
          throw new InvalidOperationException("Order not found.");
      }
      perfCounter.Stop();
      Console.WriteLine("Selection from IndexingXPathNavigator took {0:f6} ms in average", 
        perfCounter.Duration(randomOrderIDs.Length)/1000000);
    }
  }
}

Pretty straightforward, as you can see. The PerfCounter class is an auxiliary class measuring timing with nanosecond accuracy, using the Win32 QueryPerformanceCounter and QueryPerformanceFrequency functions as described in How To: Time Managed Code Using QueryPerformanceCounter and QueryPerformanceFrequency.

You can run the test application from the command prompt as follows:

"XML Indexing 1.exe" orders.xml

Additionally you can test it with orders-x2.xml (830 orders), orders-x4.xml (1660 orders), orders-x8.xml (3320 orders), and orders-x16.xml (6640 orders) documents.

Here are the results of running above applications on my pretty archaic PIII-500 Windows 2000 workstation:

Table 1

Query method	Average order selection time (ms)
Query method	415 orders	830 orders	1660 orders	3320 orders	6640 orders
XmlDocument	5.561	10.46	22.469	46.09	87.93
XPathDocument	2.132	4.51	10.37	21.23	41.67
IndexingXPathNavigator over XmlDocument	0.505	0.667	0.70	0.838	1.232
IndexingXPathNavigator over XPathDocument	0.464	0.485	0.577	0.687	1.156

Figure 3. Performance and scalability test results.

Your results might differ in values, but the overall picture should be the same. Regular querying using XmlDocument and XPathDocument takes linear time—as the number of orders in a document doubles, the average order selection time doubles, too. And that's the best scalability XmlDocument and XPathDocument can provide; if the XML document is more complex and hierarchical, the selection time is worse.

However, querying using IndexingXPathNavigator is not only much faster than regular XPath path selection (in orders of magnitude), but, more importantly, its selection time is constant O(1) and doesn't depend on the XML document size. That's easy to explain—while regular XPath selection traverses the XML tree looking for a node (that's why it obviously slows down with bigger XML trees), IndexingXPathNavigator doesn't search the XML tree at all—instead it takes nodes directly from indexes by key name and value. That's why it's so fast and scalable, and that's why querying time doesn't depend on which underlying XML store is used—XmlDocument or XPathNavigator.

When building indexes, though, IndexingXPathNavigator has to traverse the whole tree, matching nodes and evaluating key values. This needs to be done once (or any time the underlying XML is changing). I deliberately included the indexing time (inav.BuildIndexes() call) into the overall IndexingXPathNavigator test timing to show that even with indexing time included it's still amazingly fast. In fact, as you can guess, a very slow, but still existing, average selection time growth on results from IndexingXPathNavigator (from 0.505 ms to 1.232 ms on a 16 times bigger XML document) can be explained by this one-time indexing, which basically takes a time comparable with a single regular XPath expression.

Using Keys with XSLT and IndexingXPathNavigator

Now that I'm done with explaining keys and IndexingXPathNavigator, let's discuss some straightforward and advanced ways to use them.

Using Keys for Grouping

XSLT 1.0 doesn't include any facilities for grouping, and traditionally grouping in XSLT is done using the Muenchian method, invented by Steve Muench. This method heavily relies on keys.

I'm not going to dive into the details of how the Muenchian method works (refer instead to the excellent explanation by Jeni Tennison); instead, I'll only show you a sample, which compares two methods of grouping orders in the orders.xml document by shipping country—the naïve way and the right way. The naïve way would be to perform multiple XML tree traversals—first to locate all orders that have no preceding ones with the same shipping country, and then to select all orders with the same shipping country value for each such order:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:template match="root">
    <table border="1">
      <tr>
        <th>Order ID</th>
        <th>Ship City</th>
      </tr>
      <xsl:for-each select="orders[not(@ShipCountry = preceding-sibling::orders/@ShipCountry)]">
        <tr>
          <th colspan="2">
            <xsl:value-of select="@ShipCountry"/>
          </th>
        </tr>
        <xsl:for-each select="/root/orders[@ShipCountry=current()/@ShipCountry]">
          <tr>
            <td>
              <xsl:value-of select="@OrderID"/>
            </td>
            <td>
              <xsl:value-of select="@ShipCity"/>
            </td>
          </tr>
        </xsl:for-each>
      </xsl:for-each>
    </table>

  </xsl:template>
</xsl:stylesheet>

It works, and takes 3.2 seconds on my machine to group 415 orders (according to the nxslt.exe utility timing). With 830 orders it takes already 26 seconds. I wouldn't even try it with 6640 orders! Here is a faster and more scaleable way, using classical Muenchian grouping with keys:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:key name="countryKey" match="orders" use="@ShipCountry"/>
  <xsl:template match="root">
    <table border="1">
      <tr>
        <th>Order ID</th>
        <th>Ship City</th>
      </tr>
      <xsl:for-each select="orders[generate-id()=generate-id(key('countryKey', @ShipCountry))]">
        <tr>
          <th colspan="2">
            <xsl:value-of select="@ShipCountry"/>
          </th>
        </tr>
        <xsl:for-each select="key('countryKey',@ShipCountry)">
          <tr>
            <td>
              <xsl:value-of select="@OrderID"/>
            </td>
            <td>
              <xsl:value-of select="@ShipCity"/>
            </td>
          </tr>
        </xsl:for-each>
      </xsl:for-each>
    </table>

  </xsl:template>
</xsl:stylesheet>

This one takes 0.15 seconds for 415 orders and 27 seconds for 6640 orders. Much better (but still too slow, because of a poorly optimized implementation of XSLT keys in .NET 1.X. For instance, MSXML does it in just 0.35 seconds. See Knowledge Base article 324478 for more info on how to fix it). As a matter of interest you can avoid that nasty orders[generate-id()=generate-id(key('countryKey', @ShipCountry))\] expression and replace it with much more readable set:distinct(orders/@ShipCountry) when using EXSLT's set:distinct() function, which selects nodes with a unique value. And in fact it works much faster, too:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:set="http://exslt.org/sets">
  <xsl:output indent="yes"/>
  <xsl:key name="countryKey" match="orders" use="@ShipCountry"/>
  <xsl:template match="root">
    <table border="1">
      <tr>
        <th>Order ID</th>
        <th>Ship City</th>
      </tr>
      <xsl:for-each select="set:distinct(orders/@ShipCountry)/..">
        <tr>
          <th colspan="2">
            <xsl:value-of select="@ShipCountry"/>
          </th>
        </tr>
        <xsl:for-each select="key('countryKey',@ShipCountry)">
          <tr>
            <td>
              <xsl:value-of select="@OrderID"/>
            </td>
            <td>
              <xsl:value-of select="@ShipCity"/>
            </td>
          </tr>
        </xsl:for-each>
      </xsl:for-each>
    </table>

  </xsl:template>
</xsl:stylesheet>

This one takes less than a second to group 6640 orders.

See Building Practical Solutions with EXSLT.NET for more insights on using EXSLT in .NET.

Obviously one can apply the same grouping techniques when working with plain XPath and IndexingXPathNavigator.

Multiple Keys

Any node can be assigned as many keys as needed, which means you can locate a node by any of the key values assigned to it. An example could be a list of orders, each of which can have more than one product child:

<orders>
  <order OrderID="10249" CustomerID="TOMSP" >
    <product ProductID="EVHF" price="9.99" quantity="10"/>
    <product ProductID="FOOB" price="5.99" quantity="5"/>
  </order>
  <order OrderID="15432" CustomerID="VICTE" >
    <product ProductID="EVHF" price="9.99" quantity="2"/>
  </order>
</orders>

Then to find any orders containing a given product, you'd need to index orders by child product IDs:

inav.AddKey("ordersKey", "order", "product/@ProductID");

And then you can get both first and second orders by the EVHF key value:

XPathNodeIterator orders =  inav.Select("ordersKey", "EVHF");

Another possibility is to index orders by both OrderID and CustomerID values:

inav.AddKey("ordersKey", "order", "@OrderID");
inav.AddKey("ordersKey", "order", "@CustomerID");

Then you can query an order by either known key value, by OrderID, or by CustomerID.

Note The above key definition is obviously the same as

inav.AddKey("ordersKey", "order", "@OrderID | @CustomerID");

Overloaded Keys

It's also possible to have more than one key definition with the same name. That might sound confusing, but in fact it's quite useful when you need to index different but semantically similar things, for example DVD and books by a title:

inav.AddKey("titleKey", "dvd", "title");
inav.AddKey("titleKey", "book", "bookinfo/title");

Then, given a title, you can query both DVD and books matching that title.

Indexing Multiple XML Documents in XSLT

In XSLT the key() function always selects nodes from the same document the context node belongs to. That allows indexing not only the principal source XML document, but any other document, too! All you need is to declare keys and switch context node before you call the key() function. For instance, here is how you can efficiently join orders and customers that are stored in separate documents:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:key name="customersKey" match="customers" use="@CustomerID"/>
  <xsl:variable name="customers" select="document('customers.xml')"/>
  <xsl:template match="root">
    <table>
      <tr>
        <th>Order ID</th>
        <th>Customer Company Name</th>
      </tr>
      <xsl:for-each select="orders">
        <tr>
          <td>
            <xsl:value-of select="@OrderID"/>
          </td>
          <td>
            <xsl:variable name="customerID" select="@CustomerID"/>
            <xsl:for-each select="$customers">
              <xsl:value-of select="key('customersKey', $customerID)/@CompanyName"/>
            </xsl:for-each>
          </td>
        </tr>
      </xsl:for-each>
    </table>

  </xsl:template>
</xsl:stylesheet>

Compound Keys

What if you need to query orders shipped to a particular country at a particular day? You can index by ShippedDate and by ShipCountry, but what about both? Keys allow doing that by compounding key values into a single one. For example, you can simply concatenate ShippedDate and ShipCountry values together, with some separator in between to get a new unique key value. In that case, only orders with both particular values will be selected. Here is how it can be done in XSLT:

<xsl:key name="ShipDateCountryKey" match="order" use="concat(@ShippedDate, '||', @ShipCountry)"/>
...
<xsl:for-each select="key('ShipDateCountryKey', concat($shipDate, '||', $shipCountry))">
And here is how it can be done with IndexingXPathNavigator:
IndexingXPathNavigator inav = 
    new IndexingXPathNavigator(xmlDoc.CreateNavigator());
inav.AddKey("ShipDateAndCountryKey", "orders", 
    "concat(@ShippedDate, '||', @ShipCountry)");
XPathNodeIterator orders = inav.Select(
    "key('ShipDateAndCountryKey', '1996-07-10T00:00:00||Germany')");
while (orders.MoveNext())
    Console.WriteLine(orders.Current.GetAttribute("OrderID", ""));

Obviously you need to make sure that the separator is safe and will never appear in the ShippedDate value.

Conclusion

The importance of fast XML querying grows quickly, as more and more data-centric applications adopt XML. One simple but effective way to find data quickly is data indexing. In this article I showed several techniques you can use to index stand-alone XML documents in order to get really fast access to indexed nodes. When used wisely, XML indexing can provide an amazing performance boost, measured in orders of magnitude.

And as XML comes to databases, XML indexing becomes a really hot topic. In the next part of the article I'm going to discuss even more useful and advanced XML indexing features introduced in the SQL Server 2005.

Feel free to comment or ask any questions about IndexingXPathNavigator at the mvp-xml-help mail list.