XPath Querying Over DataSets with the DataSetNavigator

Article
09/04/2007

Arpan Desai
Microsoft Corporation

August 2004

Applies to:
XML
DataSetNavigator

Summary: Arpan Desai discusses the DataSetNavigator, which provides the power and flexibility of the XML programming model without the overhead of having to convert the entire DataSet into an XmlDataDocument object. (7 printed pages)

Click here to download the code sample for this article.

Introduction
Building an XPath Navigator over the DataSet
The DataSetNavigator Implementation Details
Usage
Conclusion

Introduction

For a while now, Dare has been asking me to write an article for MSDN. Now that Microsoft Visual Studio 2005 Beta 1 has (finally) shipped, I've managed to find some time to tinker with an idea I've had for a while: an XPathNavigator over a DataSet.

The XmlDataDocument was originally envisioned as a component which allowed users to have an editable, hierarchical view of Microsoft ADO.NET DataSet data. The main use case has been the ability to perform an XSLT transformation over a DataSet in order to generate HTML for presentation on a webpage. Unfortunately, the performance of the XmlDataDocument is often a bottleneck during XSLT processing.

One workaround is to call the WriteXml() method on the DataSet and load the serialized contents into an XmlDocument or XPathDocument. Although performance typically improves with this approach, it is still suboptimal because the XML data must be serialized and then reparsed. In addition, the data being transformed or queried now resides in both the DataSet and the XmlDocument/XPathDocument, which means a substantial increase in memory usage. The ideal solution would have the performance of a native XML store such as the XPathDocument while adding minimal memory overhead.
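As a rough illustration of that workaround, the sketch below serializes a DataSet into a memory buffer and reparses it into an XPathDocument. The class name WriteXmlWorkaround and the sample data are made up for this example; only DataSet.WriteXml, XPathDocument, and XPathNavigator.Select are real framework members.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml.XPath;

// Hypothetical helper; not part of the article's download.
public static class WriteXmlWorkaround
{
    // Serializes the DataSet to memory, then reparses it as XML.
    // After this call the data exists in the DataSet, in the buffer,
    // and in the XPathDocument.
    public static XPathDocument ToXPathDocument(DataSet ds)
    {
        MemoryStream buffer = new MemoryStream();
        ds.WriteXml(buffer);   // serialize
        buffer.Position = 0;   // rewind before reading
        return new XPathDocument(buffer); // reparse
    }

    // Builds a two-row sample DataSet and counts its Book elements.
    public static int CountBooks()
    {
        DataSet ds = new DataSet("Books");
        DataTable table = ds.Tables.Add("Book");
        table.Columns.Add("Title", typeof(string));
        table.Rows.Add(new object[] { "Quantum Physics for Beginners" });
        table.Rows.Add(new object[] { "Repair your car with twine" });

        XPathNavigator nav = ToXPathDocument(ds).CreateNavigator();
        return nav.Select("/Books/Book").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountBooks());
    }
}
```

The serialize and reparse steps are exactly the cost that a navigator working directly over the DataSet would avoid.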

This article describes an implementation of a class called the DataSetNavigator, which attempts to solve this problem in an ideal manner.

Building an XPath Navigator over the DataSet

The XPathNavigator is a read-only cursor over XML data sources. An XML cursor acts like a lens that focuses on one XML node at a time, but unlike pull-based APIs like the XmlReader, the cursor can be positioned anywhere along the XML document at any given time. The XPathNavigator is a good candidate for implementing an XML façade over non-XML data because it allows one to construct the XML views of a data source in a just-in-time manner, without having to convert the entire data source to an XML tree.
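To make the cursor idea concrete, here is a small sketch that moves an XPathNavigator forward, backward, and upward over an ordinary XPathDocument. The XML string and the CursorDemo class are invented for the example; the Move* methods are the standard XPathNavigator members.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

// Hypothetical demo class; the XML below is sample data only.
public static class CursorDemo
{
    public static string Walk()
    {
        string xml =
            "<Books><Book><Title>A</Title></Book>" +
            "<Book><Title>B</Title></Book></Books>";
        XPathNavigator nav =
            new XPathDocument(new StringReader(xml)).CreateNavigator();

        nav.MoveToFirstChild();      // document root -> <Books>
        nav.MoveToFirstChild();      // <Books> -> first <Book>
        nav.MoveToNext();            // first <Book> -> second <Book>
        string second = nav.Value;   // text content of the second <Book>
        nav.MoveToPrevious();        // back to the first <Book>
        string first = nav.Value;    // text content of the first <Book>
        nav.MoveToParent();          // up to <Books>

        return first + second + nav.LocalName;
    }

    public static void Main()
    {
        Console.WriteLine(Walk());
    }
}
```

A forward-only XmlReader could not take the MoveToPrevious() or MoveToParent() steps, which is the difference the paragraph above describes.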

With the appropriate implementation, you could use it to query the file system, the Windows registry, an Active Directory store, or any other kind of hierarchical data store. In a previous article, Steve Saxon created an XPathNavigator over object graphs, called the ObjectXPathNavigator.

The XslTransform class in the Microsoft .NET Framework utilizes the XPathNavigator to perform XSLT processing over these sources. In addition, the XPathNavigator API also enables XPath querying. By implementing the DataSetNavigator, we can perform both XSLT transformations and XPath queries over the contents of a DataSet.

A question that must be asked is this: why would the DataSetNavigator be any faster than the XmlDataDocument? One of the fundamental reasons the XmlDataDocument has performance issues is that it is editable. Any time a change occurs in an instance of the XmlDataDocument, the change propagates to the DataSet it is associated with. Conversely, any change to the DataSet synchronizes the attached XmlDataDocument. Even when no changes are being made, the overhead of such a design prevents optimal performance during query or transformation. Neither querying nor transformation requires editing; therefore we can create a more efficient XPathNavigator implementation by omitting the synchronization functionality and its overhead.

The DataSetNavigator Implementation Details

The objective is quite simple: Make a fast, lightweight XPathNavigator implementation on top of a DataSet. The most complex feature that will be implemented is the ability to understand nested relationships within the DataSet so that hierarchies can be represented by means of the DataSetNavigator.

The first step in implementing the DataSetNavigator is understanding how the underlying data source will be modeled by the XPathNavigator. When no XML schema is provided to the DataSet, the rules used by the XmlDataDocument and the WriteXml() method to generate XML are quite simple:

The name of the DataSet is the root element in the XML.
The rows of any tables that are not children in a nested relationship become the child elements of the root element.
The columns of the tables become child elements within each row element.
If there are any nested child rows, those also become child elements of the row element.
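The mapping rules above can be observed with a small sketch. The Library, Author, and Book names and the NestedMappingDemo class are invented for this example; the sketch relies only on DataRelation.Nested and DataSet.GetXml() and checks the resulting shape with XPath rather than assuming exact output text.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml.XPath;

// Hypothetical demo class with made-up sample data.
public static class NestedMappingDemo
{
    public static DataSet CreateLibrary()
    {
        DataSet ds = new DataSet("Library"); // becomes the root element

        DataTable authors = ds.Tables.Add("Author");
        DataColumn authorId = authors.Columns.Add("AuthorID", typeof(int));
        authors.Columns.Add("Name", typeof(string));

        DataTable books = ds.Tables.Add("Book");
        DataColumn bookAuthorId = books.Columns.Add("AuthorID", typeof(int));
        books.Columns.Add("Title", typeof(string));

        // Nested = true asks the serializer to place child rows
        // inside their parent row element.
        DataRelation rel = ds.Relations.Add("AuthorBooks", authorId, bookAuthorId);
        rel.Nested = true;

        authors.Rows.Add(new object[] { 1, "Ann" });
        books.Rows.Add(new object[] { 1, "First" });
        return ds;
    }

    // Counts the nodes matched by an XPath over the serialized DataSet.
    public static double Count(DataSet ds, string xpath)
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(ds.GetXml())).CreateNavigator();
        return (double)nav.Evaluate("count(" + xpath + ")");
    }

    public static void Main()
    {
        DataSet ds = CreateLibrary();
        Console.WriteLine(ds.GetXml());
        Console.WriteLine(Count(ds, "/Library/Author/Book")); // nested child rows
        Console.WriteLine(Count(ds, "/Library/Book"));        // rows directly under the root
    }
}
```

Because Book is the child of a nested relationship, its rows should appear under their Author row rather than directly under the Library root.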

Conceptually, the implementation of the DataSetNavigator consists of two main parts. The first part is the ability to generate a simple tree, based on the contents of a DataSet. The second part is the ability to navigate this tree by implementing an XPathNavigator. In the implementation provided, the DataSetNavigator generates the tree when it is instantiated by a user, which results in a higher upfront cost to creating a DataSetNavigator but a much lower cost during actual use.

The tree is constructed from a series of nodes, all of which derive from the DataSetNode class. The abstract DataSetNode class has the basic navigation required to traverse the generated tree and has a few members of importance:

parent—The parent member points to the parent DataSetNode of the current node.
children—The children member is an array of DataSetNodes that represents the children of the current node.
siblingPosition—The siblingPosition member represents the zero-based index of the current node position relative to its siblings. This is necessary so moving between siblings can be implemented simply and efficiently.
localName—The localName member points to an atomized string which is the local name of the current node. XPathNavigators utilize the XmlNameTable for atomization and exposure of the local names, namespace prefixes, and namespace URIs within a given document. This allows for cheap object reference comparisons to occur when searching for these items rather than expensive string equality comparisons.
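The members listed above might look something like the following sketch. It is not the code from the download: the class names DataSetNodeSketch and ElementSketch, the method names, and the public fields are assumptions made for illustration. The NameTable calls are the real System.Xml API.

```csharp
using System;
using System.Xml;

// Hypothetical stand-in for the article's abstract DataSetNode.
public abstract class DataSetNodeSketch
{
    public DataSetNodeSketch parent;
    public DataSetNodeSketch[] children = new DataSetNodeSketch[0];
    public int siblingPosition; // zero-based index within parent.children
    public string localName;    // atomized via an XmlNameTable

    // Sibling moves are simple array lookups thanks to siblingPosition.
    public DataSetNodeSketch NextSibling()
    {
        if (parent == null) return null;
        int next = siblingPosition + 1;
        return next < parent.children.Length ? parent.children[next] : null;
    }

    public DataSetNodeSketch PreviousSibling()
    {
        if (parent == null || siblingPosition == 0) return null;
        return parent.children[siblingPosition - 1];
    }
}

// Hypothetical concrete node used only to exercise the base class.
public sealed class ElementSketch : DataSetNodeSketch
{
    public ElementSketch(string atomizedName) { localName = atomizedName; }

    public void SetChildren(params DataSetNodeSketch[] nodes)
    {
        children = nodes;
        for (int i = 0; i < nodes.Length; i++)
        {
            nodes[i].parent = this;
            nodes[i].siblingPosition = i;
        }
    }
}

public static class DataSetNodeDemo
{
    public static bool Run()
    {
        NameTable names = new NameTable();
        ElementSketch row = new ElementSketch(names.Add("Book"));
        ElementSketch title = new ElementSketch(names.Add("Title"));
        ElementSketch price = new ElementSketch(names.Add("UnitPrice"));
        row.SetChildren(title, price);

        // Build an equal string at run time; Add returns the atomized
        // instance, so a reference comparison is enough to match names.
        string lookup = names.Add(new string(new char[] { 'T', 'i', 't', 'l', 'e' }));

        return Object.ReferenceEquals(lookup, title.localName)
            && title.NextSibling() == price
            && price.NextSibling() == null
            && price.PreviousSibling() == title;
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

Storing the sibling index on each node trades a few bytes per node for constant-time sibling moves, which matches the motivation given for siblingPosition.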

Our actual tree is built from specialized classes which are derived from DataSetNode. These specialized classes correspond to the different types of positions the DataSetNavigator abstracts from the DataSet. Based on the XML serialization rules stated above, we can have three position types of interest:

TopNode—This is the top level element node exposed by the XPathNavigator that is positioned on a DataSet.
TableRowNode—This element node is positioned on individual rows within the DataSet.
ColumnRowNode—This element node is positioned on a specific column on a specific row.

These three node types cover the different positions available within the DataSet. There are two additional DataSetNode derived types that do not represent positions in a DataSet, but are necessary to finish off the DataSetNavigator.

RootNode—Every document has a root; this node is an extremely simple node type which represents this root.
CellValueNode—We need a way to represent the actual data that resides in the DataSet. Every ColumnRowNode in the tree has a single CellValueNode child that contains this data. Note that the implementation of CellValueNode does not duplicate data already in the DataSet, but instead returns the object instances that the DataSet itself holds.
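A simplified sketch of how such a tree could be built when the navigator is constructed follows. To stay short it uses a single SketchNode class with a SketchKind tag instead of the separate DataSetNode subclasses described above, and every name in it is hypothetical. The cell value is read from the DataRow on demand, so nothing is copied out of the DataSet.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical tags mirroring the five node types described above.
public enum SketchKind { Root, Top, TableRow, ColumnRow, CellValue }

public sealed class SketchNode
{
    public SketchKind Kind;
    public string Name;
    public DataRow Row;        // set for TableRow, ColumnRow, CellValue
    public DataColumn Column;  // set for ColumnRow, CellValue
    public SketchNode Parent;
    public List<SketchNode> Children = new List<SketchNode>();

    public SketchNode(SketchKind kind, string name, DataRow row, DataColumn column)
    {
        Kind = kind; Name = name; Row = row; Column = column;
    }

    // Reads through to the DataSet instead of holding a copy.
    public object Value
    {
        get { return Kind == SketchKind.CellValue ? Row[Column] : null; }
    }

    public SketchNode Add(SketchNode child)
    {
        child.Parent = this;
        Children.Add(child);
        return child;
    }
}

public static class TreeBuilderSketch
{
    public static SketchNode Build(DataSet ds)
    {
        SketchNode root = new SketchNode(SketchKind.Root, null, null, null);
        SketchNode top = root.Add(
            new SketchNode(SketchKind.Top, ds.DataSetName, null, null));

        foreach (DataTable table in ds.Tables)
        {
            // Rows of nested child tables are added under their parent row.
            if (IsNestedChild(table)) continue;
            foreach (DataRow row in table.Rows) AddRow(top, row);
        }
        return root;
    }

    private static bool IsNestedChild(DataTable table)
    {
        foreach (DataRelation rel in table.ParentRelations)
            if (rel.Nested) return true;
        return false;
    }

    private static void AddRow(SketchNode parent, DataRow row)
    {
        SketchNode rowNode = parent.Add(
            new SketchNode(SketchKind.TableRow, row.Table.TableName, row, null));

        foreach (DataColumn col in row.Table.Columns)
        {
            SketchNode colNode = rowNode.Add(
                new SketchNode(SketchKind.ColumnRow, col.ColumnName, row, col));
            colNode.Add(new SketchNode(SketchKind.CellValue, null, row, col));
        }

        foreach (DataRelation rel in row.Table.ChildRelations)
            if (rel.Nested)
                foreach (DataRow child in row.GetChildRows(rel))
                    AddRow(rowNode, child);
    }

    public static void Main()
    {
        DataSet ds = new DataSet("Books");
        DataTable table = ds.Tables.Add("Book");
        table.Columns.Add("Title", typeof(string));
        table.Rows.Add(new object[] { "Repair your car with twine" });

        SketchNode cell = Build(ds).Children[0].Children[0].Children[0].Children[0];
        Console.WriteLine(cell.Value);
    }
}
```

This sketch ignores null cells and namespaces; how the actual DataSetNavigator treats those cases is not covered here.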

Usage

Usage of the DataSetNavigator is similar to any other XPathNavigator. The DataSetNavigator can be passed to the XslTransform class for XSLT processing as shown in the example below:

DataSet myDataSet = new DataSet();
...
DataSetNavigator myNavigator = new DataSetNavigator(myDataSet);
XslTransform myTransform = new XslTransform();
myTransform.Load("transform.xsl");
myTransform.Transform(myNavigator, null, Console.Out);

Alternatively, XPath queries can be executed against the DataSet as the following example demonstrates:

using System;
using System.Data;
using System.Xml;
using System.Xml.XPath;
using Microsoft.Xml;

public class DataSetNavTest{

    public static DataSet CreateDataSet(){

        DataSet custDS = new DataSet("Books");

        DataTable ordersTable = custDS.Tables.Add("Book");

        DataColumn pkCol = ordersTable.Columns.Add("BookID", typeof(Int32));
        ordersTable.Columns.Add("Title", typeof(string));
        ordersTable.Columns.Add("Quantity", typeof(int));
        ordersTable.Columns.Add("UnitPrice", typeof(decimal));
        ordersTable.Columns.Add("Category", typeof(string));

        ordersTable.PrimaryKey = new DataColumn[] {pkCol};

        ordersTable.Rows.Add(new Object[]{101,
            "Quantum Physics for Beginners", 2, 50, "Science"});
        ordersTable.Rows.Add(new Object[]{201,
            "Repair your car with twine", 100, 24.99, "Automotive"});
        ordersTable.Rows.Add(new Object[]{301,
            "The Secret of Life, The Universe and Everything", 42, 41.99, "Humor"});

        custDS.AcceptChanges();

        return custDS;
    }

    public static void Main(string[] args){

        DataSet ds = CreateDataSet();
        Console.WriteLine(ds.GetXml());

        DataSetNavigator nav = new DataSetNavigator(ds);
        XPathNodeIterator iter =
            nav.Select("/Books/Book[(UnitPrice * Quantity) > 1000]");

        while (iter.MoveNext()){

            /* print title and total price of order using XPath queries */
            string itemTotal =
                iter.Current.Evaluate("UnitPrice * Quantity").ToString();
            string itemTitle =
                iter.Current.Evaluate("string(Title)").ToString();

            Console.WriteLine("+{0} = {1}", itemTitle, itemTotal);

            /* print title and total price of order using DataSet APIs */
            DataSetNavigator nav2 = (DataSetNavigator) iter.Current.Clone();
            DataRow row = nav2.GetDataRow();

            string itemTotal2 =
                (((decimal)row["UnitPrice"]) * ((int)row["Quantity"])).ToString();
            string itemTitle2 = row["Title"].ToString();

            Console.WriteLine("-{0} = {1}", itemTitle2, itemTotal2);
        }
    }
}

The DataSetNavigator implements the IDataSetPosition interface in addition to being an XPathNavigator. This interface is designed to return the position within the DataSet where the current DataSetNavigator is positioned, which allows for advanced scenarios in which accessing the actual DataRows/DataColumns of the DataSet is necessary. The IDataSetPosition exposes the GetPositionType() method that returns an enumeration value of the current state:

DataSetPosition.DataSet—The DataSetNavigator is positioned on the DataSet. Calling GetDataSet() will successfully return the current DataSet. Calling GetDataRow() or GetDataColumnIndex() will fail since the navigator is not positioned on a specific row or column.
DataSetPosition.Row—The DataSetNavigator is currently positioned on a specific DataRow. Calling GetDataSet() will return the DataSet of which the DataRow is currently part. Calling GetDataRow() will return the current DataRow. Calling GetDataColumnIndex() will result in an error.
DataSetPosition.Cell—This denotes that the DataSetNavigator is positioned on a specific cell. A cell is considered to be an intersection of a single DataRow and a single DataColumn. Calling GetDataSet(), GetDataRow(), or GetDataColumnIndex() will all succeed in this case.
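The sketch below shows one way a caller might branch on the position type before touching the row or column accessors. The method names come from the description above, but the exact signatures, the "Sketch" type names, the FixedPosition stand-in, and the choice of InvalidOperationException for invalid calls are all assumptions; the real interface ships with the article's download.

```csharp
using System;
using System.Data;

// Hypothetical mirror of the DataSetPosition enumeration.
public enum PositionSketch { DataSet, Row, Cell }

// Hypothetical mirror of IDataSetPosition; signatures are assumed.
public interface IDataSetPositionSketch
{
    PositionSketch GetPositionType();
    DataSet GetDataSet();
    DataRow GetDataRow();
    int GetDataColumnIndex();
}

// Stand-in implementation so the sketch runs without the navigator.
public sealed class FixedPosition : IDataSetPositionSketch
{
    private readonly DataSet ds;
    private readonly DataRow row;      // null when positioned on the DataSet
    private readonly int columnIndex;  // negative when not on a cell

    public FixedPosition(DataSet ds, DataRow row, int columnIndex)
    {
        this.ds = ds; this.row = row; this.columnIndex = columnIndex;
    }

    public PositionSketch GetPositionType()
    {
        if (row == null) return PositionSketch.DataSet;
        return columnIndex < 0 ? PositionSketch.Row : PositionSketch.Cell;
    }

    public DataSet GetDataSet() { return ds; }

    public DataRow GetDataRow()
    {
        if (row == null) throw new InvalidOperationException("Not on a row.");
        return row;
    }

    public int GetDataColumnIndex()
    {
        if (row == null || columnIndex < 0)
            throw new InvalidOperationException("Not on a cell.");
        return columnIndex;
    }
}

public static class PositionDemo
{
    // Calls only the accessors that are valid for the current position.
    public static string Describe(IDataSetPositionSketch pos)
    {
        switch (pos.GetPositionType())
        {
            case PositionSketch.Cell:
                DataRow row = pos.GetDataRow();
                int index = pos.GetDataColumnIndex();
                return "cell " + row.Table.Columns[index].ColumnName + " = " + row[index];
            case PositionSketch.Row:
                return "row of " + pos.GetDataRow().Table.TableName;
            default:
                return "dataset " + pos.GetDataSet().DataSetName;
        }
    }

    public static void Main()
    {
        DataSet ds = new DataSet("Books");
        DataTable table = ds.Tables.Add("Book");
        table.Columns.Add("Title", typeof(string));
        DataRow row = table.Rows.Add(new object[] { "X" });

        Console.WriteLine(Describe(new FixedPosition(ds, null, -1)));
        Console.WriteLine(Describe(new FixedPosition(ds, row, -1)));
        Console.WriteLine(Describe(new FixedPosition(ds, row, 0)));
    }
}
```

Checking GetPositionType() first keeps the caller from relying on exceptions to discover where the navigator is positioned.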

Conclusion

In preliminary testing, the DataSetNavigator is anywhere from 2x to 50x faster than running the same transformation on an XmlDataDocument, depending on the size of the data being transformed and the complexity of the stylesheet. The DataSetNavigator is also 2x to 5x faster than the XmlDocument or XPathDocument workaround currently available, because the contents of the DataSet never have to be serialized and reparsed.

The current DataSetNavigator is fairly simple and does not incorporate some interesting features. One feature that is lacking is support for namespaces. The current methods of converting a DataSet to XML support serializing namespaces when they are associated with various parts of the DataSet. The current DataSetNavigator doesn't surface these namespaces. This functionality would be fairly straightforward to add and has been left as an exercise for the interested reader.