XML Reader with Bookmarks

 

Helena Kupkova
Microsoft Corporation

February 2005

Applies to:
   Microsoft .NET Framework

Summary: Helena Kupkova discusses the XmlBookmarkReader, which provides the ability to set bookmarks in an XML stream and then navigate between them. The XmlBookmarkReader combines random access to the XML with the XmlReader API. (9 printed pages)

Download the XmlBookmarkReader.exe sample.

Note   This download requires that the Microsoft .NET Framework 1.0 be installed.

Contents

Introduction
A First Look at the XmlBookmarkReader
How the XmlBookmarkReader Works
Conclusion

Introduction

XmlBookmarkReader is an XmlReader that enables you to set a bookmark at an XML node, move the XmlReader ahead, and then go back to the bookmarked node and replay the XML content. Its advantage is that applications that need to scan ahead before processing the current position can still process XML efficiently through the XmlReader API, without loading the whole XML document into memory. It is also useful for applications that need to step back a few nodes and "replay" them. Without the XmlBookmarkReader, such applications would have to load all or part of the XML document into memory, using XmlDocument or XPathDocument.

A First Look at the XmlBookmarkReader

The XmlBookmarkReader is a subclass of XmlReader. It can be layered on other instances of XmlReader, such as XmlTextReader. The following table shows the methods that the XmlBookmarkReader adds to the XmlReader API:

Table 1. Methods in the XmlBookmarkReader

Method                             Description
SetBookmark(string)                Creates a bookmark on the current node and assigns it a name.
ReturnToBookmark(string)           Moves the reader back to a node that has a bookmark with the given name.
RemoveBookmark(string)             Removes a bookmark with the given name. The XML nodes cached for this bookmark will be released for garbage collection unless they are needed by a preceding bookmark.
ReturnToAndRemoveBookmark(string)  This method is a combination of the two previous methods. It moves the reader back to a node that has a bookmark with the given name and then removes the bookmark. As the reader moves ahead, the nodes cached for this bookmark will be released for garbage collection unless they are needed by a preceding bookmark.
RemoveAllBookmarks()               Removes all bookmarks. All cached XML nodes will be released for garbage collection.

The following example uses the XmlBookmarkReader to select a list of elements based on the value of one of their child nodes. It uses a sample document, taken from Dare Obasanjo's article about XPathReader, that describes a few books:

<books>
  <book publisher="IDG books" on-loan="Sanjay">
    <title>XML Bible</title>
    <author>Elliotte Rusty Harold</author>
  </book>
  <book publisher="Addison-Wesley">
    <title>The Mythical Man Month</title>
    <author>Frederick Brooks</author>
  </book>
  <book publisher="WROX">
    <title>Professional XSLT 2nd Edition</title>
    <author>Michael Kay</author>
  </book>
  <book publisher="Prentice Hall" on-loan="Sander" >
   <title>Definitive XML Schema</title>
   <author>Priscilla Walmsley</author>
  </book>
  <book publisher="APress">
   <title>A Programmer's Introduction to C#</title>
   <author>Eric Gunnerson</author>
  </book>
</books>

Let's say we want to write out all the books that have "XML" in the title. That can be expressed by an XPath expression:

/books/book[contains(title, 'XML')]

We cannot use this XPath expression with XPathReader because it is not a sequential XPath expression: the reader has to look ahead to the value of the title element before deciding whether to write out the particular book. That is where the XmlBookmarkReader comes in handy. Here is code that uses XmlBookmarkReader to get the list of books that have "XML" in the title:

using System; 
using System.IO;
using System.Xml;
using Microsoft.Samples;

public class Test {
    static void Main() {
        try { 
            // Create the bookmark reader
            XmlTextReader tr = new XmlTextReader( "books.xml" );
            tr.WhitespaceHandling = WhitespaceHandling.None;
            XmlBookmarkReader br = new XmlBookmarkReader(tr);

            // Create XmlTextWriter for the results
            XmlTextWriter writer = new XmlTextWriter( Console.Out );
            writer.Formatting = Formatting.Indented;
            writer.WriteStartElement( "selectedBooks" );

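            // Advance manually: WriteNode already moves the reader onto the
            // node after </book>, and that node must not be skipped
            bool more = br.Read();
            while ( more ) {
                if ( br.NodeType == XmlNodeType.Element 
                        && br.Name == "book" ) {
                    // Set a bookmark on the <book> element
                    br.SetBookmark( "bookmark" );

                    // Read ahead until the <title> element
                    while ( br.Read() && br.Name != "title" ) ;

                    // Read the value of <title> element
                    string title = br.ReadElementString();

                    // check if it contains "XML" (String.Contains is not
                    // available in .NET Framework 1.x, so use IndexOf)
                    if ( title.IndexOf( "XML" ) >= 0 ) {
                        // go back to the bookmark on <book> and write
                        // out the whole element
                        br.ReturnToAndRemoveBookmark( "bookmark" ); 
                        writer.WriteNode( br, true );
                        // the reader now sits on the next node; examine it
                        // on the next pass instead of reading past it
                        more = !br.EOF;
                        continue;
                    }
                    // remove the bookmark, we no longer need it
                    br.RemoveBookmark( "bookmark" );
                }
                more = br.Read();
            }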
            writer.WriteEndElement();
            writer.Close();
        }
        catch( XmlException xe ) {
            Console.WriteLine( "XML Parsing Error: " + xe );
        }
        catch( IOException ioe ) {
           Console.WriteLine( "File I/O Error: " + ioe );
        }
    }
}  

The code writes out the following:

<selectedBooks>
  <book publisher="IDG books" on-loan="Sanjay">
    <title>XML Bible</title>
    <author>Elliotte Rusty Harold</author>
  </book>
  <book publisher="Prentice Hall" on-loan="Sander">
    <title>Definitive XML Schema</title>
    <author>Priscilla Walmsley</author>
  </book>
</selectedBooks>
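
For comparison, here is a minimal sketch of the in-memory alternative that the XmlBookmarkReader is meant to avoid. It evaluates the same XPath expression with XmlDocument, which loads the whole document before selecting anything. The class and method names (XPathComparison, SelectXmlTitles) and the shortened input document are illustrative assumptions and are not part of the sample download.

```csharp
using System;
using System.Xml;

public class XPathComparison {
    // Returns the titles of all <book> elements matched by the XPath
    // expression from the article.
    public static string[] SelectXmlTitles( string xml ) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml( xml );  // the whole document is held in memory
        XmlNodeList books =
            doc.SelectNodes( "/books/book[contains(title, 'XML')]" );
        string[] titles = new string[ books.Count ];
        for ( int i = 0; i < books.Count; i++ ) {
            titles[i] = books[i].SelectSingleNode( "title" ).InnerText;
        }
        return titles;
    }

    public static void Main() {
        // Shortened version of the books document, for illustration only
        string xml =
            "<books>" +
            "<book><title>XML Bible</title></book>" +
            "<book><title>The Mythical Man Month</title></book>" +
            "<book><title>Definitive XML Schema</title></book>" +
            "</books>";
        foreach ( string title in SelectXmlTitles( xml ) ) {
            Console.WriteLine( title );
        }
    }
}
```

This is shorter to write, but its memory use grows with the size of the document, whereas the bookmark reader only caches the nodes between the bookmark and the current position.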

How the XmlBookmarkReader Works

Coding XmlBookmarkReader was a nice exercise in working with linked lists. I used linked lists to store the cached XML nodes, and also to store the list of namespaces in the scope of each node.

Cached Nodes

When the SetBookmark method is called to create a bookmark on the current node, the reader needs to cache that node and every node that follows it as the reader moves ahead, such as when the Read method is called. All the nodes between the bookmark and the current position need to be cached, so we can go back later and replay them.

The cached nodes are stored in a linked list of CachedXmlNode instances. Each CachedXmlNode holds information about a single node, such as its name, prefix, value, and depth. It also refers to a list of the node's attributes (the attributes field), which is again a list of CachedXmlNode instances. There is no need to build a more elaborate structure for the cached nodes, such as a tree or XmlNode, because all we need is to be able to replay them sequentially. A linked list is an ideal data structure for this.
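
The structure described above might look roughly like the following sketch. It is not the sample's actual source: NodeSketch is a hypothetical, simplified stand-in for CachedXmlNode, and the field and method names (Attributes, Next, CacheAll, Replay) are assumptions chosen for illustration. The sketch caches every node from a reader, whereas the real reader caches only while a bookmark is set.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical, simplified stand-in for the sample's CachedXmlNode
public class NodeSketch {
    public XmlNodeType NodeType;
    public string Name;
    public string Value;
    public int Depth;
    public NodeSketch Attributes;  // first attribute of an element, or null
    public NodeSketch Next;        // next node in document order, or null

    public NodeSketch( XmlReader reader ) {
        NodeType = reader.NodeType;
        Name = reader.Name;
        Value = reader.Value;
        Depth = reader.Depth;
    }
}

public class CacheSketch {
    // Copies every node from the reader into a linked list; returns the head
    public static NodeSketch CacheAll( XmlReader reader ) {
        NodeSketch head = null, tail = null;
        while ( reader.Read() ) {
            NodeSketch node = new NodeSketch( reader );
            // Attributes hang off their element in a separate list
            NodeSketch lastAttr = null;
            while ( reader.MoveToNextAttribute() ) {
                NodeSketch attr = new NodeSketch( reader );
                if ( lastAttr == null ) node.Attributes = attr;
                else lastAttr.Next = attr;
                lastAttr = attr;
            }
            reader.MoveToElement();  // back to the owner before reading on
            if ( tail == null ) head = node;
            else tail.Next = node;
            tail = node;
        }
        return head;
    }

    // Replays the cached nodes sequentially, one line per node
    public static string Replay( NodeSketch start ) {
        StringBuilder sb = new StringBuilder();
        for ( NodeSketch n = start; n != null; n = n.Next ) {
            sb.Append( n.NodeType ).Append( ':' ).Append( n.Name );
            sb.Append( ':' ).Append( n.Value ).Append( '\n' );
        }
        return sb.ToString();
    }

    public static void Main() {
        XmlTextReader tr = new XmlTextReader( new StringReader(
            "<A xmlns:p='ns1' p:attr='value'>text of A</A>" ) );
        Console.Write( Replay( CacheAll( tr ) ) );
    }
}
```

Replaying only ever follows the Next reference, which is why a singly linked list is enough here.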

Bookmarks

Bookmarks are stored in a Hashtable. A bookmark is hashed by its name, and each name maps to the CachedXmlNode the bookmark was set on. When the ReturnToBookmark method is called to move the reader to a bookmark, the reader gets the cached XML node that is associated with the bookmark name from the Hashtable, and sets it as the current node. The reader is now positioned on the bookmarked node. The following Read calls will replay the cached XML nodes, starting with the node that immediately follows the bookmarked node.
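
A minimal sketch of this lookup follows, assuming a cache that has already been filled. BookmarkTableSketch and Node are hypothetical names, and the Read method here only walks the cache. The real reader also pulls new nodes from the underlying XmlReader once the cache is exhausted, which this sketch leaves out.

```csharp
using System;
using System.Collections;

// Hypothetical minimal cached node: a name and a forward link
public class Node {
    public string Name;
    public Node Next;
    public Node( string name ) { Name = name; }
}

public class BookmarkTableSketch {
    private Hashtable bookmarks = new Hashtable();  // bookmark name -> Node
    private Node current;

    public BookmarkTableSketch( Node start ) { current = start; }

    public string CurrentName { get { return current.Name; } }

    // Moves to the next cached node; returns false at the end of the cache
    public bool Read() {
        if ( current.Next == null ) return false;
        current = current.Next;
        return true;
    }

    public void SetBookmark( string name ) { bookmarks[name] = current; }

    // Looks the node up by bookmark name and makes it the current node
    public void ReturnToBookmark( string name ) {
        Node node = (Node) bookmarks[name];
        if ( node == null )
            throw new ArgumentException( "Unknown bookmark: " + name );
        current = node;
    }

    public void RemoveBookmark( string name ) { bookmarks.Remove( name ); }

    public static void Main() {
        // Cached list: A -> text -> /A -> B
        Node a = new Node( "A" );
        a.Next = new Node( "text" );
        a.Next.Next = new Node( "/A" );
        a.Next.Next.Next = new Node( "B" );

        BookmarkTableSketch reader = new BookmarkTableSketch( a );
        reader.SetBookmark( "b1" );
        while ( reader.Read() ) ;                 // move ahead to B
        reader.ReturnToBookmark( "b1" );          // back on A
        Console.WriteLine( reader.CurrentName );
        reader.Read();                            // replays the node after A
        Console.WriteLine( reader.CurrentName );
    }
}
```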

Example

Let's take a look at the following example:

<root xmlns="rootNs">
  <A xmlns:p="ns1" p:attr="value">text of A</A>
  <B xmlns:p="ns2" ><p:C/></B>
</root>

The following code parses this XML document, sets bookmarks on the elements A and B, and stops on the p:C element:

using System.Xml;
using Microsoft.Samples;

public class Test {
    static void Main() {
        XmlTextReader tr = new XmlTextReader( "sample.xml" );
        tr.WhitespaceHandling = WhitespaceHandling.None;
        XmlBookmarkReader br = new XmlBookmarkReader( tr );
        // move to <A> element and set a bookmark
        br.Read();
        br.Read();
        br.SetBookmark( "b1" );
        // move to <B>
        while ( br.Read() && br.LocalName != "B" ) ;
        // set a bookmark
        br.SetBookmark( "b2" );
        // move to <p:C>
        br.Read();
        ...
    }
}

Figure 1 shows how the list of cached XML nodes and the hashtable with bookmarks look inside XmlBookmarkReader at the end of the Main method.

Figure 1. Linked lists inside XmlBookmarkReader

Note that the <root> element is not in the list, since there was no need to cache it. At the end of the Main method the inner XmlTextReader is positioned on the p:C element, so the end element nodes for B and <root> are not cached either.

Garbage Collection Does the Cleanup Work for Us

If the RemoveBookmark method is called at this point to remove the bookmark b1, all nodes between <A> and </A> will be left for garbage collection to clean up (see Figure 2). This is because the hashtable entry for bookmark b1 is the only place that keeps a reference to the cached element <A> and its children. The element <B> will not be released, though, because there is still a hashtable entry for bookmark b2 that references it.

Figure 2. Linked lists inside XmlBookmarkReader after RemoveBookmark("b1")

The bookmark b1 is removed by calling Remove("b1") on the hashtable, which clears the hashtable entry for b1. That is all that is needed for the cleanup; the garbage collector will take care of the rest. Now imagine how much cleanup code we would need in C++!
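
The effect can be illustrated without relying on when the garbage collector actually runs. The sketch below rebuilds the cached list from the example (A, its text, the end of A, B, and p:C) and counts the nodes that are still reachable from the bookmark hashtable. Anything that is no longer reachable is eligible for collection. The names ListNode, ReachabilitySketch, and CountReachable are hypothetical, and the sketch treats the hashtable as the only root. The reader's own reference to the current node, p:C, does not change the count because that node is reachable from b2 anyway.

```csharp
using System;
using System.Collections;

// Hypothetical minimal cached node: a name and a forward link
public class ListNode {
    public string Name;
    public ListNode Next;
    public ListNode( string name, ListNode next ) {
        Name = name;
        Next = next;
    }
}

public class ReachabilitySketch {
    // Counts the distinct nodes reachable from any bookmark via Next links
    public static int CountReachable( Hashtable bookmarks ) {
        Hashtable seen = new Hashtable();
        foreach ( ListNode start in bookmarks.Values ) {
            for ( ListNode n = start;
                  n != null && !seen.ContainsKey( n );
                  n = n.Next ) {
                seen[n] = true;
            }
        }
        return seen.Count;
    }

    // Builds the list A -> text -> /A -> B -> p:C with bookmarks b1 and b2
    public static Hashtable BuildExample() {
        ListNode c = new ListNode( "p:C", null );
        ListNode b = new ListNode( "B", c );
        ListNode endA = new ListNode( "/A", b );
        ListNode text = new ListNode( "text of A", endA );
        ListNode a = new ListNode( "A", text );
        Hashtable bookmarks = new Hashtable();
        bookmarks["b1"] = a;
        bookmarks["b2"] = b;
        return bookmarks;
    }

    public static void Main() {
        Hashtable bookmarks = BuildExample();
        Console.WriteLine( CountReachable( bookmarks ) );  // prints 5
        bookmarks.Remove( "b1" );
        Console.WriteLine( CountReachable( bookmarks ) );  // prints 2
    }
}
```

Because the links only point forward, removing b1 cuts off A, its text, and its end element, while B and p:C stay reachable through b2.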

Namespaces

Another linked list used by the XmlBookmarkReader is for keeping track of namespace declarations. It is in fact a tree, but each item has just a single reference to its parent, so in many cases it behaves like a list.

So how do the namespaces work? You probably know that each element can have one or more namespace declarations. These declarations are in scope only for that element and its descendants. In the XmlReader API the declarations can be accessed using the LookupNamespace method, which enables you to look up a namespace URI for a prefix.

Obviously the method can return different results on different nodes. In the example above LookupNamespace for p would return ns1 on element <A>, ns2 on element <B>, and null on the <root> element. When the XmlBookmarkReader moves back to a bookmark on the element <A>, the LookupNamespace method needs to return the correct value. That means that we need to preserve the namespaces in scope for each cached node.
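
The lookups described above can be checked against a plain XmlTextReader, which is the behavior the bookmark reader has to reproduce when it replays cached nodes. This is a small sketch: LookupDemo and LookupOn are hypothetical names, and the document is the example above passed in as a string.

```csharp
using System;
using System.IO;
using System.Xml;

public class LookupDemo {
    public const string Sample =
        "<root xmlns='rootNs'>" +
        "<A xmlns:p='ns1' p:attr='value'>text of A</A>" +
        "<B xmlns:p='ns2'><p:C/></B>" +
        "</root>";

    // Returns the namespace URI bound to prefix "p" on the first element
    // with the given local name, or null if the prefix is not in scope
    public static string LookupOn( string xml, string localName ) {
        XmlTextReader tr = new XmlTextReader( new StringReader( xml ) );
        try {
            while ( tr.Read() ) {
                if ( tr.NodeType == XmlNodeType.Element
                        && tr.LocalName == localName ) {
                    return tr.LookupNamespace( "p" );
                }
            }
            return null;  // no such element
        }
        finally {
            tr.Close();
        }
    }

    public static void Main() {
        string[] names = new string[] { "root", "A", "B" };
        foreach ( string name in names ) {
            string uri = LookupOn( Sample, name );
            Console.WriteLine( name + ": " + ( uri == null ? "(null)" : uri ) );
        }
    }
}
```

The results differ per element even though the prefix is the same, so a single namespace table for the whole reader would not be enough once the reader can move backwards.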

Each CachedXmlNode therefore keeps a reference to a list of NamespaceDecl instances, which represent the namespaces in scope for this node. If an element does not declare any new namespaces, it will refer to the same namespace list as its parent. If it does declare any new namespaces, however, it will have its own NamespaceDecl entries for its declarations, which will then link to the parent's list.
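
One way to picture this parent-linked list is the sketch below. NsDecl is a hypothetical stand-in for the sample's NamespaceDecl, and its fields (Prefix, Uri, Previous) and the Lookup helper are assumptions made for illustration. The sample's actual members may differ.

```csharp
using System;

// Hypothetical stand-in for the sample's NamespaceDecl
public class NsDecl {
    public string Prefix;
    public string Uri;
    public NsDecl Previous;  // next entry toward the root, or null

    public NsDecl( string prefix, string uri, NsDecl previous ) {
        Prefix = prefix;
        Uri = uri;
        Previous = previous;
    }

    // Walks from the given scope toward the root; the nearest match wins
    public static string Lookup( NsDecl scope, string prefix ) {
        for ( NsDecl d = scope; d != null; d = d.Previous ) {
            if ( d.Prefix == prefix ) return d.Uri;
        }
        return null;
    }
}

public class NamespaceScopeSketch {
    public static void Main() {
        // <root xmlns="rootNs"> declares the default namespace
        NsDecl rootScope = new NsDecl( "", "rootNs", null );
        // <A> and <B> each add a declaration that links to root's list,
        // so the structure forks into a tree at this point
        NsDecl aScope = new NsDecl( "p", "ns1", rootScope );
        NsDecl bScope = new NsDecl( "p", "ns2", rootScope );
        // <p:C/> declares nothing new and shares B's list
        NsDecl cScope = bScope;

        Console.WriteLine( NsDecl.Lookup( aScope, "p" ) );
        Console.WriteLine( NsDecl.Lookup( cScope, "p" ) );
        Console.WriteLine( NsDecl.Lookup( aScope, "" ) );
        Console.WriteLine( NsDecl.Lookup( rootScope, "p" ) == null );
    }
}
```

Each entry needs only one reference, toward its parent, so creating a scope for a new element never requires copying or modifying the entries that already exist.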

Note   The only reason why we do all the work with the namespaces is for the LookupNamespace method; it is not needed for replaying prefixes and namespace URIs of elements and attributes. Since the namespace tracking has some performance overhead, you may want to disable it if your application does not use the LookupNamespace method. To do that, just comment out the ProcessNamespaces method in the XmlBookmarkReader source code.

Figure 3 shows how the cached nodes map to their namespace declarations at the end of the Main method. Notice how the list forks and basically changes to a tree when there are two or more sibling elements with namespace declarations.

Figure 3. Linked lists inside XmlBookmarkReader, including the namespace declarations

Note   The XmlBookmarkReader does not support replaying of all the properties XmlReader exposes for each node. For example, it does not cache BaseURI, QuoteChar, XmlSpace, or XmlLang of each node. Support for these can be easily added by adding more fields to CachedXmlNode. Also, its ReadAttributeValue method does not iterate over the text and entity references in an attribute value. It returns only one text node with the whole Value of the attribute.

Conclusion

The XmlBookmarkReader is a useful component for processing XML. It combines the flexibility of the pull-based XML parser model of XmlReader with bookmarking capability and the ability to go back to XML nodes that have already been parsed. It is not a replacement for an XML store such as XmlDocument or XPathDocument, but it is extremely useful when the range of nodes that the application needs to access, either backwards or forwards, is relatively small.