Some Fun with SAX
August 22, 2000
I now know the secret of generating a massive number of reader comments. Just write an article about cool new technologies that the readers can't get their hands on yet. I hope you've all recovered now that the .NET Framework SDK Technology Preview is available.
In this article, I want to take a close look at the Visual Basic® SAX interface included in the July 2000 Microsoft XML Parser Beta Release. I decided to write some Visual Basic code, for a change, and I ended up writing quite a lot of it. I also gave the MSXML2.VBSAXXMLReader30 class a really good workout. The application looks like this:
OASIS Conformance Test Harness
I started off by writing a Visual Basic test harness for running the OASIS Conformance Test Suite. I needed this for another project I was working on, and I thought I'd kill two birds with one stone.
The test harness loads a big XML document, which is the index of all tests to run. Each test is listed like this:
<TEST TYPE="not-wf" ENTITIES="none" ID="not-wf-sa-001" URI="not-wf/sa/001.xml" SECTIONS="3.1 "> Attribute values must start with attribute names, not "?". </TEST>
Each TEST element contains the following attributes:
|Attribute||Example value||Description|
|TYPE||not-wf||The parser is supposed to report a well-formedness error|
|TYPE||invalid||Validating parsers are supposed to report validation errors; non-validating parsers are supposed to pass these tests|
|TYPE||valid||Both validating and non-validating parsers are supposed to pass these tests|
|ENTITIES||none||Whether the test requires support for loading entities|
|ID||not-wf-sa-001||The unique test identifier|
|URI||not-wf/sa/001.xml||The location of the actual XML test file to parse|
|SECTIONS||3.1 ||A reference to the relevant section in the XML 1.0 spec|
The main entry point to the OasisTest class module takes a URL pointing to the master xmlconf.xml index document:
Public Sub run(testurl As String)
When this is called, I load up this test index into a DOMDocument object, select all the TEST elements, and then call my SAX test code with the information about each test.
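Dim doc As DOMDocument
Dim node As IXMLDOMElement
Dim tests As IXMLDOMNodeList

Set doc = New DOMDocument30
doc.async = False
doc.Load testurl                        ' load the test index synchronously
Set tests = doc.selectNodes("//TEST")
Set node = tests.nextNode()
While Not node Is Nothing And Not Cancel
    ' docBase (the base URL of the test files) is assumed to be set elsewhere
    RunTest docBase, node
    Set node = tests.nextNode()         ' object assignment needs Set
Wend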
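If you want to experiment with the same loop outside Visual Basic, here is a rough Python sketch using only the standard library. It is not the harness code from this article: `SAMPLE_INDEX` is a tiny made-up stand-in for xmlconf.xml (the second test ID is hypothetical), and `list_tests` is a name I picked for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the xmlconf.xml index; the real file is far larger.
SAMPLE_INDEX = """<TESTSUITE>
  <TEST TYPE="not-wf" ENTITIES="none" ID="not-wf-sa-001"
        URI="not-wf/sa/001.xml" SECTIONS="3.1 ">
    Attribute values must start with attribute names, not "?".
  </TEST>
  <TEST TYPE="valid" ENTITIES="none" ID="example-valid-001"
        URI="example/valid-001.xml" SECTIONS="3">
    Hypothetical entry used only for this sketch.
  </TEST>
</TESTSUITE>"""


def list_tests(index_xml):
    """Return one dict per TEST element, in document order."""
    root = ET.fromstring(index_xml)
    tests = []
    for node in root.iter("TEST"):  # same idea as selectNodes("//TEST")
        tests.append({
            "id": node.get("ID"),
            "type": node.get("TYPE"),
            "uri": node.get("URI"),
            # Collapse the whitespace around the human-readable description.
            "description": " ".join((node.text or "").split()),
        })
    return tests


for test in list_tests(SAMPLE_INDEX):
    print(test["id"], test["type"], test["uri"])
```

Each dictionary carries what a per-test routine would need: the expected outcome (TYPE) and the location of the file to parse (URI).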
I also create an empty document, which will contain a log of all the test results. When the user clicks the Generate Report button, this document is transformed using the template.xsl style sheet to display the final test report.
To actually run the test with the MSXML2.VBSAXXMLReader object, I use the following method:
Public Sub RunTest(docBase As String, element As IXMLDOMElement)
First, this method creates a new SAX reader object, and configures that object to process external entities and call back on my implementations of the IVBSAXContentHandler, IVBSAXDTDHandler, and IVBSAXErrorHandler interfaces.
Dim ContentHandler As ContentHandler
Set ContentHandler = New ContentHandler

Dim reader As VBSAXXMLReader30
Set reader = New VBSAXXMLReader30

reader.putFeature "http://xml.org/sax/features/external-parameter-entities", True
reader.putFeature "http://xml.org/sax/features/external-general-entities", True

Set reader.contentHandler = ContentHandler
Set reader.errorHandler = ContentHandler
Set reader.dtdHandler = ContentHandler
Notice that it's actually quite convenient to implement all three handler interfaces on one class, the ContentHandler class.
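The one-class-for-all-handlers trick carries over to other SAX bindings. The following is a minimal Python sketch of the same wiring, not the MSXML API: `HarnessHandler` and `run_test` are hypothetical names. As far as I know, Python's expat-based reader does not read external parameter entities, so the sketch only turns on the general-entities feature.

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, DTDHandler, ErrorHandler,
                             feature_external_ges)


class HarnessHandler(ContentHandler, DTDHandler, ErrorHandler):
    """One object that receives content, DTD, and error callbacks."""

    def __init__(self):
        super().__init__()
        self.elements = []   # element names in document order
        self.notations = []  # notation names reported by the DTD handler
        self.fatal = None    # message of the first fatal error, if any

    def startElement(self, name, attrs):
        self.elements.append(name)

    def notationDecl(self, name, publicId, systemId):
        self.notations.append(name)

    def fatalError(self, exception):
        self.fatal = exception.getMessage()
        raise exception  # stop parsing; run_test catches this


def run_test(data):
    """Parse a bytes document and return the handler holding the results."""
    handler = HarnessHandler()
    reader = xml.sax.make_parser()
    # Only enable external entities for input you trust.
    reader.setFeature(feature_external_ges, True)
    reader.setContentHandler(handler)
    reader.setDTDHandler(handler)
    reader.setErrorHandler(handler)
    try:
        reader.parse(io.BytesIO(data))
    except xml.sax.SAXParseException:
        pass  # already recorded in handler.fatal
    return handler


result = run_test(b'<!DOCTYPE doc [<!NOTATION gif SYSTEM "image/gif">]><doc/>')
print(result.elements, result.notations, result.fatal)
```

For a not-wf test the interesting field is `fatal`; for valid tests the harness would go on to compare output.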
To kick off the actual parsing of the test file, I simply call the parseURL method.
Then I check the results, compare the output against the expected output, and so on.
The ContentHandler class module, which implements the SAX callback interfaces, starts out like this:
Option Explicit

Implements IVBSAXContentHandler
Implements IVBSAXDTDHandler
Implements IVBSAXErrorHandler
ContentHandler then implements all the methods defined on these interfaces. The bulk of the code in this class deals with generating a canonical output of the XML, which can then be used for comparison against the expected output. It includes the following sorts of things:
- Escaping all the special characters as entities. This includes & < > " and also the newline characters, which are escaped as &#10; and &#13;. This is simply how the expected output files in the test suite are stored, so I have to do the same.
- Sorting the attributes. Since attributes are order independent, and some parsers return default attributes in a different order from others, this guarantees that the order of the attributes matches the expected output files. To do this, I used a Visual Basic QuickSort algorithm I found on MSDN. See the Sorter.cls module.
- Saving notation declarations. These come from the IVBSAXDTDHandler interface. They need to be saved, then sorted for comparison against the expected output. Storing and sorting of the notations is done in the DocType.cls module.
- Catching and storing the error information. This is done in the implementation of the IVBSAXErrorHandler fatalError method.
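The first two items in this list (escaping and attribute sorting) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `escape_text`, `CanonicalWriter`, and `canonicalize` are invented names, Python's built-in `sorted` stands in for the QuickSort in Sorter.cls, and notations and error capture are left out.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

# "&" must be replaced first so the entities written later are not re-escaped.
_ESCAPES = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;"),
            ("\n", "&#10;"), ("\r", "&#13;")]


def escape_text(text):
    """Escape markup characters and newline characters as entities."""
    for raw, entity in _ESCAPES:
        text = text.replace(raw, entity)
    return text


class CanonicalWriter(ContentHandler):
    """Writes each event back out in a normalized, comparable form."""

    def __init__(self):
        super().__init__()
        self.out = io.StringIO()

    def startElement(self, name, attrs):
        self.out.write("<" + name)
        # Sort by attribute name so parser-specific ordering cannot matter.
        for attr_name in sorted(attrs.getNames()):
            value = escape_text(attrs.getValue(attr_name))
            self.out.write(' %s="%s"' % (attr_name, value))
        self.out.write(">")

    def endElement(self, name):
        self.out.write("</" + name + ">")

    def characters(self, content):
        self.out.write(escape_text(content))


def canonicalize(data):
    writer = CanonicalWriter()
    xml.sax.parseString(data, writer)
    return writer.out.getvalue()


print(canonicalize(b'<doc b="2" a="1">x &amp; y\n<e/></doc>'))
```

The string returned by `canonicalize` is what would be compared against the expected output file for a test.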
XML Statistics Package
Another fun thing to do with SAX-level XML processing is to count elements and attributes. I wrote another simple IVBSAXContentHandler implementation that counts the number of elements, attributes, text nodes, text chars, and name chars, and displays a "tagginess ratio," which is an indication of how much markup is in the file relative to actual text content. The hamlet.xml file, for example, turns out to be quite taggy.
When you play with this, you definitely get a feel for how snappy SAX-level processing can be. It is certainly a lot faster than loading the DOMDocument object model and walking the tree to calculate all this.
This class module implements the IVBSAXContentHandler and IVBSAXErrorHandler SAX callback interfaces by incrementing a set of counters based on which method is called. For example, the startElement method does the following:
Private Sub IVBSAXContentHandler_startElement( _
        ByVal strNamespaceURI As String, _
        ByVal strLocalName As String, _
        ByVal strQName As String, _
        ByVal Attributes As IVBSAXAttributes)
    Dim i As Integer

    ' Count the element itself and the characters in its name
    Elements = Elements + 1
    NameChars = NameChars + Len(strQName)

    ' Attribute names count as markup; attribute values count as text
    AttributeNodes = AttributeNodes + Attributes.length
    For i = 0 To Attributes.length - 1
        NameChars = NameChars + Len(Attributes.getQName(i))
        TextChars = TextChars + Len(Attributes.getValue(i))
    Next
End Sub
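Here is roughly the same counter as a Python `xml.sax` handler, for anyone who wants to play with the numbers without MSXML. `StatsHandler` is a hypothetical name, and the formula in `tagginess` (name characters divided by text characters) is my assumption for this sketch; the article does not spell out how the ratio is computed.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class StatsHandler(ContentHandler):
    """Counts markup and text as SAX events stream past."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.attributes = 0
        self.text_nodes = 0
        self.text_chars = 0
        self.name_chars = 0

    def startElement(self, name, attrs):
        self.elements += 1
        self.name_chars += len(name)
        self.attributes += attrs.getLength()
        for attr_name in attrs.getNames():
            self.name_chars += len(attr_name)                  # names are markup
            self.text_chars += len(attrs.getValue(attr_name))  # values are text

    def characters(self, content):
        # A parser may deliver one run of text in several chunks, so this
        # counts callbacks rather than true text nodes.
        self.text_nodes += 1
        self.text_chars += len(content)

    def tagginess(self):
        # Assumed definition: name characters per text character.
        if self.text_chars == 0:
            return float("inf")
        return self.name_chars / self.text_chars


stats = StatsHandler()
xml.sax.parseString(b'<doc id="7"><a>hi</a><b>there</b></doc>', stats)
print(stats.elements, stats.attributes, stats.name_chars, stats.text_chars)
```

Nothing is stored per node, which is why this style of processing stays fast and small even on large files.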
While I was coding the Filter IVBSAXContentHandler implementation, it occurred to me that adding filtering operations would not be much more work at all. The Filter tab contains the following options:
Here, I have selected the options to load hamlet.xml and convert all the element and attribute names to proper case; for example, <PERSONA> becomes <Persona>, and so forth.
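A name filter like this is just a handler that rewrites names before passing each event along. The Python sketch below shows the shape of it; `ProperCaseFilter`, `proper_case`, and `to_proper_case` are names made up for this example, and the casing rule (first letter upper, the rest lower) is an assumption about what "proper case" means here.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def proper_case(name):
    """PERSONA -> Persona (assumed rule: first letter upper, rest lower)."""
    return name[:1].upper() + name[1:].lower()


class ProperCaseFilter(ContentHandler):
    """Renames elements and attributes, then forwards events downstream."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream

    def startElement(self, name, attrs):
        renamed = {proper_case(key): value for key, value in attrs.items()}
        self.downstream.startElement(proper_case(name), AttributesImpl(renamed))

    def endElement(self, name):
        self.downstream.endElement(proper_case(name))

    def characters(self, content):
        self.downstream.characters(content)  # text passes through untouched


def to_proper_case(data):
    out = io.StringIO()
    # XMLGenerator serializes whatever events the filter forwards to it.
    xml.sax.parseString(data, ProperCaseFilter(XMLGenerator(out)))
    return out.getvalue()


print(to_proper_case(b'<PERSONA TYPE="main">HAMLET</PERSONA>'))
```

Because the filter only touches names, the text content comes through exactly as it went in.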
SAX-level processing also enables you to format an XML document by adding new lines and indentation based on nesting level. You can control the indentation amount and whether to use the space or tab character for indenting. When formatting is set to Indented, the output looks like this:
<test>
  <item></item>
  <name>
    <first>Chris</first>
    <last>Lovett</last>
  </name>
</test>
The algorithm for indenting works by keeping a stack of integers representing the "content" model at each level of the document. The possible values are:
Const CONTENT_EMPTY = 0
Const CONTENT_MIXED = 1
Const CONTENT_ELEMENT = 2
The content model for a new element starts out as CONTENT_EMPTY. When the IVBSAXContentHandler_characters method is called, the content model for the current element is set to CONTENT_MIXED. If the content is not already mixed when a child element is started, the content becomes CONTENT_ELEMENT.
I have not fully tested this code, so I advise against using it for industrial-strength applications. However, it seems to do a pretty nice job most of the time. There are plenty of things you could add to this. For example, you will notice that the empty element <item/> was output as <item></item>. This was simply because I was too lazy to delay writing the ">" character until the endElement event. Doing so makes the code just a little messy, because you have to remember to write out the ">" character before writing any text content or child elements.
Attributes to Elements
Lastly, this little check box causes all attributes to be written out as child elements. For example, when this is turned on, the following XML:
<row au_id='998-72-3567' au_lname='Ringer' au_fname='Albert'
     phone='801 826-0752' address='67 Seventh Av.' city='Salt Lake City'
     state='UT' zip='84152' contract='True'/>
will become (when indenting is also turned on):
<row>
  <au_id>998-72-3567</au_id>
  <au_lname>Ringer</au_lname>
  <au_fname>Albert</au_fname>
  <phone>801 826-0752</phone>
  <address>67 Seventh Av.</address>
  <city>Salt Lake City</city>
  <state>UT</state>
  <zip>84152</zip>
  <contract>True</contract>
</row>
This is done simply by the following code:
If (AttrsToElements) Then
    For i = 0 To Attributes.length - 1
        Name = Attributes.getQName(i)
        ' Namespace declarations are not converted to child elements
        If (Name <> "xmlns" And Mid(Name, 1, 6) <> "xmlns:") Then
            content(Level) = CONTENT_ELEMENT
            Call WriteIndent(Level, content(Level))
            OutputStream.Write ("<" & FilterName(Name) & ">")
            OutputStream.Write (EscContent(Attributes.getValue(i)))
            OutputStream.Write ("</" & FilterName(Name) & ">")
        End If
    Next
End If
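The same transformation is easy to try in Python. In this sketch (my own names: `AttrsToElementsWriter`, `is_namespace_decl`, `attrs_to_elements`), namespace declarations stay on the start tag as attributes, which is an assumption on my part since the Visual Basic fragment above only shows that they are skipped. Indenting is left out to keep the example short, and `urn:x` is a made-up namespace.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr


def is_namespace_decl(name):
    return name == "xmlns" or name.startswith("xmlns:")


class AttrsToElementsWriter(ContentHandler):
    def __init__(self):
        super().__init__()
        self.out = io.StringIO()

    def startElement(self, name, attrs):
        self.out.write("<" + name)
        # Assumption: namespace declarations remain attributes.
        for key, value in attrs.items():
            if is_namespace_decl(key):
                self.out.write(" %s=%s" % (key, quoteattr(value)))
        self.out.write(">")
        # Every other attribute becomes a child element, in document order.
        for key, value in attrs.items():
            if not is_namespace_decl(key):
                self.out.write("<%s>%s</%s>" % (key, escape(value), key))

    def characters(self, text):
        self.out.write(escape(text))

    def endElement(self, name):
        self.out.write("</" + name + ">")


def attrs_to_elements(data):
    writer = AttrsToElementsWriter()
    xml.sax.parseString(data, writer)
    return writer.out.getvalue()


print(attrs_to_elements(b"<row xmlns='urn:x' au_id='998-72-3567' state='UT'/>"))
```

This relies on the parser running in its default non-namespace mode, where xmlns declarations are reported as ordinary attributes.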
The Visual Basic SAX interface included in the July 2000 Microsoft XML Parser Beta Release makes writing high-performance, stream-level XML processing applications pretty easy. It took me about a day to throw together these little samples, and I had a lot of fun. I hope you enjoy using SAX, too.
Chris Lovett is a program manager for Microsoft's XML team.