A Beginner's Guide to the XML DOM

 

Brian Randell
DevelopMentor

October 1999

Summary: This article discusses how to access and manipulate XML documents via the XML DOM implementation, as exposed by the Microsoft® XML Parser. (10 printed pages)

Contents

Introduction
What Exactly is a DOM?
How Do I Use the XML DOM?
How Do I Load a Document?
Dealing with Failure
Retrieving Information from an XML Document
How Do I Traverse a Document?
Now What?

Introduction

You are a Visual Basic® developer and you receive some data in the form of an eXtensible Markup Language (XML) document. You now want to get the information from the XML document and integrate that data into your Visual Basic solutions. You could of course write code yourself to parse the contents of the XML file, which after all is just a text file. However, this isn't very productive and negates one of the strengths of XML: that it is a structured way to represent data.

A better approach to retrieving information from XML files is to use an XML parser. An XML parser is, quite simply, software that reads an XML file and makes available the data in it. As a Visual Basic developer you want to use a parser that supports the XML Document Object Model (DOM). The DOM defines a standard set of commands that parsers should expose so you can access HTML and XML document content from your programs. An XML parser that supports the DOM will take the data in an XML document and expose it via a set of objects that you can program against. In this article, you will learn how to access and manipulate XML documents via the XML DOM implementation, as exposed by the Microsoft® XML Parser (Msxml.dll).

Before you read any further, you should look at a raw XML file to get an idea of how a parser can make your life easier. The following listing shows the contents of the file Cds.xml, which contains compact disc items. Each item contains information such as the artist, title, and tracks.

<?xml version="1.0"?>
<!DOCTYPE compactdiscs SYSTEM "cds.dtd">
<compactdiscs>
  <compactdisc>
    <artist type="individual">Frank Sinatra</artist>
    <title numberoftracks="4">In The Wee Small Hours</title>
    <tracks>
      <track>In The Wee Small Hours</track>
      <track>Mood Indigo</track>
      <track>Glad To Be Unhappy</track>
      <track>I Get Along Without You Very Well</track>
    </tracks>
    <price>$12.99</price>
  </compactdisc>
  <compactdisc>
    <artist type="band">The Offspring</artist>
    <title numberoftracks="4">Americana</title>
    <tracks>
      <track>Welcome</track>
      <track>Have You Ever</track>
      <track>Staring At The Sun</track>
      <track>Pretty Fly (For A White Guy)</track>
    </tracks>
    <price>$12.99</price>
  </compactdisc>
</compactdiscs>

The second line of the previous document references an external DTD, or Document Type Definition file. A DTD defines the layout and expected content for a particular type of XML document. An XML parser can use a DTD to determine if a document is valid. DTDs are just one way you can help a parser validate your documents. Another increasingly popular method to validate documents is XML Schemas. You define schemas using XML in contrast to DTDs, which use their own "interesting" syntax.

The following listing shows the contents of Cds.dtd, the DTD referenced by Cds.xml:

<!ELEMENT compactdiscs (compactdisc*)>
   <!ELEMENT compactdisc (artist, title, tracks, price)>
      <!ENTITY % Type "individual | band">
      <!ELEMENT artist (#PCDATA)>
         <!ATTLIST artist type (%Type;) #REQUIRED>
      <!ELEMENT title (#PCDATA)>
         <!ATTLIST title numberoftracks CDATA #REQUIRED>
      <!ELEMENT tracks (track*)>
      <!ELEMENT price (#PCDATA)>
      <!ELEMENT track (#PCDATA)>

This article doesn't go in depth about DTDs and XML Schemas. The XML Schema Reference is based on the XML-Data note submitted to the W3C.

What Exactly is a DOM?

A DOM for XML is an object model that exposes the contents of an XML document. The W3C's Document Object Model (DOM) Level 1 Specification currently defines what a DOM should expose as properties, methods, and events. Microsoft's implementation of the DOM fully supports the W3C standard and has additional features that make it easier for you to work with XML files from your programs.

How Do I Use the XML DOM?

You use the XML DOM by creating an instance of an XML parser. To make this possible, Microsoft exposes the XML DOM via a set of standard COM interfaces in Msxml.dll. Msxml.dll contains the type library and implementation code for you to work with XML documents. If you're working with a scripting client, such as VBScript executing in Internet Explorer, you use the DOM by using the CreateObject method to create an instance of the Parser object.

Set objParser = CreateObject( "Microsoft.XMLDOM" )

If you are using VBScript from an Active Server Page (ASP), you use Server.CreateObject.

Set objParser = Server.CreateObject( "Microsoft.XMLDOM" )

If you're working with Visual Basic, you can access the DOM by setting a reference to the MSXML type library, provided in Msxml.dll. To use MSXML from within Visual Basic 6.0:

  1. Open the Project References dialog box.
  2. Select Microsoft XML, version 2.0 from the list of available COM objects. If you do not find this item, you'll need to obtain the MSXML library.
  3. You can then create an instance of the Parser object.

Dim xDoc As MSXML.DOMDocument
Set xDoc = New MSXML.DOMDocument

To obtain the MSXML library, you can install Internet Explorer 5.0, which includes the MSXML parser as an integral component.

Once you reference the type library in your Visual Basic project, you can invoke the parser, load a document, and work with the data in the document.

You may now be wondering, "So what am I working with?" If you open the MSXML library and examine its object model using the Visual Basic 6.0 Object Browser, you see that the object model is quite rich. This article demonstrates how you can access an XML document using the DOMDocument class and the IXMLDOMNode interface.


Figure 1. The MSXML parser object model

How Do I Load a Document?

To load an XML document, you must first create an instance of the DOMDocument class:

Dim xDoc As MSXML.DOMDocument
Set xDoc = New MSXML.DOMDocument

Once you obtain a valid reference, open a file using the Load method. The MSXML parser can load XML documents from a local disk, over the network using UNC references, or via a URL.

To load a document from a disk, create the following construct using the Load method:

If xDoc.Load("C:\My Documents\cds.xml") Then
   ' The document loaded successfully.
   ' Now do something interesting.
Else
   ' The document failed to load.
End If

Once you are finished with the document, you need to release your object reference to it. The MSXML parser does not expose an explicit Close method. The best you can do is explicitly set the reference to Nothing.

Set xDoc = Nothing

When you ask the parser to load a file, it does so asynchronously by default. You can change this behavior by manipulating the document's Boolean Async property. It is important that you examine a document's ReadyState property to ensure a document is ready before you start to examine its contents. The ReadyState property can return one of five possible values, as listed below:

Value   State
0       Uninitialized: loading has not started.
1       Loading: while the load method is executing.
2       Loaded: load method is complete.
3       Interactive: enough of the DOM is available for read-only examination and the data has only been partially parsed.
4       Completed: data is loaded and parsed and available for read/write operations.
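
As a minimal sketch of checking ReadyState after an asynchronous load, the following routine polls the property and yields with DoEvents until the parser reports 4 (Completed). The routine name WaitForDocument, the file path, and the 10-second limit are illustrative assumptions, not part of the MSXML library. A polling loop like this is only reasonable for small experiments.

```vb
' Assumes a project reference to Microsoft XML, version 2.0.
' WaitForDocument is a hypothetical helper name.
Public Sub WaitForDocument()
   Dim xDoc As MSXML.DOMDocument
   Dim sngStart As Single

   Set xDoc = New MSXML.DOMDocument
   xDoc.async = True                      ' the default, shown for clarity
   xDoc.Load "C:\My Documents\cds.xml"    ' may return before parsing ends

   ' Yield to other events until ReadyState is 4 (Completed).
   ' The 10-second limit is arbitrary and keeps the loop from
   ' spinning indefinitely.
   sngStart = Timer
   Do While (xDoc.readyState <> 4) And ((Timer - sngStart) < 10)
      DoEvents
   Loop

   ' Completed means parsing has ended, not that it succeeded,
   ' so confirm that a root element exists before using it.
   If xDoc.readyState = 4 And Not xDoc.documentElement Is Nothing Then
      Debug.Print "Root element: " & xDoc.documentElement.nodeName
   Else
      Debug.Print "The document is not available."
   End If

   Set xDoc = Nothing
End Sub
```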

The MSXML parser exposes events that you can use when loading large documents to track the status of the load process. These events are also useful when loading a document from a URL over the Internet asynchronously.

To open a file from a URL, you specify the location of the file using a fully qualified URL. You must include the protocol prefix (for example, https://) in the file location.

Here is an example of loading a file from a URL:

xDoc.async = False
If xDoc.Load("https://www.develop.com/hp/brianr/cds.xml") Then
   ' The document loaded successfully.
   ' Now do something interesting.
Else
   ' The document failed to load.
End If

When you set the document's Async property to False, the parser does not return control to your code until the document is completely loaded and ready for manipulation. If you leave it set to True, you need to either examine the ReadyState property before accessing the document or use the DOMDocument's events to have your code notified when the document is ready.
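
The event-based alternative can be sketched as follows. The code declares the document variable with WithEvents and handles the onreadystatechange event that DOMDocument raises. The names mDoc and StartLoad are illustrative assumptions, and the code must live in a form or class module, because Visual Basic does not allow WithEvents in a standard module.

```vb
' Assumes a project reference to Microsoft XML, version 2.0.
' mDoc and StartLoad are hypothetical names.
Private WithEvents mDoc As MSXML.DOMDocument

Public Sub StartLoad()
   Set mDoc = New MSXML.DOMDocument
   mDoc.async = True
   ' Load returns right away; the event handler below reports
   ' when the document is ready.
   mDoc.Load "https://www.develop.com/hp/brianr/cds.xml"
End Sub

' Raised each time the document's ReadyState value changes.
Private Sub mDoc_onreadystatechange()
   If mDoc.readyState = 4 Then
      ' Completed: confirm a root element exists before using it.
      If Not mDoc.documentElement Is Nothing Then
         Debug.Print "Loaded: " & mDoc.documentElement.nodeName
      Else
         Debug.Print "The document failed to load."
      End If
   End If
End Sub
```

Keeping the document in a module-level variable also keeps the object alive while the load is in progress.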

Dealing with Failure

Your document can fail to load for any number of reasons. A common cause might be that the document name passed to the Load method is invalid. Another cause might be that the XML document itself is invalid.

By default, the MSXML parser will validate your document against a DTD or schema if either has been specified in the document. You can tell the parser not to validate the document by setting the ValidateOnParse property of the DOMDocument object reference before you invoke the Load method.

Dim xDoc As MSXML.DOMDocument
Set xDoc = New MSXML.DOMDocument
xDoc.validateOnParse = False
If xDoc.Load("C:\My Documents\cds.xml") Then
   ' The document loaded successfully.
   ' Now do something interesting.
Else
   ' The document failed to load.
End If

Be forewarned that turning off the parser's validation feature is not a good idea in production applications. An incorrect document can lead to your program failing for any number of reasons. At a minimum, it could provide invalid data to your users.

Regardless of the failure type, you can ask the parser to give you information about the failure by accessing the ParseError object. Set a reference to the IXMLDOMParseError interface of the document itself in order to work with the properties of the ParseError object. The IXMLDOMParseError interface exposes seven properties that you can use to investigate the cause of the error.

The following example displays a message box containing all the error information available from the ParseError object.

Dim xDoc As MSXML.DOMDocument
Set xDoc = New MSXML.DOMDocument
If xDoc.Load("C:\My Documents\cds.xml") Then
   ' The document loaded successfully.
   ' Now do something interesting.
Else
   ' The document failed to load.
   Dim strErrText As String
   Dim xPE As MSXML.IXMLDOMParseError
   ' Obtain the ParseError object
   Set xPE = xDoc.parseError
   With xPE
      strErrText = "Your XML Document failed to load " & _
        "due to the following error." & vbCrLf & _
        "Error #: " & .errorCode & ": " & .reason & vbCrLf & _
        "Line #: " & .Line & vbCrLf & _
        "Line Position: " & .linepos & vbCrLf & _
        "Position In File: " & .filepos & vbCrLf & _
        "Source Text: " & .srcText & vbCrLf & _
        "Document URL: " & .url
   End With

   MsgBox strErrText, vbExclamation
End If

Set xPE = Nothing

You can display the information exposed by the ParseError object to the user, log it to an error file, or use it to try to correct the error yourself.
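
Logging to an error file could look like the following sketch, which uses only standard Visual Basic file I/O. The routine name LogParseError and the log file path are illustrative assumptions.

```vb
' Appends a one-line summary of a parse error to a text file.
' LogParseError is a hypothetical helper, not part of MSXML.
Public Sub LogParseError(ByVal xPE As MSXML.IXMLDOMParseError, _
   ByVal LogPath As String)

   Dim hFile As Integer
   Dim strReason As String

   ' Flatten any line breaks in the reason text so that each
   ' log entry stays on a single line.
   strReason = Trim$(Replace(xPE.reason, vbCrLf, " "))

   hFile = FreeFile
   Open LogPath For Append As #hFile
   Print #hFile, Now & " | " & xPE.url & _
      " | line " & xPE.Line & ", position " & xPE.linepos & _
      " | error " & xPE.errorCode & " | " & strReason
   Close #hFile
End Sub
```

You would call it from the Else branch of a failed load, for example: LogParseError xDoc.parseError, "C:\My Documents\xmlerrors.log"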

Retrieving Information from an XML Document

Once you have a document loaded, the next step is for you to retrieve information from it. While the document object is important, you will find yourself using the IXMLDOMNode interface most of the time. You use the IXMLDOMNode interface to read and write to individual node elements. Before you do anything, you need to understand that there are currently 13 node types supported by the MSXML parser. The following table lists a few of the most common node types you will encounter.

DOM Node Type                  Example
NODE_ELEMENT                   <artist type="band">The Offspring</artist>
NODE_ATTRIBUTE                 type="band"
NODE_TEXT                      The Offspring
NODE_PROCESSING_INSTRUCTION    <?xml version="1.0"?>
NODE_DOCUMENT_TYPE             <!DOCTYPE compactdiscs SYSTEM "cds.dtd">

You access the node type via two properties exposed by the IXMLDOMNode interface. The NodeType property exposes an enumeration of DOMNodeType items (some of which are listed in the previous table). In addition, you can use NodeTypeString to retrieve a textual string for the node type.
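
As a brief sketch of these two properties, the following routine prints the numeric and textual node type of the document itself and of its root element. The routine name ShowNodeTypes and the file path are illustrative assumptions.

```vb
' Assumes a project reference to Microsoft XML, version 2.0.
' ShowNodeTypes is a hypothetical routine name.
Public Sub ShowNodeTypes()
   Dim xDoc As MSXML.DOMDocument
   Dim xRoot As MSXML.IXMLDOMNode

   Set xDoc = New MSXML.DOMDocument
   xDoc.async = False    ' wait for the load to finish

   If xDoc.Load("C:\My Documents\cds.xml") Then
      ' The document object is itself a node (NODE_DOCUMENT).
      Debug.Print xDoc.nodeType & " = " & xDoc.nodeTypeString

      Set xRoot = xDoc.documentElement
      ' Compare against the DOMNodeType enumeration rather than
      ' against a hard-coded number.
      If xRoot.nodeType = NODE_ELEMENT Then
         Debug.Print xRoot.nodeName & ": " & xRoot.nodeType & _
            " = " & xRoot.nodeTypeString
      End If
   End If

   Set xDoc = Nothing
End Sub
```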

Once you have a reference to a document, you can start traversing the node hierarchy. From your document reference, you can access the ChildNodes property, which gives you a top-down entry point to all of the nodes in your document. The ChildNodes property exposes the IXMLDOMNodeList interface, which supports the Visual Basic For Each...Next construct, so you can enumerate all of the individual nodes in the list. In addition, IXMLDOMNodeList exposes a Length property, which returns the number of nodes in the list.

Not only does the document object expose a ChildNodes property, but all individual nodes do, as well. This, in conjunction with IXMLDOMNode's HasChildNodes property, makes it easy for you to traverse the node hierarchy examining elements, attributes, and values.
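
The following sketch combines ChildNodes, Length, and HasChildNodes to report how many child nodes each element directly under the root element contains. The routine name CountChildren and the file path are illustrative assumptions.

```vb
' Assumes a project reference to Microsoft XML, version 2.0.
' CountChildren is a hypothetical routine name.
Public Sub CountChildren()
   Dim xDoc As MSXML.DOMDocument
   Dim xNode As MSXML.IXMLDOMNode

   Set xDoc = New MSXML.DOMDocument
   xDoc.async = False    ' wait for the load to finish

   If xDoc.Load("C:\My Documents\cds.xml") Then
      ' Length reports how many nodes the list holds.
      Debug.Print "Top-level nodes: " & xDoc.childNodes.length

      ' Every node exposes its own ChildNodes list.
      For Each xNode In xDoc.documentElement.childNodes
         If xNode.hasChildNodes Then
            Debug.Print xNode.nodeName & " has " & _
               xNode.childNodes.length & " child node(s)"
         Else
            Debug.Print xNode.nodeName & " has no child nodes"
         End If
      Next xNode
   End If

   Set xDoc = Nothing
End Sub
```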

One thing to be aware of is the parent-child relationship between a document element and the element's value. For example, in the CDs XML document, the element <title> holds an album title. That text is not stored on the element node itself; it is held in a child node of the type NODE_TEXT. To retrieve the actual value of the <title> element, you therefore need to look for nodes of the type NODE_TEXT. Once you've found a node with some interesting data, you can examine attributes and even reach up and access its parent node via the ParentNode property.
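
To make the relationship concrete, here is a minimal sketch that locates the first <title> element in Cds.xml, reads its text from the child NODE_TEXT node, reads the numberoftracks attribute, and then reaches up to the parent element. The routine name ShowFirstTitle and the file path are illustrative assumptions.

```vb
' Assumes a project reference to Microsoft XML, version 2.0.
' ShowFirstTitle is a hypothetical routine name.
Public Sub ShowFirstTitle()
   Dim xDoc As MSXML.DOMDocument
   Dim xTitle As MSXML.IXMLDOMNode
   Dim xText As MSXML.IXMLDOMNode
   Dim xAttr As MSXML.IXMLDOMNode

   Set xDoc = New MSXML.DOMDocument
   xDoc.async = False    ' wait for the load to finish

   If xDoc.Load("C:\My Documents\cds.xml") Then
      ' Item returns Nothing when the list is empty.
      Set xTitle = xDoc.getElementsByTagName("title").Item(0)

      If Not xTitle Is Nothing Then
         ' The element's text lives in a child NODE_TEXT node.
         Set xText = xTitle.firstChild
         If Not xText Is Nothing Then
            If xText.nodeType = NODE_TEXT Then
               Debug.Print "Title: " & xText.nodeValue
            End If
         End If

         ' Attributes are reached through the element node.
         Set xAttr = xTitle.Attributes.getNamedItem("numberoftracks")
         If Not xAttr Is Nothing Then
            Debug.Print "Tracks: " & xAttr.nodeValue
         End If

         ' ParentNode leads back up the hierarchy.
         Debug.Print "Parent: " & xTitle.parentNode.nodeName
      End If
   End If

   Set xDoc = Nothing
End Sub
```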

How Do I Traverse a Document?

In an XML document, you traverse the set of nodes exposed by the document object. Because XML documents are hierarchical in nature, it is relatively easy to write a recursive routine to traverse the entire document.

The LoadDocument routine opens an XML document and then calls another routine, DisplayNode, which actually traverses the document. LoadDocument passes two arguments: a reference to the open document's ChildNodes property and an integer value specifying the starting indent level. DisplayNode uses the Indent parameter to indent its output in the Visual Basic Immediate window so that the display reflects the document structure.

The DisplayNode routine traverses the document looking specifically for nodes of the type NODE_TEXT. Once the code finds a node of the type NODE_TEXT, it retrieves the text of the node using the NodeValue property. In addition, it uses the ParentNode property of the current node to get a back-reference to a node of the type NODE_ELEMENT. Nodes of the type NODE_ELEMENT expose a NodeName property. The contents of NodeName and NodeValue are displayed.

If a node has children, determined by checking the HasChildNodes property, then DisplayNode calls itself recursively until it reaches the end of the document.

The DisplayNode routine writes the information to Visual Basic's Immediate window using Debug.Print:

Public Sub LoadDocument()
   Dim xDoc As MSXML.DOMDocument
   Set xDoc = New MSXML.DOMDocument
   ' Load synchronously so the document is ready when Load returns.
   xDoc.async = False
   xDoc.validateOnParse = False
   If xDoc.Load("C:\My Documents\sample.xml") Then
      ' The document loaded successfully.
      ' Now do something interesting.
      DisplayNode xDoc.childNodes, 0
   Else
      ' The document failed to load.
      ' See the previous listing for error information.
   End If
End Sub

Public Sub DisplayNode(ByRef Nodes As MSXML.IXMLDOMNodeList, _
   ByVal Indent As Integer)

   Dim xNode As MSXML.IXMLDOMNode
   Indent = Indent + 2

   For Each xNode In Nodes
      If xNode.nodeType = NODE_TEXT Then
         Debug.Print Space$(Indent) & xNode.parentNode.nodeName & _
            ":" & xNode.nodeValue
      End If

      If xNode.hasChildNodes Then
         DisplayNode xNode.childNodes, Indent
      End If
   Next xNode
End Sub

DisplayNode uses the HasChildNodes property of IXMLDOMNode, which returns True when a node has one or more children, to decide whether to recurse.

Now What?

This article is just a teaser. You are now ready to dig deeper and expand your knowledge of XML and the MSXML parser. You can do many interesting things like update values of individual node items, search within a document, build your own documents, and more. Visit the MSDN Online XML Developer Center for more examples, articles, and downloads.