The Joy of SAX: a Visual Basic Sample

 

By Martin Naughton

June 2000

Summary: This article outlines an approach to programming the SAX2 interfaces from Microsoft Visual Basic.

Download JoyofSax.exe.

Contents

Introduction
Hello World?
Use the Source, Luke
Character Building Stuff
SAX2 "Jumpstart" for Visual Basic
Wrap it Up
Stop the Parser, I Want to Get Off!
Type Library Hacking

Introduction

One of the key features of the May 2000 MSXML Technology Preview is an implementation of SAX2 (Simple API for XML, version 2). As an introduction to SAX2, the MSDN XML Developer Center provides an article entitled SAX2 Jumpstart for XML Developers and a downloadable Microsoft® Visual C++® application. In this article, I outline an approach to programming the SAX2 interfaces from Visual Basic®. Note that this sample is unsupported and is intended to assist you in prototyping SAX/Visual Basic solutions. In addition, it should be made clear that the interfaces in this sample have no bearing on Microsoft's Visual Basic support for SAX in the future.

Hello World?

With Visual Basic (VB), the typical approach to experimentation with a Component Object Model (COM) component such as the MSXML3.dll file is to create a new Standard EXE project, and then go to the Project/References menu to add a reference to the MSXML3.dll Type Library (Microsoft XML, version 3.0). At this point, the Visual Basic Object Browser can be used to examine the properties, methods, and events for the interface of your choice.

I did this with MSXML3.dll, and then went looking (in vain) for the SAX2 interfaces, but found nothing remotely SAX-y in this Type Library. I then checked the registry to verify that I had installed the correct version of MSXML3.dll. Sure enough, there were two promising-looking ProgIDs under the HKEY_CLASSES_ROOT key (specifically Msxml2.SAXXMLReader and Msxml2.SAXXMLReader.3.0), so I fired off one of those "Am I missing something?" type questions to a couple of related newsgroups. Quick as a flash, a response came back indicating that I was indeed missing something.

Use the Source, Luke

The SAX2 interfaces are defined in a file called Xmlsax.idl. If you installed MSXML3.dll into the default folder, then Xmlsax.idl can be found in the folder called C:\Program Files\Microsoft XML Parser SDK\inc. After finding the file, I compiled Xmlsax.idl into a Type Library file called, appropriately enough, Xmlsax.tlb. The MIDL IDL compiler was the tool I used to do the compilation.

Back in the Visual Basic integrated development environment (IDE), I returned to the Project/References menu, and this time I selected Browse from the References dialog box to locate the new Xmlsax.tlb file. As a result of selecting a Type Library by this method, the VB IDE first registers the Type Library file. This was much more encouraging. The SAX2 interfaces were now showing up in the Object Browser.

Character Building Stuff

An inspection of the parameters of some of the methods confirmed what my newsgroup response had suggested: the interface was a little Visual Basic unfriendly. Remember, the documentation in the Microsoft XML SDK 3.0 describes the SAX2 interface as "Microsoft's COM/C++ implementation of SAX2." To summarize, wherever a Visual Basic developer expects to see a string parameter, there are, in fact, two parameters. The first of these parameters has a "pwch" Hungarian prefix ("pointer to an array of wide char"?) and the second has a "cch" prefix ("count of char"?).

Here's what the ISAXContentHandler IDL looks like for the StartElement method:

HRESULT StartElement(
        [in] const wchar_t * pwchNamespaceUri,
        [in] int cchNamespaceUri,
        [in] const wchar_t * pwchLocalName,
        [in] int cchLocalName,
        [in] const wchar_t * pwchQName,
        [in] int cchQName,
        [in] ISAXAttributes * pAttributes);

Luckily, the newsgroup response provided a snippet of Visual Basic code to process these parameters. A simplified version of this function is given here.

Public Function UnicodeArrayToString(pwchArray As Integer, ByVal lCharCount As Long) As String

Dim sText As String

' size the string to the correct no of chars
sText = String$(lCharCount, 0)

' copy over the values. note the size is twice the number of chars because
' the string is Unicode (two bytes per char)
Call CopyMemory(ByVal StrPtr(sText), pwchArray, lCharCount * 2)

UnicodeArrayToString = sText

End Function

The XMLSAX interface passes the first element of an array of Visual Basic integers (one 16-bit Integer per Unicode character), plus the number of items in the array. The UnicodeArrayToString function uses the undocumented Visual Basic StrPtr function, plus a call to the CopyMemory Windows API, to convert these two values into a Visual Basic string. CopyMemory is an alias for the Windows API RtlMoveMemory and must be declared with a Declare statement before it can be called.
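As a quick way to try the conversion without a parser, the following sketch round-trips a string through an Integer array. It is a minimal example for a standard module, not code from the download: DemoRoundTrip and the aChars array are names invented here, the function is repeated only so that the module compiles on its own, and the zero-length guard is an addition.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

' Same conversion as above, with a guard for empty input (an addition).
Public Function UnicodeArrayToString(pwchArray As Integer, _
    ByVal lCharCount As Long) As String

    Dim sText As String

    If lCharCount <= 0 Then Exit Function

    sText = String$(lCharCount, 0)
    CopyMemory ByVal StrPtr(sText), pwchArray, lCharCount * 2
    UnicodeArrayToString = sText

End Function

' Hypothetical test routine: string -> Integer array -> string.
Public Sub DemoRoundTrip()

    Dim sIn As String
    Dim aChars() As Integer

    sIn = "Hello"

    ' one Integer (2 bytes) per Unicode character
    ReDim aChars(0 To Len(sIn) - 1)
    CopyMemory aChars(0), ByVal StrPtr(sIn), Len(sIn) * 2

    ' pass the first element ByRef, exactly as the SAX handler receives it
    Debug.Print UnicodeArrayToString(aChars(0), Len(sIn))   ' prints Hello

End Sub
```

Running DemoRoundTrip from the Immediate window should echo the original text, which confirms that passing the first array element by reference is enough for CopyMemory to reach the whole buffer.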

SAX2 "Jumpstart" for Visual Basic

We're now ready to write a version of the "SAX2 Jumpstart" in Visual Basic. To get a result similar to the C++ "Jumpstart" application, use the following instructions. Note that a reference to MSXML is not required.

  1. Create a Visual Basic Standard EXE project.
  2. Add a reference to the XMLSAX.tlb file (included in the download).
  3. Copy and then paste the code that appears below into a Visual Basic Form code window. (The long statements are wrapped across several lines here, so after you paste the code you'll need to remove those carriage returns before it will run.)
  4. Add a CommandButton (Command1) to the form.
  5. Save the project.
  6. If it's not already there, place the valid XML file called Test.xml in the same folder where you saved the project.
  7. Run the project in Debug mode.
  8. Click the Command1 button.
  9. Watch the Immediate window.

Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDest
  As Any, pSource As Any, ByVal ByteLen As Long)

Implements XMLSAX.ISAXContentHandler

Private Sub Command1_Click()

Call Parse

End Sub

Private Function Parse() As Boolean

Dim i() As Integer
Dim str As String

Dim sax As SAXXMLReader30

Set sax = New SAXXMLReader30

Call sax.PutContentHandler(Me)

str = App.Path & "\" & "test.xml"

' size the array to the number of characters
ReDim i(Len(str))

' copy the memory, note that the size is twice the number of chars as it is
'a unicode (double byte) string
CopyMemory i(0), ByVal StrPtr(str), Len(str) * 2

' pass the first array entry in
sax.ParseURL i(0), Len(str)

' report success; a failed parse raises a run-time error before this line
Parse = True

End Function

Private Sub ISAXContentHandler_Characters(pwchChars As Integer, ByVal
  cchChars As Long)

'Do Nothing

End Sub

Private Sub ISAXContentHandler_EndDocument()

'Do Nothing


End Sub

Private Sub ISAXContentHandler_EndElement(pwchNamespaceUri As Integer,
  ByVal cchNamespaceUri As Long, pwchLocalName As Integer, ByVal
  cchLocalName As Long, pwchQName As Integer, ByVal cchQName As Long)

'Do Nothing

End Sub

Private Sub ISAXContentHandler_EndPrefixMapping(pwchPrefix As Integer,
  ByVal cchPrefix As Long)

'Do Nothing


End Sub

Private Sub ISAXContentHandler_IgnorableWhitespace(pwchChars As Integer,
  ByVal cchChars As Long)

'Do Nothing


End Sub

Private Sub ISAXContentHandler_ProcessingInstruction(pwchTarget As
  Integer, ByVal cchTarget As Long, pwchData As Integer, ByVal cchData
  As Long)

'Do Nothing


End Sub

Private Sub ISAXContentHandler_PutDocumentLocator(ByVal pLocator As
  XMLSAX.ISAXLocator)

'Do Nothing

End Sub

Private Sub ISAXContentHandler_SkippedEntity(pwchName As Integer, ByVal
  cchName As Long)

'Do Nothing


End Sub

Private Sub ISAXContentHandler_StartDocument()

End Sub

Private Sub ISAXContentHandler_StartElement(pwchNamespaceUri As Integer,
  ByVal cchNamespaceUri As Long, pwchLocalName As Integer, ByVal
  cchLocalName As Long, pwchQName As Integer, ByVal cchQName As Long,
  ByVal pAttributes As XMLSAX.ISAXAttributes)

Dim str As String

' size the string to the correct no of chars
str = String$(cchLocalName, 0)

' copy over the values. note the size is twice no of chars because the
' string is unicode (double byte)
CopyMemory ByVal StrPtr(str), pwchLocalName, cchLocalName * 2

Debug.Print str

End Sub

Private Sub ISAXContentHandler_StartPrefixMapping(pwchPrefix As Integer,
  ByVal cchPrefix As Long, pwchUri As Integer, ByVal cchUri As Long)

'Do Nothing


End Sub

Wrap it Up

Once I got over the initial hurdles of programming SAX2 from Visual Basic, I decided that the best thing to do in the long-term was to encapsulate all the tricky stuff inside Visual Basic wrapper classes for all the SAX2 interfaces.

The result of this work can be seen in the download available with this article. It's not complete, but it demonstrates all the important points. The example code can be compiled into an ActiveX® component (VBXMLSAX), and then reused (in Visual Basic, VBScript, JScript® or even Visual C++) without knowledge of the implementation details. For those of you who wish to understand some of those implementation issues, read on.

Stop the Parser, I Want to Get Off!

One of the nice features of an event-based parser like SAX2 is the ability to abort parsing if user-defined conditions are met. For example, we may wish to stop parsing if an element called "UNINTERESTING" is encountered in the XML document. SAX2 handler interfaces allow the user to abort the parse. We will attempt to expose the methods on the handler interfaces as Visual Basic events in our wrapper class, such that the event sink (for example, a VB Form object) can control the termination of parsing.

A glance at the documentation for the handler family of interfaces tells us that a handler method indicates the need to abort parsing by setting the return value (an HRESULT) to the symbolic E_FAIL value. There's only one problem: VB hides the HRESULT, so you can't set this value in your custom handler code.

In fact, you can set this value, but you have to jump through some hoops to do it. The technique is called "vtable modification" and is described in Bruce McKinney's book Hardcore Visual Basic. It is possible at run time to redirect the call from the Visual Basic-provided event handler to your own function. The key to this technique is to use the AddressOf operator to obtain the overriding function's address, and then to use CopyMemory to overwrite the appropriate interface's vtable entry. Because the overriding function is defined by you (and not by Visual Basic), its return value is under your control.
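To make the idea concrete, here is a rough sketch of what patching one vtable entry might look like in a standard module. It is not the code from the book or from the download. ReplaceVTableEntry is a hypothetical helper, the sketch assumes 32-bit Visual Basic 6 (4-byte pointers), and the VirtualProtect call is an extra precaution added here in case the vtable memory is not writable.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

Private Declare Function VirtualProtect Lib "kernel32" _
    (ByVal lpAddress As Long, ByVal dwSize As Long, _
    ByVal flNewProtect As Long, lpflOldProtect As Long) As Long

Private Const PAGE_EXECUTE_READWRITE As Long = &H40
Private Const PTR_SIZE As Long = 4      ' assumption: 32-bit process

' Hypothetical helper. Overwrites slot lSlot (zero-based) in the vtable of
' the interface that pInterface points to. Returns the previous function
' address so it can be restored later, or 0 if nothing was changed.
Public Function ReplaceVTableEntry(ByVal pInterface As Long, _
    ByVal lSlot As Long, ByVal pfnNew As Long) As Long

    Dim pVTable As Long
    Dim pSlot As Long
    Dim pfnOld As Long
    Dim lOldProtect As Long

    If pInterface = 0 Or pfnNew = 0 Then Exit Function

    ' the first pointer-sized value behind an interface pointer
    ' is the address of the vtable
    CopyMemory pVTable, ByVal pInterface, PTR_SIZE
    pSlot = pVTable + lSlot * PTR_SIZE

    ' remember the current entry
    CopyMemory pfnOld, ByVal pSlot, PTR_SIZE

    ' make the slot writable, patch it, then restore the protection
    If VirtualProtect(pSlot, PTR_SIZE, PAGE_EXECUTE_READWRITE, _
        lOldProtect) = 0 Then Exit Function
    CopyMemory ByVal pSlot, pfnNew, PTR_SIZE
    VirtualProtect pSlot, PTR_SIZE, lOldProtect, lOldProtect

    ReplaceVTableEntry = pfnOld

End Function
```

A caller would obtain pInterface by applying ObjPtr to a variable declared As ISAXContentHandler, and pfnNew by passing AddressOf followed by the name of a function in a BAS file. The slot number counts the three IUnknown entries first and then the interface's own methods in IDL order, so it has to be checked against the Type Library in use rather than assumed.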

Let's look at an example. The ISAXContentHandler interface provides a method called StartDocument. If we create a Visual Basic class wrapper that implements ISAXContentHandler (CVBSAXContentHandler), Visual Basic provides us with a handler routine prototype like the following:

Private Sub ISAXContentHandler_StartDocument()

Behind the scenes, the function prototype is really this:

Public Function ISAXContentHandler_StartDocument(ByVal This As
  ISAXContentHandler) As Long

In fact, the return type is HRESULT, but the Visual Basic Long data type can be used to store an HRESULT value. Therefore, to override the StartDocument method, we create a function with the latter prototype in a BAS file. It is this overriding function that is called at run time, instead of the routine in the Visual Basic Class wrapper. Note the addition of a parameter called This. It is the instance of the ISAXContentHandler handler that is being called. Here is the override function for StartDocument:

Public Function ISAXContentHandler_StartDocument(ByVal This As ISAXContentHandler) As Long

#If iDebug = -1 Then
  Debug.Print "VTable Replacement for ISAXContentHandler_StartDocument"
#End If

  Dim booAbort As Boolean

  Dim objVBSAXContentHandler As CVBSAXContentHandler
  Set objVBSAXContentHandler = This

  Call objVBSAXContentHandler.StartDocument(booAbort)

  If booAbort Then
    ISAXContentHandler_StartDocument = E_FAIL
  Else
    ISAXContentHandler_StartDocument = S_OK
  End If

End Function

There is a problem with using this technique. The AddressOf operator can be applied only to functions defined in a standard module (a BAS file). Therefore, the same function will be called for all instances of a particular handler interface. This gives us the problem of determining which instance of our handler wrapper is involved because it is the wrapper that raises the event.

Luckily, the required format of a vtable override function contains the This parameter (see previous code sample), whose type is the interface being called. From this parameter, we can perform the equivalent of QueryInterface by declaring a variable of type CVBSAXContentHandler, then making the following assignment.

Dim objVBSAXContentHandler As CVBSAXContentHandler
Set objVBSAXContentHandler = This

The assignment effectively obtains the Visual Basic wrapper interface. We can then call the wrapper, so that it raises the event in the expected manner.

To achieve this effect, we define a method on CVBSAXContentHandler with "Friend" scope. In this way, code internal to the VBXMLSAX component can access the function but external clients cannot even see it. The implementation of this Friend method simply raises an event, passing all the original parameters, plus the Abort flag (passed ByRef), so the code handling this event can set the flag. Here's the code:

Friend Sub StartDocument(Abort As Boolean)

  RaiseEvent StartDocument(Abort)

End Sub

Upon return from the StartDocument event handler in the client application (for example, a form), the Friend StartDocument method in CVBSAXContentHandler simply passes back the value of the Abort parameter to the vtable override function. Based on the value of the Abort flag, the vtable override can appropriately set the HRESULT return value of the underlying ISAXContentHandler.StartDocument method.

Other Handler methods can be overridden in a similar fashion. Here's the prototype for the function to override the StartElement method, which has parameters of its own.

Public Function ISAXContentHandler_StartElement(ByVal This As
  ISAXContentHandler, pwchNamespaceUri As Integer, ByVal cchNamespaceUri
  As Long, pwchLocalName As Integer, ByVal cchLocalName As Long, pwchQName
  As Integer, ByVal cchQName As Long, ByVal pAttributes As
  XMLSAX.ISAXAttributes) As Long

The Friend equivalent looks like the following:

Friend Sub StartElement(ByVal NamespaceUri As String, ByVal LocalName As
  String, ByVal QName As String, ByVal Attributes As CVBSAXAttributes,
  Abort As Boolean)

  RaiseEvent StartElement(NamespaceUri, LocalName, QName, Attributes,
  Abort)

End Sub

Where a method has its own parameters, it is the vtable override function that performs the mapping from C++ style parameters to Visual Basic friendly strings.
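The ByRef Abort flag is the piece that carries the client's decision back to the wrapper. The following stripped-down sketch shows that pattern on its own, with no SAX dependency. CDemoHandler and FireStartElement are invented for illustration and stand in for the real wrapper class and its Friend method. The method is Public here only to keep the sketch short.

```vb
' ---- Class module CDemoHandler (hypothetical stand-in for the wrapper) ----
Option Explicit

Public Event StartElement(ByVal LocalName As String, Abort As Boolean)

' Raises the event and reports whether the event sink asked to stop.
Public Function FireStartElement(ByVal LocalName As String) As Boolean

    Dim booAbort As Boolean

    ' Abort is passed ByRef, so the sink can change booAbort
    RaiseEvent StartElement(LocalName, booAbort)
    FireStartElement = booAbort

End Function

' ---- Form module Form1 (the event sink) ----
Option Explicit

Private WithEvents m_oHandler As CDemoHandler

Private Sub Form_Load()

    Set m_oHandler = New CDemoHandler

    Debug.Print m_oHandler.FireStartElement("TITLE")          ' prints False
    Debug.Print m_oHandler.FireStartElement("UNINTERESTING")  ' prints True

End Sub

Private Sub m_oHandler_StartElement(ByVal LocalName As String, _
    Abort As Boolean)

    ' ask the caller to stop when the unwanted element shows up
    If LocalName = "UNINTERESTING" Then Abort = True

End Sub
```

In the real component, the value that FireStartElement returns here corresponds to the flag the vtable override inspects before choosing between S_OK and E_FAIL.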

Type Library Hacking

It turns out that, in some cases, the Xmlsax.tlb Type Library produced by compiling the supplied Xmlsax.idl file is not merely Visual Basic-unfriendly, but is in fact unusable in Visual Basic. Specifically, parameters to methods defined in IDL as plain [out] are unacceptable to Visual Basic. Visual Basic accepts an output parameter only if it is defined as [in, out] (a ByRef parameter) or [out, retval] (the return value). Some of the methods in the SAX2 interfaces use plain [out] parameters.

In addition, I ran into trouble calling most of the methods on the ISAXAttributes interface. For example, the GetLocalName method yields a subroutine in Visual Basic. The caller passes an index into the Attributes collection. The routine returns the by-now-familiar array of integers and array-length parameters. The IDL looks like the following:

HRESULT GetLocalName(
        [in] int nIndex,
        [out] const wchar_t ** ppwchLocalName,
        [out] int * pcchLocalName);

I knew that the [out] parameters are not acceptable in Visual Basic. Therefore, I decided to change the IDL for the [out] parameters to [in, out]. However, I wasn't sure how I was going to decode that const wchar_t ** data type. I noticed that the Hardcore Visual Basic book delivers a function called UPointerToString, which looked promising. However, UPointerToString expects a pointer (As Long in VB terms). Therefore, I decided to alter the data type from const wchar_t ** to int *. The modified IDL looks like the following:

HRESULT GetLocalName(
        [in] int nIndex,
        [in, out] int * ppwchLocalName,
        [in, out] int * pcchLocalName);

This format turns out to be acceptable in Visual Basic. The string value is decoded successfully with UPointerToStringEx (same as UPointerToString, but it takes the character count as a parameter).

The methods on ISAXAttributes can now be called as subroutines:

  Call pAttributes.GetValue(0, pwchValue, cchValue)

The above call gets the value for the attribute at index position 0, storing the pointer to the value in pwchValue and the length of the value in cchValue. The VB string is then built with the following code (similar to the copies made earlier).

  sAttValue = String$(cchValue, 0)
  CopyMemory ByVal StrPtr(sAttValue), ByVal pwchValue, cchValue * 2

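Note that pwchValue must be prefaced with ByVal. The variable holds the address of the characters, so ByVal passes that address to CopyMemory; without it, CopyMemory would copy from the address of the pwchValue variable itself.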
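The same copy can be packaged as a small helper that takes the pointer and the character count. The sketch below is written for illustration and is not the UPointerToStringEx routine from the book. PointerToString and DemoPointerToString are hypothetical names, and the demo uses StrPtr on an ordinary string as a stand-in for the pointer that GetValue hands back.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

' Hypothetical helper: builds a VB string from the address of a run of
' Unicode characters (pwch) and the number of characters (cch).
Public Function PointerToString(ByVal pwch As Long, _
    ByVal cch As Long) As String

    Dim sText As String

    ' nothing to copy for a null pointer or an empty value
    If pwch = 0 Or cch <= 0 Then Exit Function

    sText = String$(cch, 0)

    ' ByVal pwch: copy from the address held in pwch,
    ' not from the address of the pwch variable
    CopyMemory ByVal StrPtr(sText), ByVal pwch, cch * 2

    PointerToString = sText

End Function

' Hypothetical test routine that needs no parser.
Public Sub DemoPointerToString()

    Dim sSource As String

    sSource = "id=42"
    Debug.Print PointerToString(StrPtr(sSource), Len(sSource))  ' prints id=42

End Sub
```

With a helper like this, the two lines after the GetValue call collapse into a single assignment such as sAttValue = PointerToString(pwchValue, cchValue).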

I copied the Xmlsax.idl file, called it XmlsaxVB.idl, then modified the copy before finally compiling the modified copy into XmlsaxVB.tlb. It is this modified Type Library that is referenced in the ActiveX component project included in the download.

Of course, modifying IDL files is not recommended as a general rule. If you do modify one, the resulting Type Library should never leave the computer you are developing on. If you create a Setup program with the Package and Deployment Wizard, such a Type Library should not be included in the Setup package. However, since the MSXML component discussed in this article is a Technology Preview, there's no danger that anyone will try to ship this example in a production environment.