The Joy of SAX: a Visual Basic Sample
By Martin Naughton
Summary: This article outlines an approach to programming the SAX2 interfaces from Microsoft Visual Basic.
Contents
- Use the Source, Luke
- Character Building Stuff
- SAX2 "Jumpstart" for Visual Basic
- Wrap it Up
- Stop the Parser, I Want to Get Off!
- Type Library Hacking
One of the key features of the May 2000 MSXML Technology Preview is an implementation of SAX2 (Simple API for XML, version 2). As an introduction to SAX2, the MSDN XML Developer Center provides an article entitled SAX2 Jumpstart for XML Developers and a downloadable Microsoft® Visual C++® application. In this article, I outline an approach to programming the SAX2 interfaces from Visual Basic®. Note that this sample is unsupported and is intended to assist you in prototyping SAX/Visual Basic solutions. In addition, it should be made clear that the interfaces in this sample have no bearing on Microsoft's Visual Basic support for SAX in the future.
With Visual Basic (VB), the typical approach to experimentation with a Component Object Model (COM) component such as the MSXML3.dll file is to create a new Standard EXE project, and then go to the Project/References menu to add a reference to the MSXML3.dll Type Library (Microsoft XML, version 3.0). At this point, the Visual Basic Object Browser can be used to examine the properties, methods, and events for the interface of your choice.
I did this with MSXML3.dll and went looking for the SAX2 interfaces, but found nothing remotely SAX-y in this Type Library. I then checked the registry to verify that I had installed the correct version of MSXML3.dll. Sure enough, there were two promising-looking ProgIDs under the HKEY_CLASSES_ROOT key (specifically Msxml2.SAXXMLReader and Msxml2.SAXXMLReader.3.0), so I fired off one of those "Am I missing something?" questions to a couple of related newsgroups. Quick as a flash, a response came back indicating that I was indeed missing something.
Use the Source, Luke
The SAX2 interfaces are defined in a file called Xmlsax.idl. If you installed MSXML3.dll into the default folder, Xmlsax.idl can be found in C:\Program Files\Microsoft XML Parser SDK\inc. After finding the file, I used the MIDL compiler to compile Xmlsax.idl into a Type Library file called, appropriately enough, Xmlsax.tlb.
Back in the Visual Basic integrated development environment (IDE), I returned to the Project/References menu, and this time I selected Browse from the References dialog box to locate the new Xmlsax.tlb file. As a result of selecting a Type Library by this method, the VB IDE first registers the Type Library file. This was much more encouraging. The SAX2 interfaces were now showing up in the Object Browser.
Character Building Stuff
An inspection of the parameters of some of the methods confirmed what my newsgroup response had suggested: the interface was a little Visual Basic unfriendly. Remember, the documentation in the Microsoft XML SDK 3.0 describes the SAX2 interface as "Microsoft's COM/C++ implementation of SAX2." To summarize, wherever a Visual Basic developer expects to see a string parameter, there are, in fact, two parameters. The first of these parameters has a "pwch" Hungarian prefix ("pointer to an array of wide char"?) and the second has a "cch" prefix ("count of char"?).
Here's what the ISAXContentHandler IDL looks like for the StartElement method:
HRESULT StartElement(
    [in] const wchar_t * pwchNamespaceUri,
    [in] int cchNamespaceUri,
    [in] const wchar_t * pwchLocalName,
    [in] int cchLocalName,
    [in] const wchar_t * pwchQName,
    [in] int cchQName,
    [in] ISAXAttributes * pAttributes);
Luckily, the newsgroup response provided a snippet of Visual Basic code to process these parameters. A simplified version of this function is given here.
Public Function UnicodeArrayToString(pwchArray As Integer, ByVal lCharCount As Long) As String
    Dim sText As String
    ' size the string to the correct no of chars
    sText = String$(lCharCount, 0)
    ' copy over the values. note the size is twice no of chars because the
    ' string is Unicode (double byte)
    Call CopyMemory(ByVal StrPtr(sText), pwchArray, lCharCount * 2)
    UnicodeArrayToString = sText
End Function
The XMLSAX interface returns an array of Visual Basic integers, plus the number of items in the array. The UnicodeArrayToString function uses the undocumented Visual Basic StrPtr function, plus a call to the CopyMemory Windows API to convert these two values into a Visual Basic string. CopyMemory is an alias for the Windows API RtlMoveMemory.
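To see the conversion in isolation, the following minimal sketch round-trips an ordinary Visual Basic string through an Integer array and back, without involving the parser at all. The DemoRoundTrip routine and the zero-length guard are additions for illustration and are not part of the newsgroup code. The sketch assumes it is pasted into a standard module (BAS file) of a 32-bit Visual Basic 6 project.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

' Builds a VB string from the first element of an array of 16-bit
' character codes plus a character count.
Public Function UnicodeArrayToString(pwchArray As Integer, _
                                     ByVal lCharCount As Long) As String
    Dim sText As String

    ' Assumption for this sketch: an empty or negative count yields "".
    If lCharCount <= 0 Then Exit Function

    ' Size the string first, then copy two bytes per character into it.
    sText = String$(lCharCount, 0)
    CopyMemory ByVal StrPtr(sText), pwchArray, lCharCount * 2
    UnicodeArrayToString = sText
End Function

' Hypothetical demo: packs a string into an Integer array (the shape in
' which the SAX2 handlers receive text) and converts it back.
Public Sub DemoRoundTrip()
    Dim sSource As String
    Dim aChars() As Integer

    sSource = "catalog"
    ReDim aChars(0 To Len(sSource) - 1)
    CopyMemory aChars(0), ByVal StrPtr(sSource), Len(sSource) * 2

    ' Writes the rebuilt string to the Immediate window.
    Debug.Print UnicodeArrayToString(aChars(0), Len(sSource))
End Sub
```

Running DemoRoundTrip from the Immediate window is a quick way to check the declaration of CopyMemory before wiring the function into a handler.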
SAX2 "Jumpstart" for Visual Basic
We're now ready to write a version of the "SAX2 Jumpstart" in Visual Basic. To get a result similar to the C++ "Jumpstart" application, use the following instructions. Note that a reference to MSXML is not required.
- Create a Visual Basic Standard EXE project.
- Add a reference to the XMLSAX.tlb file (included in the download).
- Copy and then paste the code that appears below into a Visual Basic Form code window. (If any statement was wrapped onto a second line when the code was displayed, remove the extra line break before running it.)
- Add a CommandButton (Command1) to the form.
- Save the project.
- If it's not already there, place the valid XML file called Test.xml in the same folder where you saved the project.
- Run the project in Debug mode.
- Click the Command1 button.
- Watch the Immediate window.
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

Implements XMLSAX.ISAXContentHandler

Private Sub Command1_Click()
    Call Parse
End Sub

Private Function Parse() As Boolean
    Dim i() As Integer
    Dim str As String
    Dim sax As SAXXMLReader30
    Set sax = New SAXXMLReader30
    Call sax.PutContentHandler(Me)
    str = App.Path & "\" & "test.xml"
    ' size the array to the number of characters
    ReDim i(Len(str))
    ' copy the memory, note that the size is twice the number of chars as it is
    ' a unicode (double byte) string
    CopyMemory i(0), ByVal StrPtr(str), Len(str) * 2
    ' pass the first array entry in
    sax.ParseURL i(0), Len(str)
End Function

Private Sub ISAXContentHandler_Characters(pwchChars As Integer, ByVal cchChars As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_EndDocument()
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_EndElement(pwchNamespaceUri As Integer, _
    ByVal cchNamespaceUri As Long, pwchLocalName As Integer, _
    ByVal cchLocalName As Long, pwchQName As Integer, ByVal cchQName As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_EndPrefixMapping(pwchPrefix As Integer, ByVal cchPrefix As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_IgnorableWhitespace(pwchChars As Integer, ByVal cchChars As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_ProcessingInstruction(pwchTarget As Integer, _
    ByVal cchTarget As Long, pwchData As Integer, ByVal cchData As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_PutDocumentLocator(ByVal pLocator As XMLSAX.ISAXLocator)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_SkippedEntity(pwchName As Integer, ByVal cchName As Long)
    'Do Nothing
End Sub

Private Sub ISAXContentHandler_StartDocument()
End Sub

Private Sub ISAXContentHandler_StartElement(pwchNamespaceUri As Integer, _
    ByVal cchNamespaceUri As Long, pwchLocalName As Integer, _
    ByVal cchLocalName As Long, pwchQName As Integer, ByVal cchQName As Long, _
    ByVal pAttributes As XMLSAX.ISAXAttributes)
    Dim str As String
    ' size the string to the correct no of chars
    str = String$(cchLocalName, 0)
    ' copy over the values. note the size is twice no of chars because the
    ' string is unicode (double byte)
    CopyMemory ByVal StrPtr(str), pwchLocalName, cchLocalName * 2
    Debug.Print str
End Sub

Private Sub ISAXContentHandler_StartPrefixMapping(pwchPrefix As Integer, _
    ByVal cchPrefix As Long, pwchUri As Integer, ByVal cchUri As Long)
    'Do Nothing
End Sub
Wrap it Up
Once I got over the initial hurdles of programming SAX2 from Visual Basic, I decided that the best thing to do in the long-term was to encapsulate all the tricky stuff inside Visual Basic wrapper classes for all the SAX2 interfaces.
The result of this work can be seen in the download available with this article. It's not complete, but it demonstrates all the important points. The example code can be compiled into an ActiveX® component (VBXMLSAX), and then reused (in Visual Basic, VBScript, JScript® or even Visual C++) without knowledge of the implementation details. For those of you who wish to understand some of those implementation issues, read on.
Stop the Parser, I Want to Get Off!
One of the nice features of an event-based parser like SAX2 is the ability to abort parsing if user-defined conditions are met. For example, we may wish to stop parsing if an element called "UNINTERESTING" is encountered in the XML document. SAX2 handler interfaces allow the user to abort the parse. We will attempt to expose the methods on the handler interfaces as Visual Basic events in our wrapper class, such that the event sink (for example, a VB Form object) can control the termination of parsing.
A glance at the documentation for the handler family of interfaces tells us that a handler method indicates the need to abort parsing by setting its return value (an HRESULT) to E_FAIL. There's only one problem: VB hides the HRESULT, so you can't set this value in your custom handler code.
In fact, you can set this value, but you have to jump through some hoops to do it. The technique is called "vtable modification" and is described in Bruce McKinney's book Hardcore Visual Basic. At run time, it is possible to redirect the call from the Visual Basic-provided event handler to your own function. The key to this technique is to use the AddressOf operator to obtain the overriding function's address, and then to use CopyMemory to overwrite the appropriate entry in the interface's vtable. Because the overriding function is defined by you (and not by Visual Basic), its return value is under your control.
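The mechanics of overwriting a vtable entry can be sketched as follows. This is a minimal illustration and not the code from the download: ReplaceVTableEntry and DemoOnFakeVTable are hypothetical names, the sketch assumes 32-bit Visual Basic 6 (4-byte pointers), and the call to VirtualProtect is a defensive assumption in case the vtable memory is not writable. The demo routine exercises the helper on a fake vtable (a plain array of Longs), so it can be tried safely without touching a real object.

```vb
Option Explicit

Private Const PAGE_EXECUTE_READWRITE As Long = &H40

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)
Private Declare Function VirtualProtect Lib "kernel32" _
    (ByVal lpAddress As Long, ByVal dwSize As Long, _
     ByVal flNewProtect As Long, lpflOldProtect As Long) As Long

' Overwrites entry lEntryIndex (zero-based) of the vtable that pInterface
' points at, and returns the address that was stored there before.
Public Function ReplaceVTableEntry(ByVal pInterface As Long, _
                                   ByVal lEntryIndex As Long, _
                                   ByVal pfnNew As Long) As Long
    Dim pVTable As Long
    Dim pEntry As Long
    Dim pfnOld As Long
    Dim lOldProtect As Long
    Dim lIgnored As Long

    ' The first 4 bytes of a COM interface pointer's target hold the
    ' address of the vtable.
    CopyMemory pVTable, ByVal pInterface, 4
    pEntry = pVTable + lEntryIndex * 4

    ' Remember the current entry so that it can be restored later.
    CopyMemory pfnOld, ByVal pEntry, 4

    ' Make the slot writable, overwrite it, then restore the protection.
    If VirtualProtect(pEntry, 4, PAGE_EXECUTE_READWRITE, lOldProtect) <> 0 Then
        CopyMemory ByVal pEntry, pfnNew, 4
        VirtualProtect pEntry, 4, lOldProtect, lIgnored
    End If

    ReplaceVTableEntry = pfnOld
End Function

' Hypothetical demo on a fake "object": a Long that points at an array of
' Longs standing in for a vtable. No real function addresses are involved.
Public Sub DemoOnFakeVTable()
    Dim aSlots(0 To 2) As Long
    Dim pFakeObject As Long
    Dim pfnOld As Long

    aSlots(1) = 1111
    pFakeObject = VarPtr(aSlots(0))

    pfnOld = ReplaceVTableEntry(VarPtr(pFakeObject), 1, 2222)
    Debug.Print pfnOld, aSlots(1)
End Sub

' With a real handler, the call might look like this (names assumed):
'   Dim oHandler As ISAXContentHandler
'   Set oHandler = Me
'   pfnOld = ReplaceVTableEntry(ObjPtr(oHandler), lIndex, _
'                               AddressOf ISAXContentHandler_StartDocument)
```

The entry index has to be counted from the interface definition in the IDL, remembering that the three IUnknown methods occupy the first three slots. Restoring the saved address before the component shuts down is a sensible precaution.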
Let's look at an example. The ISAXContentHandler interface provides a method called StartDocument. If we create a Visual Basic class wrapper that implements ISAXContentHandler (CVBSAXContentHandler), Visual Basic provides us with a handler routine prototype like the following:
Private Sub ISAXContentHandler_StartDocument()
Behind the scenes, the function prototype is really this:
Public Function ISAXContentHandler_StartDocument(ByVal This As ISAXContentHandler) As Long
In fact, the return type is HRESULT, but the Visual Basic Long data type can be used to store an HRESULT value. Therefore, to override the StartDocument method, we create a function with the latter prototype in a BAS file. It is this overriding function that is called at run time, instead of the routine in the Visual Basic Class wrapper. Note the addition of a parameter called This. It is the instance of the ISAXContentHandler handler that is being called. Here is the override function for StartDocument:
Public Function ISAXContentHandler_StartDocument(ByVal This As ISAXContentHandler) As Long
#If iDebug = -1 Then
    Debug.Print "VTable Replacement for ISAXContentHandler_StartDocument"
#End If
    Dim booAbort As Boolean
    Dim objVBSAXContentHandler As CVBSAXContentHandler
    Set objVBSAXContentHandler = This
    Call objVBSAXContentHandler.StartDocument(booAbort)
    If booAbort Then
        ISAXContentHandler_StartDocument = E_FAIL
    Else
        ISAXContentHandler_StartDocument = S_OK
    End If
End Function
There is a problem with using this technique. The AddressOf operator can be applied only to functions defined in a standard module (a BAS file). Therefore, the same function will be called for all instances of a particular handler interface. This gives us the problem of determining which instance of our handler wrapper is involved because it is the wrapper that raises the event.
Luckily, the required format of a vtable override function contains the This parameter (see previous code sample), whose type is the interface being called. From this parameter, we can perform the equivalent of QueryInterface by declaring a variable of type CVBSAXContentHandler, then making the following assignment.
Dim objCVBSAXContentHandler As CVBSAXContentHandler
Set objCVBSAXContentHandler = This
This assignment effectively obtains the Visual Basic wrapper interface. We can then call the wrapper, so that it raises the event in the expected manner.
To achieve this effect, we define a method on CVBSAXContentHandler with "Friend" scope. In this way, code internal to the ActiveX component can access the method but external clients cannot even see it. The implementation of this Friend method simply raises an event, passing all the original parameters plus the Abort flag (passed ByRef), so the code handling this event can set the flag. Here's the code:
Friend Sub StartDocument(Abort As Boolean)
    RaiseEvent StartDocument(Abort)
End Sub
Upon return from the StartDocument event handler in the client application (for example, a form), the Friend StartDocument method in CVBSAXContentHandler simply passes back the value of the Abort parameter to the vtable override function. Based on the value of the Abort flag, the vtable override can appropriately set the HRESULT return value of the underlying ISAXContentHandler.StartDocument method.
Other Handler methods can be overridden in a similar fashion. Here's the prototype for the function to override the StartElement method, which has parameters of its own.
Public Function ISAXContentHandler_StartElement(ByVal This As ISAXContentHandler, _
    pwchNamespaceUri As Integer, ByVal cchNamespaceUri As Long, _
    pwchLocalName As Integer, ByVal cchLocalName As Long, _
    pwchQName As Integer, ByVal cchQName As Long, _
    ByVal pAttributes As XMLSAX.ISAXAttributes) As Long
The Friend equivalent looks like the following:
Friend Sub StartElement(ByVal NamespaceUri As String, ByVal LocalName As String, _
    ByVal QName As String, ByVal Attributes As CVBSAXAttributes, Abort As Boolean)
    RaiseEvent StartElement(NamespaceUri, LocalName, QName, Attributes, Abort)
End Sub
Where a method has its own parameters, it is the vtable override function that performs the mapping from C++ style parameters to Visual Basic friendly strings.
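From the client's point of view, the Abort flag behaves like any other ByRef event argument. The sketch below imitates the pattern without the parser, so that it can be run on its own. CDemoHandler is a hypothetical stand-in for CVBSAXContentHandler, and its NotifyStartElement method plays the combined role of the vtable override and the Friend method; neither name comes from the download. The S_OK and E_FAIL constants use the standard HRESULT values. The two parts belong in separate files: a class module and a form.

```vb
' ----- Class module: CDemoHandler (hypothetical stand-in) -----
Option Explicit

Private Const S_OK As Long = 0
Private Const E_FAIL As Long = &H80004005

' Abort is ByRef (the VB default), so the event sink can change it.
Public Event StartElement(ByVal LocalName As String, Abort As Boolean)

' Raises the event, then maps the Abort flag to an HRESULT-style Long,
' as the vtable override function does in the wrapper component.
Public Function NotifyStartElement(ByVal LocalName As String) As Long
    Dim bAbort As Boolean

    RaiseEvent StartElement(LocalName, bAbort)

    If bAbort Then
        NotifyStartElement = E_FAIL
    Else
        NotifyStartElement = S_OK
    End If
End Function

' ----- Form module: Form1 -----
Option Explicit

Private WithEvents m_oHandler As CDemoHandler

Private Sub Form_Load()
    Set m_oHandler = New CDemoHandler

    ' The first element is allowed through; the second requests an abort.
    Debug.Print Hex$(m_oHandler.NotifyStartElement("TITLE"))
    Debug.Print Hex$(m_oHandler.NotifyStartElement("UNINTERESTING"))
End Sub

' The event sink decides whether parsing should stop.
Private Sub m_oHandler_StartElement(ByVal LocalName As String, Abort As Boolean)
    If LocalName = "UNINTERESTING" Then Abort = True
End Sub
```

Keeping the abort decision in the event sink means the wrapper itself never needs to know which elements are of interest to a particular client.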
Type Library Hacking
It turns out that, in some cases, the Xmlsax.tlb Type Library produced by compiling the supplied Xmlsax.idl file is not merely Visual Basic-unfriendly, but is in fact unusable in Visual Basic. Specifically, method parameters defined in IDL as plain [out] are unacceptable to Visual Basic. Only if a parameter is defined as [in, out] (a ByRef parameter) or [out, retval] (the return value) does Visual Basic allow it to be used. Some of the methods in the SAX2 interfaces have plain [out] parameters.
In addition, I ran into trouble calling most of the methods on the ISAXAttributes interface. For example, the GetLocalName method yields a subroutine in Visual Basic. The caller passes an index into the Attributes collection. The routine returns the by-now-familiar array of integers and array-length parameters. The IDL looks like the following:
HRESULT GetLocalName(
    [in] int nIndex,
    [out] const wchar_t ** ppwchLocalName,
    [out] int * pcchLocalName);
I knew that the [out] parameters are not acceptable in Visual Basic. Therefore, I decided to change the IDL for the [out] parameters to [in, out]. However, I wasn't sure how I was going to decode that const wchar_t ** data type. I noticed that the Hardcore Visual Basic book delivers a function called UPointerToString, which looked promising. However, UPointerToString expects a pointer (As Long in VB terms). Therefore, I decided to alter the data type from const wchar_t ** to int *. The modified IDL looks like the following:
HRESULT GetLocalName(
    [in] int nIndex,
    [in, out] int * ppwchLocalName,
    [in, out] int * pcchLocalName);
This format turns out to be acceptable in Visual Basic. The string value is decoded successfully with UPointerToStringEx (same as UPointerToString, but it takes the character count as a parameter).
The methods on ISAXAttributes can now be called as a subroutine:
Call pAttributes.GetValue(0, pwchValue, cchValue)
The above call gets the value for the attribute at index position 0, storing a pointer to the value in pwchValue and the length of the value in cchValue. The VB string is then built with the following code (similar to the conversions shown earlier).
sAttValue = String$(cchValue, 0)
CopyMemory ByVal StrPtr(sAttValue), ByVal pwchValue, cchValue * 2
Note that the pwchValue must be prefaced with ByVal.
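The pointer-plus-count conversion can be wrapped in a small helper so that each attribute method needs only one extra line. The following is a minimal sketch, assuming 32-bit Visual Basic 6 and a standard module. PointerToStringN and DemoPointerToString are hypothetical names chosen for illustration; this is not the UPointerToStringEx function itself. The demo uses StrPtr on an ordinary string to stand in for the pointer that the parser would supply.

```vb
Option Explicit

Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" _
    (pDest As Any, pSource As Any, ByVal ByteLen As Long)

' Builds a VB string from a pointer to Unicode characters and a count.
Public Function PointerToStringN(ByVal pwch As Long, ByVal cch As Long) As String
    Dim sText As String

    ' Assumption for this sketch: a null pointer or empty count yields "".
    If pwch = 0 Or cch <= 0 Then Exit Function

    sText = String$(cch, 0)
    ' pwch already holds an address, so it is passed ByVal; passing it
    ' ByRef would copy from the Long variable itself instead.
    CopyMemory ByVal StrPtr(sText), ByVal pwch, cch * 2
    PointerToStringN = sText
End Function

' Hypothetical demo: StrPtr supplies a pointer to Unicode characters in
' the same form that the modified GetValue call hands back.
Public Sub DemoPointerToString()
    Dim sSource As String

    sSource = "isbn"
    Debug.Print PointerToStringN(StrPtr(sSource), Len(sSource))
End Sub
```

With such a helper in place, the two lines that follow a call like GetValue collapse into a single assignment, for example sAttValue = PointerToStringN(pwchValue, cchValue).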
I copied the Xmlsax.idl file, called it XmlsaxVB.idl, then modified the copy before finally compiling the modified copy into XmlsaxVB.tlb. It is this modified Type Library that is referenced in the ActiveX component project included in the download.
Of course, modifying IDL files is not recommended as a general rule. If you do modify one, it is strongly recommended that the resulting Type Library never leave the computer you are developing on. If you create a Setup program with the Package and Deployment Wizard, do not include such a Type Library in the Setup package. However, since the MSXML component discussed in this article is a Technology Preview, there's no danger that anyone will try to ship this example in a production environment.