Microformats and Open XML

Like most people, I hadn't heard of microformats a year ago, but the concept now seems to be gathering momentum. Microformats have their limitations, but they offer a practical way to solve a common interoperability problem: how to add structured data to existing documents (typically HTML) without changing the underlying schema or breaking existing implementations. The basic concept is that a microformat is a set of agreed-upon "class" attribute values that can be added to spans (or other elements) in an HTML page to tag content with semantic meaning. For example, <span class="family-name">Mahugh</span>.

There have been microformats defined for reviews (hReview), calendar items (hCalendar), business cards (hCard), and other applications, and microformat-tagged content is starting to appear on many web pages. Search engines, aggregators, and other types of software can use the microformat to determine what a piece of information means in a particular context (what's being reviewed, when the meeting is scheduled, where the email address is in a business card) without any impact on the visual rendering of the HTML content itself.
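To make the idea concrete, here is a minimal Python sketch of how a consumer might pull hCard values out of a page by looking only at class attributes. It is not a full microformat parser: the property list is a hand-picked subset, and the names HCardExtractor and extract_hcard are made up for this illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, hand-picked subset of hCard property names.
HCARD_PROPERTIES = {"fn", "given-name", "family-name", "street-address",
                    "locality", "region", "postal-code", "tel"}
# Elements that never have a closing tag, so they are kept off the stack.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class HCardExtractor(HTMLParser):
    """Collects the text inside elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._open = []  # one entry per open element: the properties it carries

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._open.append([c for c in classes if c in HCARD_PROPERTIES])

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that carries a property.
        for properties in self._open:
            for name in properties:
                self.fields[name] = self.fields.get(name, "") + data


def extract_hcard(html):
    """Return a dict of hCard property name -> whitespace-normalized text."""
    parser = HCardExtractor()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}
```

Note that the extractor never looks at tag names or nesting depth, only at classes, which is why the tagging has no effect on how the page is rendered.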

I won't go into the details of microformats here (there's plenty of information over on microformats.org if you're interested), but I wanted to use the hCard microformat as an example of how Open XML's custom schema support allows binding of structured document tags (or content controls, as they're called in Word) to arbitrary nodes in a custom XML part.

Mr. Doug Mahugh
One Microsoft Way
Redmond, WA 98052
Phone: +1-425-882-8080

Microformat sample: hCard

Consider, for example, the hCard shown here. This sample is just a DIV containing my contact info, with some minimal CSS styling in a style attribute and the appropriate hCard attributes added to tag the content with its meaning. If you take a look at the HTML on this page (view source and search for the phone number or something), you'll see that the surrounding div has a class of vcard, and then the content within is tagged as follows:

An hCard as a custom XML part

Now this DIV is just some XML (XHTML). And as I've mentioned before, we can put any well-formed XML in an Open XML document as a custom XML part, and then do creative things with that XML to enable interoperability between various systems. For example, we can put that DIV into a custom XML part, and then bind content controls to the hCard fields (as identified by their classes in the markup above).

I've attached some sample documents that demonstrate this concept. For example, here's a screen shot of content controls bound to the nodes from the hCard-format custom XML part:

Those content controls have two-way binding to the nodes in the custom XML part. For example, you can correct a typo in my name and then save the DOCX, and your correction is written to the appropriate node in the custom XML part. Or you can replace the custom XML part with a different hCard, and that hCard's data will appear in the content controls.
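Swapping the part can also be done outside of Word, since a DOCX is a ZIP package. The following Python sketch copies a package and substitutes new content for one part. The part path customXml/item1.xml is an assumption (it is the name Word conventionally gives the first custom XML part; check the package relationships in a real document), and replace_custom_xml is a hypothetical helper name.

```python
import zipfile

# Assumption: the custom XML part lives at this path inside the package.
CUSTOM_XML_PART = "customXml/item1.xml"


def replace_custom_xml(src_docx, dst_docx, new_xml, part_name=CUSTOM_XML_PART):
    """Copy src_docx to dst_docx, writing new_xml (bytes) in place of one part.

    ZIP entries cannot be overwritten in place, so every other part is
    copied unchanged into a fresh package.
    """
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        if part_name not in src.namelist():
            raise KeyError(f"{part_name} not found in package")
        for info in src.infolist():
            if info.filename == part_name:
                data = new_xml
            else:
                data = src.read(info.filename)
            dst.writestr(info.filename, data)
```

Because the bindings live elsewhere in the package, nothing but the one part needs to change for the content controls to show the new hCard when the document is next opened.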

As a simple example of this concept, you can go to this Live Clipboard demo page and grab a sample hCard from there. Just right-click the orange icon next to any of the names listed, and select Copy. Then you can paste that text into the custom XML part in the attached sample document, and the content controls will now be bound to that hCard. For example, here's what you'll see if you paste in the first example from that page (this is hCard2.docx in the attached samples):

And here's the hCard source data for that example:

You'll notice the syntax isn't identical to the previous example, but the same hCard class attributes are used, and therefore all of the data-binding works the same in both instances. So here we have two simple examples of the same document, with its same visual presentation, providing an interactive editing capability for different instances of business data.

This ability to swap out custom XML parts enables a variety of development scenarios for Open XML solutions. For example, a custom XML part can be generated by some type of system, packaged in an Open XML document, and then travel with the document as a "data payload" that can be displayed and edited through the document interface. Later in the business process, the custom XML part can be extracted (by any programming environment that supports ZIP packages) and passed on to other systems as a clean instance of business data with none of the Open XML schemas included in it. Custom XML parts allow for simple and consistent separation of business data and document-formatting information.
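The extraction step needs nothing more than a ZIP library. As a sketch in Python, assuming the custom XML parts follow Word's conventional customXml/itemN.xml naming (a stricter implementation would follow the package relationships instead), and with extract_custom_xml_parts being a made-up helper name:

```python
import re
import zipfile

# Assumption: custom XML parts are named customXml/item1.xml, item2.xml, ...
# The pattern deliberately skips the companion itemPropsN.xml parts.
ITEM_PART = re.compile(r"customXml/item\d+\.xml$")


def extract_custom_xml_parts(docx):
    """Return {part name: raw XML bytes} for each custom XML part found."""
    with zipfile.ZipFile(docx) as package:
        return {
            name: package.read(name)
            for name in package.namelist()
            if ITEM_PART.match(name)
        }
```

The bytes returned are exactly the business data that went in (or its edited form), with no WordprocessingML mixed in.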

Technical details

If you're interested in understanding the binding mechanism that connects custom XML parts to content controls, take a look at the markup in the attached sample document. There are two aspects involved: a GUID for identifying the custom XML part (since there could be more than one), and the XPath expression for selecting the particular node within the custom XML part. The GUID is stored in a separate "custom XML properties" part, which has a relationship to the custom XML part.
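To show how the two pieces line up, here is a small Python sketch that reads the binding from a structured document tag and the ID from a custom XML properties part. The two XML fragments are illustrative only: the GUID and the XPath are invented for this sketch and are not copied from the attached samples. The element and attribute names (w:dataBinding with w:xpath and w:storeItemID, and ds:datastoreItem with ds:itemID) are the ones WordprocessingML uses for this purpose.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DS = "http://schemas.openxmlformats.org/officeDocument/2006/customXml"

# Illustrative structured document tag; GUID and XPath are made up.
SDT = """<w:sdt xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:sdtPr>
    <w:dataBinding w:xpath="//*[@class='family-name']"
                   w:storeItemID="{00000000-0000-0000-0000-000000000001}"/>
  </w:sdtPr>
  <w:sdtContent><w:r><w:t>Mahugh</w:t></w:r></w:sdtContent>
</w:sdt>"""

# Illustrative custom XML properties part carrying the same made-up GUID.
ITEM_PROPS = """<ds:datastoreItem
    xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml"
    ds:itemID="{00000000-0000-0000-0000-000000000001}"/>"""


def read_binding(sdt_xml):
    """Return (store item ID, XPath) from a structured document tag."""
    binding = ET.fromstring(sdt_xml).find(f".//{{{W}}}dataBinding")
    return binding.get(f"{{{W}}}storeItemID"), binding.get(f"{{{W}}}xpath")


def read_item_id(props_xml):
    """Return the ID declared by a custom XML properties part."""
    return ET.fromstring(props_xml).get(f"{{{DS}}}itemID")
```

A content control is tied to a custom XML part when the first value returned by read_binding matches the value returned by read_item_id for that part's properties.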

This approach makes the custom XML part itself entirely independent of the details of the binding. The XPath expressions are stored in the structured document tags within the document body, and the GUID is stored in the custom XML properties part. So the custom XML part itself can be replaced at any time, and the new part will populate the content controls as we saw above. And because the XPaths are only looking for a vcard with certain classes inside, the binding is very flexible and tolerant of changes in the format of the custom XML part.

This is important in our microformat example because not all web pages are structured the same way, and the path to a piece of microformat-tagged data can vary considerably between two different HTML documents. For example, one person might put their first and last names in different cells in a table, whereas another person might put them in a single paragraph, and yet another person might put them in separate DIVs. But as long as they're tagged with the appropriate microformat attributes, the data values can be mapped to content controls in a manner that will work with any of these variations.
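A quick way to see this tolerance is to run the same class-based lookup against two differently structured cards. The Python sketch below uses the table and paragraph variations described above. The XPath pattern is an assumption about how such a binding could be written, not necessarily the expression used in the attached samples, and hcard_value is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Names in separate table cells.
TABLE_CARD = """<div class="vcard"><table><tr>
  <td class="given-name">Doug</td><td class="family-name">Mahugh</td>
</tr></table></div>"""

# Names in a single paragraph.
PARAGRAPH_CARD = """<div class="vcard"><p>
  <span class="given-name">Doug</span> <span class="family-name">Mahugh</span>
</p></div>"""


def hcard_value(card_xml, prop):
    """Find the first descendant whose class is exactly prop, at any depth."""
    root = ET.fromstring(card_xml)
    node = root.find(f".//*[@class='{prop}']")
    return None if node is None else "".join(node.itertext()).strip()


for card in (TABLE_CARD, PARAGRAPH_CARD):
    print(hcard_value(card, "given-name"), hcard_value(card, "family-name"))
```

One limitation of this simple pattern: it compares the whole class attribute, so an element tagged with several classes (for example class="fn n") would need a token-aware test rather than an exact match.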

Next up on our tour of Open XML's custom schema support will be custom content tagging. I mentioned this briefly in my last post, but I'd like to cover this concept in more detail because it's a powerful enabler of interoperability between Open XML documents and XML-aware business applications. I'll cover that later this week.