WordprocessingML Document Model

I thought it might be worthwhile to give a bit of an overview of the WordprocessingML model that you see in the Open XML standard. There are some people who've played with other formats like HTML or DocBook that are curious why WordprocessingML doesn't use that same model as either of those formats, and there's actually a pretty straightforward reason. If you find this interesting, you might also want to check out the Introduction to Word document post I made last year.

WordprocessingML was built with the legacy base of Word documents in mind and it follows the original model you see in those documents pretty closely. This is why Open XML exists. There are other formats out there, but Open XML is the only one that was designed with compatibility with legacy Office documents in mind. We actually had tried other approaches in the past, such as the original SpreadsheetML format in Office XP, and the HTML support that was in Office 2000. In each of those cases though we tried to convert existing binary documents into these new models it quickly showed itself to be fairly problematic. Just take a look at the HTML output from Word for any complex document if you don't know what I'm talking about. In the original SpreadsheetML format we actually used the ISO date format that you're seeing a lot of the Anti-OpenXML folks claim we should use in Open XML. That also proved to be very problematic both from a formula fix-up issues as well as general loading performance, which is why we went with a simple integer value and kept the 1900 leap year bug.

Back to Word though... with WorprocessingML, the model we use is actually a very flat model. Jim Mason actually talks about this in the Open XML discussions going on in the US V1 committee. You can read this dialog in the archive that Sun's Jon Bosak set up: http://www.ibiblio.org/bosak/v1mail/200706/2007Jun05-132133.eml Here's some of what Jim had to say:

Word came to Microsoft from a program orignially developed at Xerox PARC. PARC was very much into object-oriented programming, like Smalltalk, and Word partakes of that spirit. It's also related to stack-oriented, postfix-operator programming languages. Patrick mentions "sections": in Word, a section is just a fairly high-level container to which properties, like margins or page numbering, can be attached. So think of text entry as shoving bytes on a stack. Then at some point you put in an operator that pops the stack and applies itself to everything that comes off. Nesting of sections is somewhat limited; they tend to pop the stack.

I've spent a lot of time working with folks who were long-term users of WordPerfect and who were getting frustrated making the transition to Word. All of them were used to running WordPerfect in split-screen "reveal codes" mode. That's easy when a file is linear and text and codes just came one thing after another. They were always looking for a similar mode in Word and couldn't understand that Word didn't have any such linear structure. They were also used to seeing margins set first, rather than as a postfix property of a whole section after the section content was complete, and they couldn't understand how Word was doing things.

Another aspect of Word's data structure, I've come to recognize, is that a lot of it is implicit. All the data necessary to create the rendition of a Word file in Word itself is in the file, but it's not easily extracted. It becomes evident only when the file is instantiated in the program, which knows how to deal with the implicit parts.

In XML we're used to clearly marked (tagged) containment and hierarchies. We think, for example, of having a <sect1> that contains a <title> and a bunch of <para> elements. If we have a <sect2>, it's clearly delimited inside the <sect1>. That's the sort of thing we see in applications like DocBook (Norm, correct me if you need to).

Of course there have been unstructured document models, like HTML, that tagged only locally, and you could put <H1> and <P> and <H2> in the file in any crazy order. Word appears to be like that: you can put in text and apply styles in any crazy order, so you can have the first paragraph of a document in "Normal" style, then one in "Heading2", one in "NumberedList" and then one in "Heading1". But just as a Web browser attempts to recover from any crazy codes you throw at it, Word attempts to build -- in memory -- a structure from the styles. You can see this structure if you put Word into "Outline View" rather than "Layout View" or "Normal View". This structure acts pretty much like the hierarchical structure with proper containment that we expect from the usual XML application. But it is, so far as I can tell, something that happens only when the file is in memory, and it's driven entirely from the styles. (By default, it works with Word's predefined Heading1, Heading2, etc., styles, though you can map user-generated styles to act like those.)

Jim is pretty close here, but is a bit off on this last part that I've italicized. Word never creates this hierarchy.  Word is always a streaming (cp-based) format.  The "outline" view is an artificial inference of the structure.

So the question isn't whether or not it's a conventional XML structure. Both ODF and Open XML are XML structures; one is flat and the other has a deeper hierarchy (or at least attempts to do that).

The WordprocessingML format represents a stream of content (the data), and the formatting associated with it. Word does not work on this data in a hierarchical manner, nor does it infer a hierarchy when working with it. As such, there is no hierarchy stored in the file format. The way that you impose any type of hierarchy or semantics is through the use of structured document tags (SDTs) like content controls, custom XML, etc.. That hierarchy will then be reflected in the document content and in the file format.

If you intend to use wordprocessingML as a pure data interchange format, and you want the data to be hierarchical in nature, then you will want to use the SDTs in your document for this hierarchy. We actually do this today in our workflows in Microsoft, such as our spec library where we leverage the SDTs to structure the specs for easy interrogation of the spec collection.

Other approaches folks have used to get semantics out of the document would be through the use of styles. Remember though that the Styles are flat since they are just a property of the paragraph or run of text.

The vital thing to understand is formatting itself should not be viewed as structure. The "view" of the data is not PART of the data. The "view" is separate. The fact that you have Heading 2 after heading 1 does not imply a structural relationship between the 2 headings – merely that they LOOK different. In a world that espouses the separation of data and view, this is a great model. There is no attempt to try to invent some hierarchical representation based on the view of the data.

There are places where Word actually attempts at runtime to give the user an impression of hierarchy based simply on the formatting, but this is artificial. As I said, Word never actually creates this hierarchy.  WordprocessingML is always a streaming format.  The "outline" view is an artificial inference of the structure that can be provided by the application.

This is why we've tried to stay really clear on the goals of Open XML. We needed to create and XML format that developers could use, but also one that our customers could move their existing binary documents into without fear of their documents been negatively affected. This is also why we think that choice of formats is a good thing. Office (the application team) knew that we needed Open XML in order to best meet our customer's current needs. But as other applications come along or if customers actually express an interest in another format, we will help improve the translation tools out there so people have that choice.

This also shows why ISO standardization is important. These formats meet an important need, and if you look at the numbers of downloads, you'll see that the number of Open XML documents out there will continue to grow. Over the coming years we will see millions of Open XML documents out there. If ISO approves the Open XML standard, then it helps to ensure that everyone will have long term access to the specs that help them read and write these files.

Open XML Community Quote of the day

Captaris Inc — United States

"Captaris has customers all over the globe who capture, manage and deliver documents that support vital business processes. Consequently they rely on international document standards to insure the open access, portability, security and long-term retrieval of the information. Our customers also want to have a choice in open document formats. On behalf of our customers, we support Open XML and believe it should be ratified by the ISO national standards bodies."

- Dan Lucarini, Senior Director