So what’s the fuss about the new Open XML file formats?

You've probably been hearing about the new Office 2007 file formats and that the fact that they are based on Open XML is a great benefit. But why? Why should you, as a developer, care? Are they nothing more than a compressed file format? Actually, they're much, much more than that. They're an incredibly powerful building block on which you can base your Office Business Applications. Let me explain in more detail…

First of all, Word, Excel and PowerPoint 2007 files are now containers: each one is actually a compressed ZIP package (just try changing the extension to .zip and you'll see what I mean). To an end user, the file still looks like a single item, but to the developer, the file is a package of parts, organized in a logical tree structure and tied together by relationships that you can navigate. No longer do you have the black box of the binary file formats from previous versions. So knowing this, what are some of the benefits?
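
To make the "package of parts" idea concrete, here is a minimal Python sketch using only the standard library. It builds a tiny stand-in package in memory (not a complete Word file, so the example runs without any Office document on disk), lists its parts, and reads the root relationships. The function names are my own for illustration; with a real .docx you would pass the file's bytes instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DOC_REL_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (not a complete Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", f"<Types xmlns='{CT_NS}'/>")
        zf.writestr(
            "_rels/.rels",
            f"<Relationships xmlns='{REL_NS}'>"
            f"<Relationship Id='rId1' Type='{DOC_REL_TYPE}' "
            "Target='word/document.xml'/>"
            "</Relationships>",
        )
        zf.writestr("word/document.xml", f"<w:document xmlns:w='{W_NS}'/>")
    return buf.getvalue()


def list_parts(package: bytes) -> list[str]:
    """Return the names of all entries in the ZIP package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.namelist()


def root_relationships(package: bytes) -> list[tuple[str, str]]:
    """Return (Id, Target) pairs from the package-level relationships part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Target"))
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    ]


package = build_sample_package()
print(list_parts(package))
print(root_relationships(package))
```

Following the relationship targets, rather than guessing part names, is how you navigate from the package root to the main document part.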

Interoperability – Because of the open standard of the ECMA Open XML file formats, you can do things like generate files from Office 2007 documents on a non-Microsoft platform like Linux. Take for example a partner called Sonata, who created an awesome solution for the Linux platform where they took an Office 2007 Word document and then using XSLT, converted it to HTML and published it.

Mitigation of File Corruption – Since we now have a segmented architecture, if a part gets corrupted, the other parts of the package should still be safe. For example, if your style part is corrupted, you will still be able to open your document; it just won't look as nice. Also, corruption tends to occur as a result of truncation. Because we are no longer working with a black box, you can mitigate data loss by putting the most important information at the top of the XML files in the package parts.

Security – Security is greatly improved as a result of the segmented architecture. Macros now have their own parts within the package, and it is easy to separate this portion from the content to better manage security issues. This is why we have the new macro-enabled file formats such as .xlsm, .pptm and .docm, as well as the macro-enabled template versions. Only these formats can contain macros.
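
As a rough illustration of why a separate macro part helps, the Python sketch below checks whether a package carries a VBA project. It assumes the macro part uses the conventional name vbaProject.bin; a more robust check would inspect the content types and relationships instead. The sample packages are built in memory and are stand-ins, not complete Office files.

```python
import io
import zipfile


def build_package(with_macros: bool) -> bytes:
    """Build a tiny stand-in package, optionally with a placeholder macro part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document/>")
        if with_macros:
            # Placeholder bytes only; a real VBA project is a binary stream.
            zf.writestr("word/vbaProject.bin", b"\x00placeholder")
    return buf.getvalue()


def has_macro_part(package: bytes) -> bool:
    """Return True if any part is named like the conventional VBA project part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return any(
            name.lower().endswith("vbaproject.bin") for name in zf.namelist()
        )


print(has_macro_part(build_package(with_macros=True)))
print(has_macro_part(build_package(with_macros=False)))
```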

Digital Signatures – Now you have more power over what you sign within a document. You can digitally sign the packages using X.509 certificates, and you can sign all parts of the package, including even the digital signatures themselves. Imagine that, as part of a workflow, you were to sign that a certain portion of content has or has not changed, or perhaps you want to digitally sign just the macro part. There are some very interesting scenarios here.


Developer Scenarios

Talking about scenarios, let's get into some developer scenarios. First of all, I would suggest that you couple these scenarios with a workflow in MOSS so that the solution drives itself and stays streamlined within the business framework.

Styling Content:

Because styles are in their own separate part, you can just go into the style part of the package and manipulate it. Imagine you are an enterprise that has decided to change its logo, and you have hundreds of thousands of documents in your repository. Since the files are no longer binary black boxes, you can now go in and easily swap out the logo in all of the documents, since all Office documents have a common structure.
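
Here is a minimal Python sketch of the logo swap, assuming the logo lives in a part named word/media/image1.png (a hypothetical name; in a real document you would find the image part through the relationships). The helper copies every part into a new package and replaces the content of just one. The same helper would work for the styles part.

```python
import io
import zipfile

LOGO_PART = "word/media/image1.png"  # hypothetical part name for the logo image


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (not a complete Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr(LOGO_PART, b"old-logo-bytes")
    return buf.getvalue()


def replace_part(package: bytes, part_name: str, new_data: bytes) -> bytes:
    """Return a copy of the package with one part's content swapped out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = new_data if name == part_name else src.read(name)
            dst.writestr(name, data)
    return out.getvalue()


def read_part(package: bytes, part_name: str) -> bytes:
    """Read the raw bytes of a single part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(part_name)


original = build_sample_package()
updated = replace_part(original, LOGO_PART, b"new-logo-bytes")
print(read_part(updated, LOGO_PART))
```

Running this over a repository is then just a loop over the files, since every document shares the same package structure.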

Content Inspection:

  • Remove confidential information, tracked changes or metadata from outbound documents.
  • Remove macros, inappropriate language, or other content from inbound documents.
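
As one example of outbound inspection, this Python sketch strips the author and last-modified-by fields from the core properties part (docProps/core.xml). The sample package and the choice of fields to drop are my own assumptions for illustration; tracked changes and other metadata would need similar handling in their own parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"
CORE_PART = "docProps/core.xml"
# Fields treated as confidential in this sketch (an assumption, adjust as needed).
SENSITIVE_TAGS = (f"{{{DC_NS}}}creator", f"{{{CP_NS}}}lastModifiedBy")


def build_sample_package() -> bytes:
    """Build a tiny stand-in package with a core properties part."""
    core = (
        f"<cp:coreProperties xmlns:cp='{CP_NS}' xmlns:dc='{DC_NS}'>"
        "<dc:creator>Jane Doe</dc:creator>"
        "<cp:lastModifiedBy>Jane Doe</cp:lastModifiedBy>"
        "<dc:title>Q3 Report</dc:title>"
        "</cp:coreProperties>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr(CORE_PART, core)
    return buf.getvalue()


def scrub_core_properties(core_xml: bytes) -> bytes:
    """Remove the sensitive elements and return the cleaned XML."""
    ET.register_namespace("cp", CP_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.fromstring(core_xml)
    for child in list(root):
        if child.tag in SENSITIVE_TAGS:
            root.remove(child)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def scrub_package(package: bytes) -> bytes:
    """Copy the package, cleaning the core properties part along the way."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == CORE_PART:
                data = scrub_core_properties(data)
            dst.writestr(name, data)
    return out.getvalue()


def read_part(package: bytes, part_name: str) -> bytes:
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return zf.read(part_name)


cleaned_core = read_part(scrub_package(build_sample_package()), CORE_PART)
print(cleaned_core.decode("utf-8"))
```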

Consuming Documents:

  • One example is where a user creates an expense report where the data is loaded into a back-end system on the server.
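
A minimal Python sketch of the expense report case: pull the numeric cells out of a worksheet part so they can be handed to a back-end system. The sheet layout and the part name xl/worksheets/sheet1.xml are assumptions for the example, and cells that reference shared strings (t="s") are simply skipped here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

S_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
SHEET_PART = "xl/worksheets/sheet1.xml"  # assumed location of the first sheet


def build_expense_workbook() -> bytes:
    """Build a stand-in workbook package containing only one worksheet part."""
    sheet = (
        f"<worksheet xmlns='{S_NS}'><sheetData>"
        "<row r='1'><c r='A1' t='s'><v>0</v></c><c r='B1'><v>120.50</v></c></row>"
        "<row r='2'><c r='A2' t='s'><v>1</v></c><c r='B2'><v>79.5</v></c></row>"
        "</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(SHEET_PART, sheet)
    return buf.getvalue()


def read_numeric_cells(package: bytes) -> dict[str, float]:
    """Map cell references (e.g. 'B1') to their numeric values."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(SHEET_PART))
    values = {}
    for cell in root.iter(f"{{{S_NS}}}c"):
        # Cells without a type attribute are numeric; skip strings, booleans, etc.
        if cell.get("t", "n") != "n":
            continue
        value = cell.find(f"{{{S_NS}}}v")
        if value is not None and value.text:
            values[cell.get("r")] = float(value.text)
    return values


expenses = read_numeric_cells(build_expense_workbook())
print(expenses, sum(expenses.values()))
```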

Document Assembly:

This is the most common Open XML development scenario. Imagine the following:

  • Creating sales reports from financial and forecast data stored in a CRM system through the Business Data Catalog in MOSS
  • Sales Forecasting scenario where an executive report in Word format is generated server-side on the fly
  • The user goes to a web application and indicates a number of fragments, parts that he or she wants to see assembled in a document
  • Create documents from the data pulled from Forms Server (online InfoPath form in MOSS)
  • Create pitchbooks from Slide Server (PowerPoint slide libraries in MOSS)

Note: In the business layer, I suggest using some kind of template to give yourself a head start.
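
To sketch the template idea in Python: start from a template package whose main document part contains placeholders, then fill them in on the server. The {customer} and {forecast} placeholder syntax is my own invention for this example, and it assumes each placeholder sits inside a single text run; Word often splits text across runs, so a real solution would more likely use content controls or custom XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DOC_PART = "word/document.xml"


def build_template() -> bytes:
    """Build a stand-in template package (not a complete Word file)."""
    document = (
        "<w:document xmlns:w='" + W_NS + "'><w:body>"
        "<w:p><w:r><w:t>Sales report for {customer}</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Forecast: {forecast}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(DOC_PART, document)
    return buf.getvalue()


def fill_placeholders(document_xml: bytes, values: dict[str, str]) -> bytes:
    """Replace {key} placeholders inside text runs with the given values."""
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(document_xml)
    for text_node in root.iter(f"{{{W_NS}}}t"):
        text = text_node.text or ""
        for key, value in values.items():
            text = text.replace("{" + key + "}", value)
        text_node.text = text
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def assemble(template: bytes, values: dict[str, str]) -> bytes:
    """Copy the template package, filling the main document part."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == DOC_PART:
                data = fill_placeholders(data, values)
            dst.writestr(name, data)
    return out.getvalue()


def paragraph_texts(package: bytes) -> list[str]:
    """Return the plain text of each paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(DOC_PART))
    return [
        "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        for p in root.iter(f"{{{W_NS}}}p")
    ]


report = assemble(build_template(), {"customer": "Contoso", "forecast": "up 4%"})
print(paragraph_texts(report))
```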

Custom XML Markup:

  • Tagging document content with custom semantics for processing by a back-end system.
  • This allows the meaning of the data to be defined separately from its presentation, allowing for more robust solutions and simpler programming.
  • Note that custom tagging can be done with or without an associated schema, and with or without a custom XML data store.
  • It's very valuable to be able to go and mark up a document with XML that's in your namespace so you can really identify what the meaningful regions of that document are. This way you can act on that data so that you can actually go and get the information in or out.
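
Here is a small Python sketch of getting information out of a custom XML part. The namespace urn:example:expense-report and its element names are hypothetical, and the part location customXml/item1.xml is assumed; the point is only that the back end reads data in your own vocabulary without touching the presentation markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

EXPENSE_NS = "urn:example:expense-report"  # hypothetical custom namespace
CUSTOM_PART = "customXml/item1.xml"  # assumed location of the custom XML part


def build_sample_package() -> bytes:
    """Build a stand-in package holding one custom XML part."""
    custom = (
        f"<exp:report xmlns:exp='{EXPENSE_NS}'>"
        "<exp:employee>A. Datum</exp:employee>"
        "<exp:amount>125.00</exp:amount>"
        "</exp:report>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr(CUSTOM_PART, custom)
    return buf.getvalue()


def read_expense_fields(package: bytes) -> dict[str, str]:
    """Return the child elements of the custom XML root as a name-to-text map."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(CUSTOM_PART))
    prefix = f"{{{EXPENSE_NS}}}"
    return {
        child.tag[len(prefix):]: (child.text or "")
        for child in root
        if child.tag.startswith(prefix)
    }


fields = read_expense_fields(build_sample_package())
print(fields)
```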

Document Merge:

  • Aggregate data from several documents into one
  • Merge two independent copies into one
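
A deliberately simplified Python sketch of aggregating two Word documents: append the body content of the second main document part to the first, keeping the section properties last. It works on the document XML only and ignores styles, numbering, images and relationships, all of which a real merge would have to reconcile.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # tag prefix in ElementTree's {namespace}name notation


def make_document(text: str) -> bytes:
    """Build a stand-in main document part with one paragraph."""
    return (
        "<w:document xmlns:w='" + W_NS + "'><w:body>"
        "<w:p><w:r><w:t>" + text + "</w:t></w:r></w:p>"
        "<w:sectPr/>"
        "</w:body></w:document>"
    ).encode("utf-8")


def merge_bodies(first_xml: bytes, second_xml: bytes) -> bytes:
    """Append the second document's body content to the first document."""
    ET.register_namespace("w", W_NS)
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    body = first.find(W + "body")
    # Section properties, when present, stay as the last child of the body.
    insert_at = len(body)
    if insert_at and body[-1].tag == W + "sectPr":
        insert_at -= 1
    for child in second.find(W + "body"):
        if child.tag == W + "sectPr":
            continue
        body.insert(insert_at, child)
        insert_at += 1
    return ET.tostring(first, encoding="utf-8", xml_declaration=True)


def paragraph_texts(document_xml: bytes) -> list[str]:
    """Return the plain text of each paragraph."""
    root = ET.fromstring(document_xml)
    return [
        "".join(t.text or "" for t in p.iter(W + "t"))
        for p in root.iter(W + "p")
    ]


merged = merge_bodies(make_document("Intro"), make_document("Appendix"))
print(paragraph_texts(merged))
```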

Document Validation and Debugging:

  • Validate structure of parts
  • Validate XML against schema and for well-formedness
  • Validate document by type or class
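
The structural and well-formedness checks can be sketched in Python as below. This only verifies that the content types part exists and that every XML part parses; validating against the Open XML schemas would need a schema-aware library, which the Python standard library does not provide. The sample packages are stand-ins built in memory.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_package(parts: dict[str, str]) -> bytes:
    """Build a stand-in package from a name-to-content mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in parts.items():
            zf.writestr(name, content)
    return buf.getvalue()


def check_package(package: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" not in names:
            problems.append("missing [Content_Types].xml")
        for name in names:
            if name.endswith((".xml", ".rels")):
                try:
                    ET.fromstring(zf.read(name))
                except ET.ParseError as exc:
                    problems.append(f"{name}: not well-formed ({exc})")
    return problems


good = build_package({
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document><body/></document>",
})
bad = build_package({
    "word/document.xml": "<document><body></document>",  # unclosed <body>
})
print(check_package(good))
print(check_package(bad))
```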


VSTO and Open XML

VSTO abstracts working with Open XML so that you never have to go through the XML and navigate the tree structure yourself. It makes debugging and deployment of your solution on the server much easier. With "Orcas", the next version of Visual Studio Tools for Office, there is support for the Office 2007 file formats through the ServerDocument class, which allows you to manipulate the cached data in the file in an abstracted way. Rather than dealing with raw XML, you can work with business objects to manipulate the data within Office 2007 files in both your client-side and server-side applications, where debugging and deployment are simpler and more integrated.



This past week at TechEd Orlando, Microsoft announced the Open XML SDK CTP. It provides classes that offer higher levels of abstraction for working with Open XML, along with code samples and How-To articles. For a great overview, check out the Channel9 video with Chris Bryant.

There are some awesome How-To's on the Office Developer How-To Center with videos, articles and code samples. More to come as well.

Of course, there is also the non-Microsoft site, which is a wealth of information from a vast Open XML community.

For you Java developers, check out the article on Using Java to crack Office 2007.

And here are some others:

MSDN XML in Office Developer Portal

How To Manipulate Office Open XML File Formats

Word 2007 Content Control Toolkit