Office Space

Building Office Open XML Files

Ted Pattison

Code download available at: Office Space 2007_02.exe(338 KB)

Contents

Generating a .docx File
A Closer Look at Relationships
The Package Viewer Sample Application
New Developer Features in Word 2007
Separating Data from Presentation in Word 2007
Updating the XML Data Store Programmatically
Summary

Welcome to Office Space, a new column devoted to developing with and for the Microsoft® Office system. In this column, I will be exploring how you can extend and customize Microsoft Office system applications and file formats. With the 2007 Microsoft Office system, Microsoft has introduced new file formats for Word, Excel®, and PowerPoint® based on the Office Open XML File Format specification. Office documents stored in this new format are structured inside ZIP archives known as packages. Inside a package, the actual content is stored in components known as parts. Parts are typically stored as internal XML documents whose content is structured in accordance with published XML schemas.

My November 2006 Basic Instincts column introduced the fundamental concepts of the Office Open XML File Formats. I also discussed how to get up and running with the new packaging API that is part of the Microsoft .NET Framework 3.0. If you are new to working with the Office Open XML File Formats you should read my November column before continuing here because this column builds on the topics I discussed there.

Generating a .docx File

In the November Basic Instincts column, I also walked through writing the code required to generate your first .docx file. I have included a skeleton of the basic code required in order to review the basic steps (see Figure 1). First, you must create a new package file using the Package class exposed by the WindowsBase assembly that is part of the .NET Framework 3.0. Second, you must create one or more parts within the package and write into them whatever content these parts require. In the case of a simple .docx file, all that is required is that you create the part with a Uniform Resource Identifier (URI) of /word/document.xml and write WordprocessingML content into this part to create the classic "Hello World" Word document.

Figure 1 Generating a Basic .docx File

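'*** requires a reference to WindowsBase.dll and
'*** Imports System.IO, System.IO.Packaging, System.Xml
Dim pack As Package = Nothing
pack = Package.Open("basic.docx", FileMode.Create, FileAccess.ReadWrite)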

'*** create main document part (document.xml) ...
Dim uri As Uri = New Uri("/word/document.xml", UriKind.Relative)
Dim partContentType As String
partContentType = "application/vnd.openxmlformats" & _
                  "-officedocument.wordprocessingml.document.main+xml"
Dim part As PackagePart = pack.CreatePart(uri, partContentType)

'*** get stream for document.xml
Dim streamPart As New StreamWriter( _
    part.GetStream(FileMode.Create, FileAccess.Write))

Dim xmlPart As XmlDocument = New XmlDocument()
'*** code to generate 'Hello World' WordprocessingML omitted for clarity
xmlPart.Save(streamPart)

streamPart.Close()
pack.Flush()

'*** create the relationship part
Dim relationshipType As String
relationshipType = "http://schemas.openxmlformats.org" & _
                   "/officeDocument/2006/relationships/officeDocument"
pack.CreateRelationship(uri, TargetMode.Internal, _
                        relationshipType, "rId1")
pack.Flush()

'*** close package
pack.Close()

Remember that you must specify a content type whenever you create a new part within a package. The code in Figure 1 includes the content type required for the /word/document.xml part as the second parameter in the call to CreatePart. When you call this method and pass a content type, the packaging API takes care of adding whatever entry is required into the content type item named [Content_Types].xml that exists at the root of any package file.
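To check which content types the packaging API has recorded, you can enumerate the parts of a package and print the ContentType property of each one. The following is a minimal sketch, assuming the basic.docx file produced by Figure 1 sits in the current directory; the module name is hypothetical.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging

Module ContentTypeLister
    Sub Main()
        Using pack As Package = Package.Open("basic.docx", _
            FileMode.Open, FileAccess.Read)

            '*** list each part URI next to its content type
            For Each part As PackagePart In pack.GetParts()
                Console.WriteLine("{0} -> {1}", part.Uri, part.ContentType)
            Next
        End Using
    End Sub
End Module
```

Depending on the package, the listing may also include the relationship parts that the packaging API maintains for you.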

The last thing I want to call your attention to in Figure 1 is the call to the CreateRelationship method. This is a critical line of code because it establishes a parent/child relationship between the package and the /word/document.xml part. As you will see in the next section, creating relationships is critical. If you fail to establish the proper relationships, Word will not be able to load and render the document.
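Figure 1 leaves out the code that generates the WordprocessingML itself. One possible way to fill that gap is sketched below. It uses an XmlWriter rather than the XmlDocument shown in Figure 1, and the helper name WriteHelloWorld and the output file name are assumptions made for illustration only.

```vb
Imports System
Imports System.IO
Imports System.Xml

Module HelloWorldMarkup
    Const WordNamespace As String = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    '*** hypothetical helper: writes a one-paragraph document to a stream
    Sub WriteHelloWorld(ByVal target As Stream)
        Dim settings As New XmlWriterSettings()
        settings.Indent = True

        Using writer As XmlWriter = XmlWriter.Create(target, settings)
            writer.WriteStartDocument(True)
            '*** document > body > paragraph > run > text
            writer.WriteStartElement("w", "document", WordNamespace)
            writer.WriteStartElement("w", "body", WordNamespace)
            writer.WriteStartElement("w", "p", WordNamespace)
            writer.WriteStartElement("w", "r", WordNamespace)
            writer.WriteElementString("w", "t", WordNamespace, "Hello World")
            '*** closes every element that is still open
            writer.WriteEndDocument()
        End Using
    End Sub

    Sub Main()
        '*** write to a plain file here; in Figure 1 the target would be
        '*** the stream returned by part.GetStream
        Using output As Stream = File.Create("document.xml")
            WriteHelloWorld(output)
        End Using
    End Sub
End Module
```

In the context of Figure 1, you would pass the stream returned by part.GetStream to the helper instead of creating a StreamWriter and an XmlDocument.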

A Closer Look at Relationships

The structure of a package in the Office Open XML File Formats is heavily dependent upon relationships. As mentioned, if you create parts, but fail to associate them to the package through relationships, then consumer applications (such as Word) will not be able to recognize them. That's because every part must have a relationship or a chain of relationships that associate it with its containing package.

There are two types of relationships. First, there are package relationships that define an association between a package and its top-level parts. Figure 2 shows that the typical top-level parts in a .docx file created with Microsoft Office Word 2007 are /docProps/app.xml, /docProps/core.xml and /word/document.xml. Second, there are part relationships that define a parent/child relationship between two parts within the same package. The part /word/document.xml typically has relationships to several different child parts such as /word/settings.xml and /word/styles.xml.

Figure 2 Package Contents


The Office Open XML File Format specification states that every part inside a package must be associated either directly or indirectly with the package itself. A part such as /word/document.xml is directly associated with the package through a package relationship. Another part such as /word/styles.xml is associated with the package indirectly because it is associated with the top-level part /word/document.xml that is, in turn, associated with the package.
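The difference between the two kinds of relationships shows up in which object you call CreateRelationship on. The sketch below adds a /word/styles.xml part to the basic.docx file from Figure 1 and relates it to /word/document.xml rather than to the package. The styles content type and relationship type strings do not appear elsewhere in this column, so treat them as assumptions to verify against a document saved by Word; the empty styles element is only a placeholder.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging
Imports System.Xml

Module StylesPartSketch
    Sub Main()
        Dim mainUri As New Uri("/word/document.xml", UriKind.Relative)
        Dim stylesUri As New Uri("/word/styles.xml", UriKind.Relative)

        Using pack As Package = Package.Open("basic.docx", _
            FileMode.Open, FileAccess.ReadWrite)

            '*** create the child part (content type is an assumption)
            Dim stylesPart As PackagePart = pack.CreatePart(stylesUri, _
                "application/vnd.openxmlformats" & _
                "-officedocument.wordprocessingml.styles+xml")

            '*** write a placeholder root element into the new part
            Using stylesStream As Stream = _
                stylesPart.GetStream(FileMode.Create, FileAccess.Write)
                Using writer As XmlWriter = XmlWriter.Create(stylesStream)
                    writer.WriteStartDocument()
                    writer.WriteStartElement("w", "styles", _
                        "http://schemas.openxmlformats.org" & _
                        "/wordprocessingml/2006/main")
                    writer.WriteEndElement()
                End Using
            End Using

            '*** part relationship: the source is document.xml,
            '*** not the package (relationship type is an assumption)
            Dim mainPart As PackagePart = pack.GetPart(mainUri)
            mainPart.CreateRelationship( _
                PackUriHelper.GetRelativeUri(mainUri, stylesUri), _
                TargetMode.Internal, _
                "http://schemas.openxmlformats.org" & _
                "/officeDocument/2006/relationships/styles")
        End Using
    End Sub
End Module
```

Running the sketch a second time against the same file fails in CreatePart, because the part already exists by then.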

A very important concept to understand is that a consumer application must be able to discover any part within a package by enumerating through its relationships. In fact, when you write your own applications that read packages created by other applications such as Word and Excel, you are encouraged to discover the existing parts by enumerating relationships as well.

Let's look at an example. Imagine you want to write the code to open an existing .docx file so you can print the XML contents of the part /word/document.xml to the console window. You can take an easy approach and access this part using a hardcoded URI:

Dim pack As Package = Package.Open("c:\Data\Hello.docx", _
                                   FileMode.Open, FileAccess.Read)
Dim partUri As Uri = New Uri("/word/document.xml", UriKind.Relative)
Dim part As PackagePart = pack.GetPart(partUri)

'** now add code to program against part

This code gains access to the /word/document.xml part, which then makes it possible to open a stream and read its content. Now let's accomplish the same goal of getting the part a different way. This time I am going to enumerate through the package relationships looking for a specific relationship type. In particular, I will look for the part that is related to the package with the relationship type used for the main document part in a Word document:

Dim pack As Package = Package.Open("c:\Data\Hello.docx", _
    FileMode.Open, FileAccess.Read)

Dim relationshipType As String = _
    "http://schemas.openxmlformats.org" & _
    "/officeDocument/2006/relationships/officeDocument"

Dim rel As PackageRelationship, partUri As Uri

For Each rel In pack.GetRelationshipsByType(relationshipType)
    partUri = PackUriHelper.ResolvePartUri( _
        rel.SourceUri, rel.TargetUri)
    Dim part As PackagePart = pack.GetPart(partUri)
    '** now add code to program against part
Next

As you can see, this code calls the GetRelationshipsByType method on the Package object, passing the string name of the target relationship type. This technique allows you to enumerate through all the package relationships based on that type. In the case of discovering parts with the officeDocument relationship type shown here, there should be only one such part within the package.

After finding the correct PackageRelationship object, the code dynamically builds the URI to the target part using the PackUriHelper class provided by the .NET Framework packaging API. You can acquire the URI object you need by calling ResolvePartUri and passing the SourceUri and TargetUri of the current PackageRelationship object. Once you have dynamically built the URI for the target part, you can call the GetPart method.
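To make the resolution step concrete, here is a small sketch that calls ResolvePartUri with hand-built URIs instead of values read from a package. The source and target values are assumptions chosen to mirror a typical .docx file.

```vb
Imports System
Imports System.IO.Packaging

Module ResolveDemo
    Sub Main()
        Dim packageRoot As New Uri("/", UriKind.Relative)
        Dim mainPart As New Uri("/word/document.xml", UriKind.Relative)

        '*** package relationship: the source is the package root
        Console.WriteLine(PackUriHelper.ResolvePartUri(packageRoot, _
            New Uri("word/document.xml", UriKind.Relative)))

        '*** part relationships: targets are relative to the source part
        Console.WriteLine(PackUriHelper.ResolvePartUri(mainPart, _
            New Uri("styles.xml", UriKind.Relative)))
        Console.WriteLine(PackUriHelper.ResolvePartUri(mainPart, _
            New Uri("../customXml/item1.xml", UriKind.Relative)))
    End Sub
End Module
```

I would expect the three lines of output to be /word/document.xml, /word/styles.xml, and /customXml/item1.xml. This is why the code passes both SourceUri and TargetUri: a relative target means nothing until you know which part it is relative to.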

Also keep in mind that a Package object is a disposable object and should be managed accordingly. Therefore, your code to access items inside the package should be structured within a Using statement, like this:

Using pack As Package = _
    Package.Open("c:\Data\Hello.docx", _
                 FileMode.Open, FileAccess.Read)
    '*** code to access package goes here
End Using

Now that you have seen the individual pieces, let's look at how everything fits together. Figure 3 shows code from the sample console application named DocumentReader that accompanies this month's column. This code has been written to open the package and discover the target part using a target relationship type. The code then builds a URI so it can open the target part and acquire stream-based access to its content. When you run this console application, it produces the output shown in Figure 4.

Figure 3 Discover Target Parts

Dim relType As String
Dim rel As PackageRelationship 
Dim partUri As Uri

'*** define target relationship type
relType = "http://schemas.openxmlformats.org" & _
    "/officeDocument/2006/relationships/officeDocument"

Using pack As Package = Package.Open("minimal.docx", _
    FileMode.Open, FileAccess.Read)

    For Each rel In pack.GetRelationshipsByType(relType)
        partUri = PackUriHelper.ResolvePartUri( _
            rel.SourceUri, rel.TargetUri)
        Dim part As PackagePart = pack.GetPart(partUri)
        Dim partStream As Stream, partReader As StreamReader
        partStream = part.GetStream(FileMode.Open, FileAccess.Read)
        partReader = New StreamReader(partStream)
        '*** print contents of part to console window
        Console.WriteLine(partReader.ReadToEnd)
        '*** close stream
        partStream.Close()
    Next

End Using

Figure 4 Output from Console Application


The Package Viewer Sample Application

Now that you have seen how to discover and access parts within a package using its relationships, let's take this idea a little bit further. As I have mentioned, the specification for the Office Open XML File Format states that all parts within a package must be discoverable through relationships. Therefore, it's possible to write an application that inspects a package and displays all the parts inside of it.

This month's column is accompanied by a second sample application named Package Viewer, a Windows® Forms-based application that populates a tree view control with nodes that show a package and all of its parts nested within a hierarchy of relationships. Figure 5 shows the Package Viewer application in action. You will notice that the application also provides more information about the package and specific parts when you click on a node in the tree view. For each part, you can see its content type, its parent, and the relationship type that associates it with its parent.

Figure 5 Package Viewer Inspects the Parts in a Package


Package Viewer allows the user to select a package file using the standard Open File dialog. Once the user selects a file, the application enumerates through all the package relationships to discover the top-level parts. Figure 6 shows the basic structure of the code, omitting some of the details of populating the Windows tree view control with TreeNode objects.

Figure 6 Populating the Tree View

Dim rootNode As TreeNode 
rootNode = New TreeNode(PackageName, 0, 0)

For Each rel As PackageRelationship _
    In CurrentPackage.GetRelationships()

    Dim PartUri As Uri = PackUriHelper.ResolvePartUri( _
        rel.SourceUri, rel.TargetUri)
    Dim part As PackagePart = CurrentPackage.GetPart(PartUri)
    Dim topLevelPartNode As TreeNode = New TreeNode( _
        part.Uri.OriginalString, 1, 1)

    '*** call helper method to discover child parts
    PopulateChildPartNode(part, topLevelPartNode)

    '*** add node to root node
    rootNode.Nodes.Add(topLevelPartNode)
Next

treePackageContents.Nodes.Add(rootNode)

Calling the GetRelationships method on the Package class allows the application to enumerate through all the top-level parts inside the package. If you look inside the For Each loop, you can see that each iteration for a specific top-level part creates a TreeNode object and also calls a helper method named PopulateChildPartNode to deal with the discovery of any related child parts and to create TreeNode objects for them as well. Now let's examine the code inside the PopulateChildPartNode method (see Figure 7).

Figure 7 PopulateChildPartNode Method

Private Sub PopulateChildPartNode( _
    ByVal part As PackagePart, ByVal partNode As TreeNode)

    For Each rel As PackageRelationship In part.GetRelationships()

        Dim ChildPartUri As Uri = _
            PackUriHelper.ResolvePartUri(rel.SourceUri, rel.TargetUri)
        Dim ChildPart As PackagePart = part.Package.GetPart(ChildPartUri)

        Dim ChildPartNode As TreeNode = _
            New TreeNode(rel.TargetUri.OriginalString, 1, 1)             

        '** use recursion 
        PopulateChildPartNode(ChildPart, ChildPartNode)

        partNode.Nodes.Add(ChildPartNode)

    Next

End Sub

You can see that each PackagePart object exposes a GetRelationships method. This makes it possible to enumerate through the child parts for each top-level part. However, you should keep in mind that the relationship hierarchy of parts can grow to an arbitrary number of levels with child parts having child parts having child parts, and so on. Therefore, the application is designed to use recursion by having the PopulateChildPartNode method call itself. This makes it possible to write a single method to crawl as many levels as necessary to populate the tree view control with all parts that can be found within the current package.

If you examine the code for the Package Viewer application, you will see that it employs a user-defined structure and the Tag property of each TreeNode object to track various attributes for the entity that each node represents. For example, there's information to track whether each node represents a package, a part, or an external relationship. For nodes that represent parts, there is information to track the part's content type, parent, and the relationship type it has with its parent. When you click on a node in the tree, this information is used to populate the controls on the right-hand side of the application's main form.
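The abbreviated listings in Figures 6 and 7 resolve every relationship target as if it were a part. A relationship whose TargetMode is External points outside the package, so there is no part to fetch for it. The following console sketch is a hypothetical counterpart to PopulateChildPartNode, not the code that ships with Package Viewer. It reports external targets without resolving them and remembers the parts it has already visited, so that a part reachable by more than one path is expanded only once.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.IO.Packaging

Module PackageWalker
    '*** hypothetical helper: prints the relationship hierarchy
    Sub WalkRelationships(ByVal pack As Package, _
        ByVal rels As IEnumerable(Of PackageRelationship), _
        ByVal depth As Integer, ByVal visited As List(Of String))

        For Each rel As PackageRelationship In rels
            Dim indent As New String(" "c, depth * 2)

            '*** external targets are not parts, so just report them
            If rel.TargetMode = TargetMode.External Then
                Console.WriteLine(indent & "[external] " & _
                    rel.TargetUri.OriginalString)
                Continue For
            End If

            Dim partUri As Uri = PackUriHelper.ResolvePartUri( _
                rel.SourceUri, rel.TargetUri)
            Console.WriteLine(indent & partUri.OriginalString)

            '*** part names are compared without regard to case
            Dim key As String = partUri.OriginalString.ToLowerInvariant()
            If visited.Contains(key) Then Continue For
            If Not pack.PartExists(partUri) Then Continue For
            visited.Add(key)

            '*** use recursion for the child parts
            WalkRelationships(pack, _
                pack.GetPart(partUri).GetRelationships(), _
                depth + 1, visited)
        Next
    End Sub

    Sub Main(ByVal args As String())
        '*** args(0) is the path of the package to inspect
        Using pack As Package = Package.Open(args(0), _
            FileMode.Open, FileAccess.Read)
            WalkRelationships(pack, pack.GetRelationships(), _
                0, New List(Of String))
        End Using
    End Sub
End Module
```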

Package Viewer also provides the functionality to display the contents of any XML-based parts within the package. It accomplishes this by reopening the package and acquiring stream-based access to the target part. Then the XML-based content of the part is written to a temporary file. Finally, the file is loaded into the Windows Forms WebBrowser control, which displays the XML content with color coding and collapsible sections. Hopefully, you will find this sample application useful as you begin learning about the parts and the XML required to work with Office Open XML File Format documents.

Now you should have some idea of what's required to generate documents for Word, Excel, and PowerPoint using the Office Open XML File Formats. In theory, it's simple. All you have to do is create a new package file, add the required parts, and fill them up with XML content structured in accordance with the appropriate XML schemas. But while the theory is simple, getting up to speed on the details will take some time. It will also require you to read through the relevant sections of the file format specifications, which can be downloaded from OpenXmlDeveloper.org.

If you are going to work with Word documents, you must learn what type of parts go inside a package and how they must be structured in terms of content types and relationships. You also have to learn how to generate the WordprocessingML that goes inside each of these parts. If you want to work with Excel spreadsheets, the content types and relationships will be different. Instead of using WordprocessingML, you'll have to learn SpreadsheetML. It will take an investment on your part if you want to generate the XML required to create documents from scratch that contain things like tables, graphics, and formatting.

New Developer Features in Word 2007

I'll conclude this month's column by showing you a technique for creating professional-looking Word 2007 documents that will not require writing code to generate WordprocessingML. I'll start by introducing two new features of Word 2007 that can be used when working with documents stored in the new Office Open XML File Formats. The first feature is the XML data store, which allows you to embed one or more user-defined XML documents as parts inside a .docx file. The second feature is content controls, which are user interface components defined inside the /word/document.xml part that support data entry and data binding.

If you want to experiment with using content controls, you should first go to the Word Options dialog in Word 2007 and, on the Popular tab, make sure the Show Developer tab in the Ribbon option is checked. On the Developer tab you will see the set of controls shown in Figure 8. All of these controls can be added to a Word document.

Figure 8 Developer Tab Controls


Note that you can add content controls only to Word documents saved in the new .docx format. They are not supported in .doc files because they cannot be defined using the older binary format. When working in the new format, however, you can add content controls to provide user input elements in a Word document. For example, you can construct a Word document that prompts the user for the pieces of information needed to complete a business document, such as a date selected from a calendar.

Keep in mind that content controls have two different modes: edit and display. Edit mode allows the user to do things such as type in text, pick a date from a date picker, or select an item from a dropdown list. Display mode is optimized for rendering and printing. In other words, the editing aspects of content controls are invisible to the user after they have finished being edited and they are no longer in edit mode.

Separating Data from Presentation in Word 2007

While the XML data store and content controls are two separate features of Word 2007 that can be used independently of one another, they really pack a punch when used together. For example, you can embed an XML file with the data for a customer or an invoice inside a Word document. Then you can bind content controls to data inside the XML file using XPath expressions. This effectively allows you to separate your data from the formatting instructions that tell Word how to display it.

Let's start with a simple example. You can follow along by examining the sample Word document named CustomerEntry.docx in the download accompanying this month's column. I encourage you to inspect this file with Package Viewer as this allows you to quickly see how all the pieces fit together. However, when actually building one of these documents yourself, you should change the extension of the .docx file to .zip so you can open the package file directly and manipulate its contents as I discussed in the November 2006 Basic Instincts column.

CustomerEntry.docx contains a simple user-defined XML document that contains customer data:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Customer xmlns="http://litware.com/2006/customer">
  <Name>Brian Cox</Name>
  <Address>2732 Baker Blvd</Address>
  <City>Eugene</City>
  <State>OR</State>
  <Zip>97403</Zip>
</Customer>

This XML document is embedded as a part inside the .docx file using a URI of /customXml/item1.xml. Note that you can name this part something other than item1.xml. However, I want to be consistent with the naming scheme that Word 2007 uses when it creates and renames parts in the XML data store.

Next, you need to create a way to identify the content within the customer XML document. You do this by defining a dataStoreItem with an identifying GUID. This is accomplished by creating a part named /customXml/itemProps1.xml and establishing a relationship between this part and its parent part, /customXml/item1.xml. Here's an example of what the contents of /customXml/itemProps1.xml should look like:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:dataStoreItem ds:itemID="{D5CB27EE-AE18-48C7-B53F-E921F4653E70}"
    xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml">
  <ds:schemaRefs>
    <ds:schemaRef ds:uri="http://litware.com/2006/customer"/>
  </ds:schemaRefs>
</ds:dataStoreItem>

The next step is to create a part relationship between /word/document.xml and /customXml/item1.xml. The relationship should define the parent part as /word/document.xml and child part as /customXml/item1.xml. The relationship type should be established using the following string:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXml
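If you would rather perform these steps in code than by editing the ZIP archive by hand, the packaging API can do the same work. The sketch below is written under several assumptions. The file names MyDocument.docx, customer.xml, and itemProps1.xml are hypothetical. The two content types and the customXmlProps relationship type do not appear in this column; they are the values I understand Word 2007 to write, so compare them against a document saved by Word before relying on them.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging

Module DataStoreSetup
    Const RelBase As String = "http://schemas.openxmlformats.org" & _
        "/officeDocument/2006/relationships/"

    '*** hypothetical helper: creates a part from the bytes of a file
    Function AddPart(ByVal pack As Package, ByVal partName As String, _
        ByVal contentType As String, ByVal sourceFile As String) _
        As PackagePart

        Dim part As PackagePart = pack.CreatePart( _
            New Uri(partName, UriKind.Relative), contentType)
        Dim content As Byte() = File.ReadAllBytes(sourceFile)
        Using target As Stream = _
            part.GetStream(FileMode.Create, FileAccess.Write)
            target.Write(content, 0, content.Length)
        End Using
        Return part
    End Function

    Sub Main()
        Using pack As Package = Package.Open("MyDocument.docx", _
            FileMode.Open, FileAccess.ReadWrite)

            '*** the data and its dataStoreItem properties
            Dim item As PackagePart = AddPart(pack, _
                "/customXml/item1.xml", "application/xml", "customer.xml")
            Dim props As PackagePart = AddPart(pack, _
                "/customXml/itemProps1.xml", _
                "application/vnd.openxmlformats" & _
                "-officedocument.customXmlProperties+xml", _
                "itemProps1.xml")

            '*** item1.xml is the parent of itemProps1.xml
            item.CreateRelationship( _
                PackUriHelper.GetRelativeUri(item.Uri, props.Uri), _
                TargetMode.Internal, RelBase & "customXmlProps")

            '*** document.xml is the parent of item1.xml
            Dim mainPart As PackagePart = pack.GetPart( _
                New Uri("/word/document.xml", UriKind.Relative))
            mainPart.CreateRelationship( _
                PackUriHelper.GetRelativeUri(mainPart.Uri, item.Uri), _
                TargetMode.Internal, RelBase & "customXml")
        End Using
    End Sub
End Module
```

The sketch assumes the document does not yet contain a /customXml folder; CreatePart fails if a part with the same name already exists.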

After you have gone through all these steps to set up a user-defined XML document as an identifiable dataStoreItem, you can access its data from /word/document.xml. For example, you can write Visual Basic® for Applications (VBA) code behind a macro-enabled Word 2007 document that retrieves a CustomXmlPart object from the ActiveDocument object and pulls the required customer data out of the embedded XML document. In the example here, I am going to create content controls in /word/document.xml and bind them to particular elements within the XML document with the customer data.

Unfortunately, Word 2007 provides no way through its user interface to bind content controls to elements within the XML data store. Therefore, you are required to make direct edits to the /word/document.xml using an XML editor such as Visual Studio® 2005. Here's an example of the WordprocessingML element you can add to the /word/document.xml part to create a binding to the Name element within the Customer element of the user-defined XML document:

<w:sdtPr>
  <w:id w:val="256125470"/>
  <w:dataBinding 
    w:prefixMappings="xmlns:ns0='http://litware.com/2006/customer'"
    w:xpath="/ns0:Customer[1]/ns0:Name" 
    w:storeItemID="{D5CB27EE-AE18-48C7-B53F-E921F4653E70}"/>
  <w:text/>
</w:sdtPr>

The dataBinding element contains a storeItemID attribute that references the GUID of the dataStoreItem that holds the customer data. It also contains an xpath attribute, which defines the XPath expression that binds the content control to a specific element within the user-defined XML file, and a prefixMappings attribute that declares the namespace prefix used in that expression. Once you have updated the /word/document.xml part to contain all the dataBinding elements you need, Word displays the relevant data from the user-defined XML file when it opens the document.
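Because these bindings are edited by hand, it helps to have a quick way to confirm what actually ended up in the document. The following sketch lists every dataBinding element in /word/document.xml together with its XPath expression and store item ID. It assumes the CustomerEntry.docx sample is in the current directory and, for brevity, reaches the main document part through a hardcoded URI.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging
Imports System.Xml

Module BindingLister
    Const WordNamespace As String = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    Sub Main()
        Using pack As Package = Package.Open("CustomerEntry.docx", _
            FileMode.Open, FileAccess.Read)

            '*** load the main document part into an XmlDocument
            Dim mainPart As PackagePart = pack.GetPart( _
                New Uri("/word/document.xml", UriKind.Relative))
            Dim doc As New XmlDocument()
            Using partStream As Stream = _
                mainPart.GetStream(FileMode.Open, FileAccess.Read)
                doc.Load(partStream)
            End Using

            Dim names As New XmlNamespaceManager(doc.NameTable)
            names.AddNamespace("w", WordNamespace)

            '*** the binding attributes live in the w: namespace too
            For Each node As XmlNode In _
                doc.SelectNodes("//w:sdtPr/w:dataBinding", names)
                Dim binding As XmlElement = CType(node, XmlElement)
                Console.WriteLine("{0} in store item {1}", _
                    binding.GetAttribute("xpath", WordNamespace), _
                    binding.GetAttribute("storeItemID", WordNamespace))
            Next
        End Using
    End Sub
End Module
```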

While it might take some time to become fluent with the technique of manually constructing Word documents that contain user-defined XML files and bound content controls, it is worth the effort. It provides an elegant solution to separating your data from the presentation of that data. Once you have created a minimal Word document with bound content controls, you can do the rest of your work for making the document look professional directly within Word. For example, you can add text and a logo just as any other Word user would. You can create new tables and sections and simply drag and drop the content controls where you would like them to appear.

It's also important to keep in mind that bound content controls can provide two-way synchronization with data in a user-defined XML file. While users can see whatever data you have added to the user-defined XML file, they can also update this data. When a user updates the data in a content control and saves the document, those changes are written back to the user-defined XML file. This makes it very easy to extract the updated data in an XML format that is very easy to use.
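Extracting the updated data is then a matter of following the customXml relationship from /word/document.xml and reading the target part. Here is a minimal sketch, assuming the CustomerEntry.docx sample and its Customer schema shown earlier; it prints only the Name element.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging
Imports System.Xml

Module CustomerReader
    Sub Main()
        Dim relType As String = "http://schemas.openxmlformats.org" & _
            "/officeDocument/2006/relationships/customXml"

        Using pack As Package = Package.Open("CustomerEntry.docx", _
            FileMode.Open, FileAccess.Read)

            Dim mainPart As PackagePart = pack.GetPart( _
                New Uri("/word/document.xml", UriKind.Relative))

            '*** each customXml relationship leads to one data store item
            For Each rel As PackageRelationship In _
                mainPart.GetRelationshipsByType(relType)

                Dim itemUri As Uri = PackUriHelper.ResolvePartUri( _
                    rel.SourceUri, rel.TargetUri)
                Dim data As New XmlDocument()
                Using itemStream As Stream = pack.GetPart(itemUri) _
                    .GetStream(FileMode.Open, FileAccess.Read)
                    data.Load(itemStream)
                End Using

                '*** skip items that do not hold customer data
                Dim names As New XmlNamespaceManager(data.NameTable)
                names.AddNamespace("c", "http://litware.com/2006/customer")
                Dim nameNode As XmlNode = _
                    data.SelectSingleNode("/c:Customer/c:Name", names)
                If nameNode IsNot Nothing Then
                    Console.WriteLine(nameNode.InnerText)
                End If
            Next
        End Using
    End Sub
End Module
```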

Updating the XML Data Store Programmatically

Now you have seen how to bind content controls to a user-defined XML document in the XML data store. The final step is to write code that generates a new instance of that XML document with the data for a particular customer and overwrites the XML document inside an existing Word document that has been set up with the proper bound content controls.

The download for this month's column includes one more sample application named Letter Generator, an ASP.NET 2.0 application that works with a Word document named LetterTemplate.docx. This Word document has an embedded XML document shown in Figure 9 that contains all the content required for a Litware Corporation employee to draft a letter to a customer.

Figure 9 User-Defined XML Document

<?xml version="1.0" encoding="utf-8"?>
<LitwareLetter xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
               xmlns="http://litware.com/2006/letters">
  <Customer>
    <ContactFirstName>Yoshi</ContactFirstName>
    <ContactLastName>Latimer</ContactLastName>
    <Company>Hungry Coyote Import Store</Company>
    <Address> 516 Main St.</Address>
    <City>Elgin</City>
    <State>OR</State>
    <Zip>97827</Zip>
  </Customer>
  <Date>October 16, 2006</Date>
  <Body>Thanks for your recent purchase at Litware </Body>
  <Employee>
    <Name>Rob Verhoff</Name>
    <Title>Director of Accounting</Title>
  </Employee>
</LitwareLetter>

The Letter Generator application has been written to generate XML documents in this consistent form by pulling data out of three different tables in an Access™ MDB file. I will let you use your imagination about how you could extend this sample to extract data from other data sources such as SQL Server™ or a custom Web service.

When a user retrieves the page named GenerateLetter.aspx, the user is presented with WebControls that allow for the selection of a customer, a letter type, and the employee who is intended to sign the letter. Once the user has chosen all the correct parameters for the letter to be drafted, the user can then click the button with the caption of Generate Letter to begin the server-side processing of the request.

The server-side code behind the Generate Letter button begins by loading an existing Word 2007 document named LetterTemplate.docx into a MemoryStream object and opening it up with the .NET packaging API. Next, the code creates a new XML document according to the parameters requested by the user. It then opens the /customXml/item1.xml part and overwrites its content with the newly created XML document. Finally, the code closes the package file and streams the resulting Word document back to the user using the same technique I discussed in the November 2006 Basic Instincts column. The end result is that the user is able to automate the process of drafting a customer letter.
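The core of that processing can be sketched in a few lines. This is not the code from the Letter Generator download. The helper name BuildLetter and the file names letter.xml and Letter.docx are assumptions, and the sketch writes the result to disk rather than to an ASP.NET response so that it stays self-contained.

```vb
Imports System
Imports System.IO
Imports System.IO.Packaging
Imports System.Xml

Module LetterSketch
    '*** hypothetical helper: returns the bytes of a document whose
    '*** /customXml/item1.xml part has been replaced with letterData
    Function BuildLetter(ByVal templatePath As String, _
        ByVal letterData As XmlDocument) As Byte()

        Dim template As Byte() = File.ReadAllBytes(templatePath)

        '*** copy into an expandable stream; New MemoryStream(template)
        '*** would be fixed in size and could not grow
        Using docStream As New MemoryStream()
            docStream.Write(template, 0, template.Length)
            docStream.Position = 0

            Using pack As Package = Package.Open(docStream, _
                FileMode.Open, FileAccess.ReadWrite)
                Dim part As PackagePart = pack.GetPart( _
                    New Uri("/customXml/item1.xml", UriKind.Relative))

                '*** FileMode.Create overwrites the existing content
                Using partStream As Stream = _
                    part.GetStream(FileMode.Create, FileAccess.Write)
                    letterData.Save(partStream)
                End Using
            End Using

            '*** read the bytes only after the package has been closed
            Return docStream.ToArray()
        End Using
    End Function

    Sub Main()
        Dim data As New XmlDocument()
        data.Load("letter.xml")
        File.WriteAllBytes("Letter.docx", _
            BuildLetter("LetterTemplate.docx", data))
    End Sub
End Module
```

Working on an in-memory copy leaves the template file untouched, which matters on a Web server where many requests share the same template.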

Summary

This month's column has continued my examination of working with the Office Open XML File Formats. You should now understand how relationships and relationship types play a key role in defining the structure for a Word 2007 document. You have also seen a powerful new technique for creating Word 2007 documents that separates data from presentation using content controls bound to elements in a user-defined XML document. Hopefully, this has fueled your imagination and given you a starting point for creating applications and components that automate the process of creating standard business documents for users that rely on Word, Excel, and PowerPoint.

Send your questions and comments for Ted to mmoffice@microsoft.com.

Ted Pattison is an author, trainer, and SharePoint MVP who lives in Tampa, Florida. Ted is writing a book titled Inside Windows SharePoint Services 3.0 for Microsoft Press and delivers advanced SharePoint training to professional developers through his company, Ted Pattison Group.