Some background information on the reasons we have moved to an XML format as the default in Office "12"

09/29/2005

This continues to be a really good discussion and I want to thank everyone who has taken time to post their comments. I know there is a lot to read through and it can get a bit confusing at times, so I'm really glad to see so many of you are up for it. There were a number of comments in the last post where folks said that ultimately there was just a lack of trust in our motivations. That really concerns me so I wanted to get back to a discussion around our motivations. I think that after you see clearly why we are doing this work, you'll probably have a better time understanding where it's going and you'll see why we aren't going to "pull the rug out from under anyone." As I mentioned, Steven Sinofsky addressed this over a year ago: https://www.microsoft.com/office/xml/response.mspx

I think the first place to start is to quickly dispel a common myth I've heard a lot over the years. Some people seem to think that the file formats have some special importance in some kind of competition. It's just not true; at least it hasn't been true for a long time. You have to remember that Microsoft first started developing Office over 20 years ago. At the time, the binary formats were the state of the art. They were fast, small and optimized to take advantage of the feature set of the product. At the same time, they were brittle and not very good if you wanted to reuse the data in the document or attempt some kind of interop with another program. These limitations weren't intentional; that was just the state of the technology. Corel and Lotus had the same issue. Over the years, we've continued to make some adjustments to those formats, but it's been very incremental and we've placed a high value on backward compatibility. The last breaking change to today's binary formats goes back to the start of the Office '97 project (Jan '94 I believe). The main issue we were worried about at that point was that we wanted the format to mirror our internal memory structures so that it was easy to read and write portions of it to disk. We were still concerned with how documents behaved when stored on a floppy disk, and with performance in low-memory environments. Document behavior within business processes wasn't on our mind at all. The typical user in the early 90’s didn’t have a LAN and didn’t share docs all that much; they mainly printed. It was also standard procedure for apps to upgrade the file format with each release to enable new features to be saved. Since people didn’t share files much (other than in their printed form), the main scenario was a single user updating their own version.
Of course old version files could be read in the new version, but there was no need to output the old version since you didn’t use it anymore now that you had the new one. In any event, this is hardly the behavior of a company that thinks it has some tremendous value in its particular format. To me, it seems more consistent with a company that cares about its customers' experience...

For Office 2000, the internet wave had hit hard. We all thought that web pages were the documents of the future and we wanted Office applications to have the ability to save and edit those documents. So, we spent a ton of time (almost 25% of the overall dev budget) making it possible to save any Office document in HTML. Anyone who has used our HTML functionality extensively knows that it's hard to balance HTML simplicity with document fidelity. At the time we were very proud of our work here and couldn't wait to see it take off. I believe that we learned a lot from that experience. Our scenario was that people would start saving “docs” as HTML on their intranet sites and browse them with the browser. We viewed the browser as “electronic paper” that we had to “print” to (i.e. perfect fidelity). We had already got a lot of feedback from our Word97 Internet Assistant add-in that any loss of fidelity when saving as a web page was unacceptable and a “bug”. As it turned out, this usage scenario did not become as common as we thought it would and a zillion conspiracy theories formed about why we “really” did it. Many people assumed that a better approach would have been to save as “clean” HTML even if the result did not look exactly like what the user saw on the screen. We felt that the core Office applications (other than FrontPage) were not really meant to be web page authoring tools, so we focused on converting docs to exact replicas in HTML. We didn't want people losing any functionality when saving to HTML, so we had to figure out a way to store everything that could have existed in a binary document as HTML. We thought we were clever creating a bunch of "mso-" CSS properties that allowed us to roundtrip everything. HTML didn't take off in the way we had expected. Today, the biggest use of Office HTML is within e-mail (WordMail), followed by interoperability on the clipboard.

For Office XP (started in 1998), we really started thinking seriously about how Office documents were used outside of the applications. Of course the SGML folks (like Charles Goldfarb and Jean Paoli, who has been working on this stuff for over 2 decades) had been saying that for ages, and they were right! We had spent so much time focusing on making it really easy to create documents, but hadn't thought a lot about what happens once those documents are created. This is one of the benefits we saw in a new feature called SmartTags. We not only wanted to give you useful actions to take on the content of your documents, we also wanted to make it possible to tag that content so that it could be leveraged by other processes. We also built the first XML file format in Office, SpreadsheetML. Folks on Wall Street and in finance offices in particular had wanted ways to pull information out of their financial models. For equity research reports, they valued both the speed with which they could publish a report, and the accuracy of the data in the report. Storing their Excel models as XML made it easier to quickly pull data out without having to run Excel. This meant they could run their code on a server, and then use that data to verify that the information in the reports was accurate. This was really just the beginning though. The two big problems were that the SpreadsheetML format wasn't full fidelity (meaning not everything in the file could be saved as XML), and there wasn't an XML format for Word, which they used to generate the reports.

In Office 2003, we really started to gain a lot of momentum around XML. We had heard from a number of big customers that they needed XML support for their Word documents. People were trying all kinds of hacks on top of the Object Models to produce XML that they could work with. We had Wall Street firms that needed to integrate with XML more deeply than we had imagined, so that they could do structured authoring with repurposable data. We had law firms that were trying to build solutions that could automatically generate legal documents based on data about who was involved in the case, as well as business logic around what pieces of content were required for that case. We also were getting a lot of demand for supporting other people's existing internal schemas. Not only did people want the Word document itself represented in XML, they also wanted to add their own XML markup to the files. Let's take a government office as an example here. Imagine they have a template that folks can use to apply for a permit. While it's nice that the formatting information can be represented in XML, they don't care as much about what's bold, numbered, or any other kind of formatting. What they do care about is the name of the person who submitted the application, what their address is, and what type of work they are seeking a permit for. Those things can all be labeled using content controls and custom XML.

It was this support for the reference schemas (SpreadsheetML and WordprocessingML) in combination with support for customer-defined schemas (your own XML) that finally made it possible for the content of Office documents to play a role in business processes. We had moved from a world where the Office document was a black box with only a small collection of metadata scrawled on top, to one where it was an open, interoperable, extensible, and extremely valuable piece of business processes.

At the same time, there are zillions of documents out there in older binary formats. We had to ask ourselves "who is going to take care to make sure those older documents have a path forward?" "Who is focusing on doing the hard work to preserve fidelity between the new and the old?" We're doing that. We're making a deep investment in this compatibility to make sure our customers have a very good experience.

Now we move to Office "12". We are still building on the momentum we started over 6 years ago. Not only are we improving the XML formats so that they can represent every Word, PowerPoint, and Excel document out there, but we are making them the default format. We viewed this as something that we absolutely had to do in this version. Office documents are so much more important as elements of business processes than we had initially been giving them credit for. You may have seen how we now talk about Office as a system. This is because it's no longer about the document's behavior in the application. It's about the entire document lifecycle. We have helped ourselves in all kinds of ways that no one has really thought about (or at least written about) yet. We can build smarts into Windows SharePoint Services so that the server can actually look into the document, make decisions based on the document content, and write data back into the document, all without having to run application code. We have a world where customers need to track and audit parts of documents in ways they never needed to before.

We have customers in equity research who can't wait for these new formats with the content controls and custom XML support. The speed with which they will be able to publish their documents, while at the same time meeting increasing regulatory requirements, is amazing. All the information within each research report is available to them. The system used to consist of printing out the report and having humans read through each one, verifying the financial figures and making sure they had all the necessary disclosures. Now that can just be an easily automated piece of the larger workflow.

There is a customer (a bank) that we've been meeting with that generates documents on demand for all their loans. They are currently running Office 2000. These documents are built using smaller document fragments, and the logic for which fragments are used is based on the details of the particular loan. The data is then pushed into the document using the Word Object Model to find the relevant bookmarks and fill them in. They do this in an automated fashion and turn out thousands of these documents a year. They currently have over 70 servers, each with Word 2000 installed, to produce these documents. Running Word unattended isn't supported, but they've decided to do it anyway (they didn't really have a choice). Now with the new XML formats and the support for custom-defined schemas, generating these documents will be a snap. It wouldn't even take up one full machine's resources. It will only need a small bit of code to handle the business logic. The code to build the document itself will only be a few lines.
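To give a feel for what building a document without running Word might look like, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the part names and namespaces are the ones from the Open XML format as it was later standardized (the Office "12" format was still in development when this was written), and the `FRAGMENTS` table, the `loan` fields, and `build_loan_docx` are hypothetical stand-ins for the bank's real business logic.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Namespaces and content types from the later-standardized format (assumption).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)

# Hypothetical document fragments, selected by the details of the loan.
FRAGMENTS = {
    "fixed": "This loan carries a fixed rate of {rate}%.",
    "variable": "This loan carries a variable rate, currently {rate}%.",
}


def build_loan_docx(loan):
    """Return the bytes of a small .docx assembled from fragments and loan data."""
    paragraphs = [
        "Loan agreement for {name}".format(**loan),
        FRAGMENTS[loan["type"]].format(**loan),
    ]
    # Each paragraph becomes one w:p element; text is XML-escaped.
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W, body)
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", document)
    return buf.getvalue()
```

Writing the result of `build_loan_docx({"name": "Contoso", "type": "fixed", "rate": 5})` to a file with a `.docx` extension should give a document with two paragraphs. A real solution would also carry styles, headers, and the customer's own XML, but the shape stays the same: pick the fragments, fill in the data, and zip up the parts.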

The last example I have is one that benefits us in Office. Today, we have a couple thousand specifications that we've written for the Office "12" project. For each spec, there are a number of required sections that people need to fill out based on different processes we have for our design. The folks driving any of those processes need to be able to make sure that everyone has filled out the proper sections. When the files were all binary documents, we had to automate Word to be able to do this check. The automation had Word open the file, find the range of text for the specific section, and see if it was filled in. It would take about 8 hours to run the check across those few thousand documents. Because of this we only ran the check every couple of weeks, and it would have to kick off at night when folks were leaving and be checked in the morning. Often the check would fail, so we'd wait until the next night and run it again. At PDC the other week, I showed a similar collection of documents (actually it was only about 300). These documents were all stored in the new format though. I wrote a small amount of VB.NET (30 lines of code) that iterated over all those documents and returned the author, counted all the paragraphs, and counted how many comments there were. Running that solution (which was already more complex than what we were trying to do internally) took about 1 to 2 seconds. So, if I had increased the collection to 3000, it would have been at most 20 seconds (compared to 8 hours)!
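The PDC demo code isn't reproduced here, but the idea can be sketched in a few lines of Python. This is a rough equivalent and not the original VB.NET: it assumes the part names (`docProps/core.xml`, `word/document.xml`, `word/comments.xml`) and namespaces of the Open XML format as later standardized, and `summarize` is a name made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces from the later-standardized format (assumption).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC = "http://purl.org/dc/elements/1.1/"


def summarize(docx):
    """Return (author, paragraph count, comment count) for one document.

    `docx` is a path or a file-like object. Word itself is never started;
    the file is just a ZIP archive with XML parts inside.
    """
    with zipfile.ZipFile(docx) as z:
        names = set(z.namelist())

        # The author lives in the core properties part, if there is one.
        author = None
        if "docProps/core.xml" in names:
            core = ET.fromstring(z.read("docProps/core.xml"))
            author = core.findtext("{%s}creator" % DC)

        # Count every w:p element in the main document part.
        body = ET.fromstring(z.read("word/document.xml"))
        paragraphs = sum(1 for _ in body.iter("{%s}p" % W))

        # Comments are stored in their own part, one w:comment each.
        comments = 0
        if "word/comments.xml" in names:
            root = ET.fromstring(z.read("word/comments.xml"))
            comments = len(root.findall("{%s}comment" % W))

    return author, paragraphs, comments
```

Looping `summarize` over a folder of files (for example with `pathlib.Path(folder).glob("*.docx")`) gives the same kind of report as the demo. The work per document is just unzipping and parsing a few small XML parts, which is why the time grows roughly in line with the number of documents: if 300 documents take 1 to 2 seconds, 3000 should take about 10 to 20.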

We knew a long time ago that customers and the development community would ask what they could do with the new Office XML formats since they are specifically designed to address scenarios that go beyond the desktop. That is why we decided to take an open and royalty-free approach almost two years ago when we launched Office 2003. There has been a lot of back and forth in this blog on whether we went far enough and whether our motives are pure. It is sort of fun to question motives and pick apart licenses (personally, I'd rather be talking about the design of the formats), but I can tell you that our intent is to make the formats useful to customers and the development community. If we wanted to create a bunch of "gotchas" to trip people up, I think we could have done a better job.

A side benefit of this move is that now that we are creating a new format, we can do a lot of the other things our customers have wanted us to do within the binary formats for the past few releases (which we weren't able to do since we didn't want to break compatibility). Improved robustness, smaller file sizes, and new features are all added side benefits. I already mentioned how Excel is now able to increase the limits on the number of rows and columns, as well as other limits it had when confined to the existing binary formats. We've also found that using ZIP and XML leads to a significantly more robust file. I've given demos where I delete whole blocks of bits from the files and we're still able to recover the remainder of the content. We see so many benefits to this new format, we often forget to mention all the best parts.

We've been fortunate to get a lot of great support from the public sector for our work. We’ve been working for many years now with governments to understand their needs with XML and they understand what we’ve been doing and our commitment to being open.

Massachusetts is obviously an interesting case and our competitors are having a lot of fun trying to turn this into a bigger story, but from what I've heard, I think some officials at the State were duped. There is no question that this licensing stuff can be really confusing. Just a few months ago, a government official from Massachusetts took a hard look at the Office XML program and publicly stated that his office found it to be "open" and fully consistent with the State's policies. Look here: (https://www.governmentciosummit.ca/GovernmentCIOLeadershipSummit page 23). For the most part, that announcement sort of inspired a yawn around here because our program had already been out for a year and had received a lot of good feedback from other governments. What happened after that? Well, the guy was deluged by lobbyists and influencers who told him that was a bad decision. People told him that the licenses were full of ghosts and scary shadows and bogeymen who would be bad for Massachusetts. It was tough to resist this line of argument because IBM/Lotus and Sun have a big presence in his State. The official himself had also been the CEO of an open source company just before taking office. So, just before he left office (yes, he just took off), he fired off his shotgun with this new policy while running out the door without really thinking through all the implications. Starting to get a picture of what happened? It's actually even a bit uglier than that, but I won't bore you with the details.

Anyway, that's life and we're going to work through the issue. We're already taking an open approach, so we are fundamentally supporting the vision of governments that are interested in open formats. There are also a bunch of smart people in Massachusetts who are trying to do the right thing and we want to work with them in a constructive way. That's our plan.

As usual, I welcome your feedback.

-Brian
