The myth of the Binary Key
I don't know where this myth began, but I have seen enough references to it at this point that I think it's time to call it out directly. There is no such thing as a binary key that you need to unlock the Microsoft Office XML formats. They are just pure XML files that are fully documented (and have been for a while now). This isn't something where I'm asking you to just trust me; instead you can go and look for yourself. Take Office 2003 and save any Word document you have as XML. Now open that file up in a text editor and take a look. (If you don't have your own copy of Office 2003, try this free lab online that lets you play with the XML functionality.)
I'm trying to figure out how this rumor was started, and I have a couple ideas, so let's try and track this down. Let's talk a bit about the format so that you can understand what's there. Take any XML file saved from Word 2003:
- Processing Instruction: As I discussed in this post (http://blogs.msdn.com/brian_jones/archive/2005/07/07/436647.aspx), if you try to open the file in IE, it will most likely be redirected to Word for opening because we put the following declaration in a processing instruction at the top of the file: <?mso-application progid="Word.Document"?> If you want to open the file in IE, you'll need to delete that PI. Is that the mythic binary key folks are talking about? It doesn't affect the way the file is displayed. All it does is tell the shell that Word can open the file.
- Pretty Printing: As I discussed in this post (http://blogs.msdn.com/brian_jones/archive/2005/06/23/432018.aspx), if you open the file in a text editor, you'll see that it's pretty hard to read because we don't "pretty print" the file. You'll either need to remove the PI and open in IE, or open in an editor that has pretty printing built in (like FrontPage or Visual Studio). Maybe this is what has confused people into thinking there is a binary key? It's obviously not, though; it's just a way of laying out the XML to make the files more efficient.
- Objects: Word allows you to embed images, video, ActiveX controls, and OLE objects. These are all foreign to Word though, and when they are stored they need to be stored in their native formats. In 2003, we base64 encode them and store them in a binary tag. For Office 12, since we are using a ZIP package to store the files, we can just keep them as separate binary files within the ZIP (so a JPEG will just be a separate .jpg file in the ZIP). I really doubt this is the "binary key", since it really isn't even owned by us. Any format you create will need to store foreign objects, unless the application decides it's not going to support those features.
- Handful of obscure legacy features: There are a handful of obscure legacy features where certain pieces of the data are stored in a <binData> tag. We did this because of resource constraints when building the original XML file. An example of this would be some of our old legacy fields. We just weren't able to get to them, but we only did this for features where the use of them was very, very low. For Office 12, we've done the extra work so that even these features are now represented in XML. So if this is the binary key, then it will go away, but I highly doubt this would be the "binary key" people talk about as it occurs so rarely.
- VB Project: If you have code that is embedded in your document, that would also be stored as a binary object. This is an area I can understand that some folks might want to see stored as text, but we didn't go that route. In fact, we're moving away from storing code directly in the files in general, as I've already discussed in an earlier blog post (http://blogs.msdn.com/brian_jones/archive/2005/07/12/438262.aspx). The default format won't even have these objects so if this is the "binary key", it's going away. I highly doubt this is the "binary key" though as it has nothing to do with the document itself, just with solutions that run on top of the document, and the majority of documents out there don't have it anyway.
- Namespaces: Someone commented in my last post that the Office files have namespaces in them and if you change the value of the namespace the file behaves in a goofy way. Anyone familiar with XML knows what's going on here, but I understand that a number of you are new to XML. Namespaces are a very important part of the XML standard. They allow you to identify what type of XML you are working with. If it weren't for namespaces, it would be very difficult to work with XML files unless you had control over everything (their creation, storage, and consumption). The point raised here though is really an interesting one. Notice that if you change the namespaces around, Word can still open the file. This is because we support opening all XML files as a result of our custom defined schema support. You can take a WordML file and add your own XML tags in your own namespace, and we'll support opening them, validating them while the file is being edited, and saving them out. The namespace issue obviously isn't a "binary key", and it's one of the major building blocks of XML.
- Byte Order Mark (Unicode) [10/18/2005 - I added this one after it was brought to my attention by Dare] - Dare points out that it could be that some folks unfamiliar with Unicode are having problems with the Unicode BOM:
I wouldn't be surprised if the alleged "binary key" was just a byte order mark which caused problems when trying to process the XML file using non-Unicode savvy tools. I suspect some of the ODF folks who had problems with the XML file would get some use out of Sam Ruby's Just Use XML talk at this year's XML 2005 conference.
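If you'd rather poke at a file with a script than with an editor, here is a minimal Python sketch covering three of the items above: the byte order mark, the mso-application processing instruction, and the lack of pretty printing. The sample document is a hypothetical, heavily trimmed stand-in that I built in memory (a real file saved from Word 2003 has far more in it), and the helper names are made up for illustration. It only uses the standard library.

```python
import codecs
from xml.dom import minidom

# Hypothetical, heavily trimmed stand-in for a file saved as XML from Word 2003.
# Like the real thing, it is written on a single line and starts with a BOM.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<?mso-application progid="Word.Document"?>'
    f'<w:wordDocument xmlns:w="{W_NS}">'
    '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    '</w:wordDocument>'
)
RAW = codecs.BOM_UTF8 + SAMPLE.encode("utf-8")


def strip_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def find_progid_pi(doc):
    """Return the mso-application processing instruction node, or None."""
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"):
            return node
    return None


def pretty_without_pi(data: bytes) -> str:
    """Parse the bytes, remove the PI, and return indented XML text."""
    doc = minidom.parseString(strip_bom(data))
    pi = find_progid_pi(doc)
    if pi is not None:
        doc.removeChild(pi)
    return doc.toprettyxml(indent="  ")


if __name__ == "__main__":
    print(pretty_without_pi(RAW))
```

Nothing here needs a key of any kind: a stock XML parser reads the file, and the only "special" thing at the top is a plain-text processing instruction that can be deleted without changing the document content.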
My theory is that the "binary key" idea came about because someone just took a quick look at the file format without really doing their homework. For example, if you combine the second and third items above (pretty printing and objects), you would probably see a binary blob in most files that appears to be at the top. The reason is that if the file has an image or some other kind of object in it, then, since the file isn't pretty printed, the first line break comes from the base64 encoded data. That makes it look like there is some binary data right at the top. The weird thing here, though, is that some of the folks who were saying there is a binary key supposedly spent a lot of time looking at all kinds of document formats and investigating them in order to create a universal file format capable of representing every document that ever existed. I would think they would have looked a little closer and seen that there really isn't a "binary key" to unlock the documents. They are already unlocked.
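To make that theory concrete, here is a small Python sketch. It assumes a made-up, single-line document whose only line breaks come from a base64 payload, and it uses Python's base64.encodebytes (which wraps its output at 76 characters) as a stand-in for however the real file wraps its encoded data. The payload, the wordml://example.bin name, and the helper functions are all hypothetical.

```python
import base64
import re

# Stand-in for the bytes of an embedded image or other object.
PAYLOAD = bytes(range(256))
# encodebytes() inserts a newline after every 76 output characters.
ENCODED = base64.encodebytes(PAYLOAD).decode("ascii")

# Hypothetical trimmed document: everything except the payload is on one line.
SAMPLE = (
    '<?xml version="1.0"?><?mso-application progid="Word.Document"?>'
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    '<w:body><w:p><w:r><w:pict>'
    '<w:binData w:name="wordml://example.bin">' + ENCODED + '</w:binData>'
    '</w:pict></w:r></w:p></w:body></w:wordDocument>'
)

BINDATA_RE = re.compile(r"<w:binData[^>]*>(.*?)</w:binData>", re.DOTALL)


def first_line_break_is_in_bindata(text: str) -> bool:
    """True if the first newline in the text falls inside the binData payload."""
    first_newline = text.find("\n")
    match = BINDATA_RE.search(text)
    if first_newline == -1 or match is None:
        return False
    return match.start(1) <= first_newline < match.end(1)


def extract_bindata(text: str) -> bytes:
    """Decode the first binData payload back to the original bytes."""
    match = BINDATA_RE.search(text)
    if match is None:
        raise ValueError("no binData element found")
    # b64decode() ignores the embedded newlines by default.
    return base64.b64decode(match.group(1))


if __name__ == "__main__":
    print(first_line_break_is_in_bindata(SAMPLE))
    print(extract_bindata(SAMPLE) == PAYLOAD)
```

In a file like this, an editor that doesn't wrap long lines shows the markup as one enormous first line followed by rows of base64 text, which is easy to mistake for a binary blob near the top. It is just the embedded object, and it decodes back to the original bytes with any base64 library.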
To learn more, go check out the documentation. It's up there for free and anyone can download it. Or play around with the free labs. Or read my "Intro to Word XML" posts. The easiest way for us to have good discussions on these topics is for everyone to actually look into it themselves rather than relying on random news stories. I understand not everyone has the time to look into it, but unfortunately there is a lot of false information out there.