Does tag size matter?
Surprisingly, I haven't seen much information out there discussing the performance impact of XML tag name lengths (i.e., using "<c>" instead of "<table-cell>"). My last post about some of the design goals behind SpreadsheetML raised some questions from folks about where the time is actually spent when loading an XML file. There are a ton of things that we do to improve performance when opening and saving Open XML files. The move to using these formats as the default for Office 2007 meant we had to get really serious about how the formats were constructed so they could open efficiently. I'd be really interested to hear from other people who've worked on XML formats if they've had similar experiences.
For a lot of people who have worked with XML, the parsing of tags is rarely more than a percent or two of the overall load and save times. With office document formats, that's not always the case. Just to give you a bit of an idea about the scale these documents can get to, check out the article by George Ou about performance of spreadsheet files: http://blogs.techrepublic.com.com/Ou/?p=120
In that article, the spreadsheet he uses is pretty big, but we have definitely seen much larger and more complex spreadsheets, so don't assume that it's a fringe case. If you save the spreadsheet from that article using the new Open XML format, you get the following:
- File size compressed - 16 megabytes (131 megs uncompressed)
- Number of XML elements - 7,042,830
- Number of attributes - 9,639,043
So, as you can see, that's a lot of XML to parse over. As we looked at files like this, we saw that we absolutely needed to find different ways to optimize the formats to make them faster. Using shorter tag names was one of the first obvious ones.
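If you want to get numbers like these for your own files, here is a rough sketch in Python. It is not how Office does it; it just walks every XML part of a ZIP package, streams each part through a SAX parser, and counts start tags and attributes. The names `CountingHandler` and `count_package` are made up for this example, and it assumes the parts of interest end in ".xml" or ".rels". Note that with Python's default SAX settings, namespace declarations (xmlns) are counted as attributes too.

```python
import zipfile
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Counts start tags and attributes as SAX events arrive."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.attributes = 0

    def startElement(self, name, attrs):
        self.elements += 1
        self.attributes += len(attrs)


def count_package(path_or_file):
    """Return (elements, attributes) summed over every XML part in a ZIP package."""
    handler = CountingHandler()
    with zipfile.ZipFile(path_or_file) as package:
        for part_name in package.namelist():
            # Assumption: only parts with these extensions contain XML.
            if not part_name.endswith((".xml", ".rels")):
                continue
            # ZipFile.open returns a file-like object that inflates on demand,
            # so the parser pulls decompressed bytes as it needs them.
            with package.open(part_name) as part:
                xml.sax.parse(part, handler)
    return handler.elements, handler.attributes
```

Calling `count_package("book.xlsx")` on a saved workbook (the file name is just a placeholder) returns the two totals as a tuple.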
Impact of long tag names vs. short tag names
In the profiles that we've looked at over the years, we've seen that simply using shorter tag names can significantly improve the performance depending on the type of file. For those of you really interested, you should do your own profiles and let me know what you find out. Remember that for an application like Excel, we're talking about the potential for millions of XML tags to represent a rich spreadsheet. Let's look at a couple issues now:
- Compression - Since we use compression (ZIP), there isn't much of a difference in file size, since a long tag name and a short tag name will compress to pretty much the same size. This also means that time spent hitting the hard drive or transmitting over the wire will be about equal. When you do the actual compression though, if the tag names are longer, then there are just a lot more bits you need to read through to figure out the compression. These bits may be in memory, they may be on disk, but either way you need to deal with them at some point when compressing them. The same is the case for decompression: you will end up generating a lot more bits if the tag names are longer, even if the compressed bits are significantly smaller.
- Parsing - We of course use a SAX parser to parse our XML, and in most cases we also use a Trie lookup, which is super fast (in other cases we use a hash). When using the hash, we of course still have to store the full tag name for a final comparison, because we don't have a known, bounded set of element values coming in. Not only do we allow for full extensibility, but we also have to allow for the fact that people might make a mistake when generating the files, and we need to be able to catch those errors. For those familiar with hashing, you'll know that unless you are guaranteed a perfect hash, you also need a second, straight string compare to ensure it was a proper match. So for both memory and processing time, tag length has a direct impact. The time taken for a Trie lookup is directly proportional to the tag length. For a hash, it really depends on how you do your hash and how you do your verification.
- One drawback to the Trie is that it's more memory intensive. In most cases we make that tradeoff though, because it's super fast. You can really see though how tag names have an impact all over the place. There are memory issues, parsing times, compression times, etc.
- Streamed decompression and parsing - As the XML part is being decompressed, we stream it to the parser. SAX is connected directly to the part IStream which then decompresses on demand. On the compression side, it's probably interesting to point out that we don't always compress each XML part as a whole. Instead we keep a single deflate stream and flush the compressed data when the "current" part being written out changes. For most parts we write them out as a whole, but there are some parts where that isn't the case.
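To make the Trie and hash points a bit more concrete, here is a minimal sketch in Python. This is an illustration only, not the lookup code Office uses: `TagTrie`, `build_table`, and `hash_lookup` are hypothetical names, and the nested-dictionary trie is just the simplest thing that shows the idea. The trie takes one step per character of the tag name, and the hash lookup still needs a full string compare to confirm the match.

```python
class TagTrie:
    """Maps known tag names to integer tokens, one character per step."""

    def __init__(self, names):
        self.root = {}
        for token, name in enumerate(names):
            node = self.root
            for ch in name:
                node = node.setdefault(ch, {})
            node[""] = token  # the empty key marks the end of a complete name

    def lookup(self, name):
        """Return (token, steps); token is None for an unknown tag."""
        node = self.root
        steps = 0
        for ch in name:
            steps += 1
            node = node.get(ch)
            if node is None:
                return None, steps
        return node.get(""), steps


def build_table(names, size=8):
    """Build a small chained hash table of (name, token) pairs."""
    table = [[] for _ in range(size)]
    for token, name in enumerate(names):
        table[hash(name) % size].append((name, token))
    return table


def hash_lookup(table, name):
    """Find the bucket, then verify with a straight string compare."""
    for known, token in table[hash(name) % len(table)]:
        if known == name:  # verification touches the stored full name
            return token
    return None
```

With `TagTrie(["row", "c", "v"])`, looking up "c" takes a single step, while a trie built from names like "table:table-cell" needs one step for each of that name's 16 characters. The stored full names in the hash table are the memory cost the Parsing bullet mentions.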
I know that for a lot of people who've played around with XML, the parsing isn't really something that you would think of as being a major part of the file load times. This is not the case with office document formats, and especially spreadsheet documents.
With the latest SpreadsheetML design, we've seen that the XML parsing alone (not including our parsing of numbers, refs, and formulas) often accounts for 10% to 40% of the entire file load. That's just the time it takes to read each tag and each attribute. This shouldn't be too surprising though, as the internal memory structures for a spreadsheet application should be fairly similar to the shapes that are used in the format design. A big piece is just reading the XML in and interpreting what the tags are.
SpreadsheetML was designed so that for any tag or attribute that would appear frequently, we used super short tag names. We also established naming conventions for the abbreviations shared across all three formats (so that they become easier to interpret as you work with them). Elements that may only appear once in a file often have longer tag names, since their size doesn't have nearly the same impact. Right now, most of our frequently used tag names are no more than a couple characters in length. Let's imagine instead we decided to use longer more descriptive names so each tag was around 5 times larger (you can use the older SpreadsheetML or the OpenDocument format for examples of longer tag names):
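Short tag example:
<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>
<row><c><v>4</v></c><c><v>5</v></c><c><v>6</v></c></row>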
Long tag example:
<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="1"><text:p>1</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="2"><text:p>2</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="3"> <text:p>3</text:p></table:table-cell></table:table-row>
<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="4"><text:p>4</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="5"><text:p>5</text:p></table:table-cell> <table:table-cell office:value-type="float" office:value="6"><text:p>6</text:p></table:table-cell></table:table-row>
For that example, the top one is using SpreadsheetML from the Ecma Office Open XML format. The second example is using the OpenDocument format. There is another optimization that SpreadsheetML does where you can optionally write out the column and row information on cells, but I removed that since it's actually another performance optimization that I'd like to discuss in a separate future post (and as I said it's optional).
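If you'd like to try the compression point from earlier on markup shaped like these two examples, here is a small sketch in Python using zlib (the same deflate algorithm ZIP uses). Everything in it is an assumption made for illustration: `build_rows` and `sizes` are made-up names, the short form is a SpreadsheetML-style `<row><c><v>` fragment without row and column references, and the long form copies the OpenDocument-style markup shown above. It generates a number of rows in each style and reports the raw and compressed sizes.

```python
import zlib


def build_rows(row_count, long_names):
    """Generate rows of three numeric cells in the short or long tag style."""
    rows = []
    for r in range(row_count):
        values = [r * 3 + 1, r * 3 + 2, r * 3 + 3]
        if long_names:
            cells = "".join(
                f'<table:table-cell office:value-type="float" office:value="{v}">'
                f"<text:p>{v}</text:p></table:table-cell>"
                for v in values
            )
            rows.append(
                f'<table:table-row table:style-name="ro1">{cells}</table:table-row>'
            )
        else:
            cells = "".join(f"<c><v>{v}</v></c>" for v in values)
            rows.append(f"<row>{cells}</row>")
    return "\n".join(rows).encode("utf-8")


def sizes(data):
    """Return (raw_bytes, compressed_bytes) for one generated document."""
    return len(data), len(zlib.compress(data, 6))


if __name__ == "__main__":
    for label, long_names in (("short", False), ("long", True)):
        raw, packed = sizes(build_rows(1000, long_names))
        print(f"{label}: {raw} bytes raw, {packed} bytes compressed")
```

The thing to look at is how much the gap between the two styles shrinks once they are compressed, keeping in mind that the compressor and decompressor still have to push every raw byte through. On synthetic data like this the long form may well stay somewhat larger after compression, partly because the OpenDocument-style cell writes each value twice, so treat it as a way to measure rather than as a result.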
Tag length impact
Let's imagine we have that file I mentioned earlier with 7 million elements and 10 million attributes. If on average each attribute and element name is about 2 characters long, then you have 34 megabytes of data to parse (which is a ton), just in element names and attribute names. If instead though, the average length of an attribute or element name were more like 10 characters, then you're talking about 170 megabytes. That is a very significant difference.
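The arithmetic behind those two figures is simple enough to write down. This sketch in Python assumes one byte per character and counts each name once (so closing tags are not included), which is how the 34 and 170 megabyte figures work out; `name_bytes` is just a name picked for the example.

```python
def name_bytes(elements, attributes, avg_name_length, bytes_per_char=1):
    """Bytes spent purely on element and attribute names."""
    return (elements + attributes) * avg_name_length * bytes_per_char


ELEMENTS = 7_000_000     # rounded from the 7,042,830 elements in the sample file
ATTRIBUTES = 10_000_000  # rounded from the 9,639,043 attributes

for avg in (2, 10):
    megabytes = name_bytes(ELEMENTS, ATTRIBUTES, avg) / 1_000_000
    print(f"average name length {avg}: {megabytes:.0f} MB")
# prints:
# average name length 2: 34 MB
# average name length 10: 170 MB
```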
This isn't rocket science of course. Most folks I've talked to agree that it's important to keep tag names short, especially in structures that are highly repetitive. In SpreadsheetML, you'll see that a lot of the element names actually are pretty long and descriptive, but only where they appear in a few places and won't be much of a burden. Any element that can occur with high frequency is definitely kept to a minimum length.
Optimize based on your design goals
Remember, we're not talking about creating a format for hobbyists. This format is supposed to be used by everyone, and most of those folks aren't going to be happy with feature loss and performance degradation just so they can save out as XML (the average user doesn't care about XML). The original SpreadsheetML from Office XP was actually more like a hobbyist format, and as a result, it was really easy to develop against, but it was bloated and slow. I wish that we didn't have to worry so much about performance, but if you really expect these formats to be used by everyone, then you have to take the training wheels off. That's why the standardization in Ecma is so important though, so that we can ensure that everything is fully documented and all the information is there to allow you to develop against them.
I'll talk more about the other parts of the design that are optimized around performance in future posts. This was one that people had some questions on though so I just wanted to clarify and make sure there wasn't any more confusion. If you've looked at similar issues in your file format designs and found other interesting things like this, I'd love to hear about them!