XML: Typed and Untyped
Occasionally we hear from people who are surprised to find that their XML data uses more space when typed than when untyped. In general, this is to be expected. Typed XML has some advantages over untyped XML, namely smarter query plans and the ability to constrain user input, but size usually isn't one of them. In addition to storing all the same markup information needed for untyped XML, typed XML also has type annotations. Storing values in their binary forms rather than as Unicode strings does offer some opportunities for savings, but these savings usually aren't enough to make up for the extra space needed for type annotations. The space requirements for typed values in bytes (excluding markup) are as follows:
- xs:boolean - 1 byte
- xs:decimal - 17 bytes
- xs:float - 4 bytes
- xs:double - 8 bytes
- xs:date/xs:time/xs:dateTime - 8 bytes
- xs:hexBinary - 1 byte per 2 characters of input
- xs:base64Binary - 3 bytes per 4 characters of input
Note that all types derived from xs:decimal (xs:integer, xs:long, xs:byte, etc.) take 17 bytes. All other types, as well as all text nodes in untyped XML, are stored as Unicode strings, and use two bytes per character. The biggest opportunities for saving are with binary types (xs:hexBinary values takes up four times as much space when untyped), and to a lesser extent xs:dateTime.
The size of the type annotation varies. For attributes and elements with complex content, the annotation is seven bytes: One for the extension token (indicating an extension of the standard binary XML format), one for the number of bytes in the extension, one for a set of flags, and four for the type ID.
For elements with simple content, there are two type annotations: one for the element itself, and one for the value. The annotation for the value is the same as the annotation for an attribute: seven bytes. The annotation for the element itself is 11 bytes, the the first seven being the same as for attributes and values, and the remaining four being a pointer to the value. The pointer is necessary because an arbitrary number of attributes may come between the element and its value. This gives a total of 18 bytes for elements with simple content.
So an element typed as xs:decimal will require 35 bytes between the value and type annotation. Only if the value is at least 18 digits long does this give any savings over the untyped representation. On the other hand, an attribute of type xs:dateTime needs only 15 bytes for its value and annotation if typed, whereas its string representation would require 40-58 bytes. While it's possible in principle for a document to be smaller as typed XML than as untyped XML, space savings should not generally be expected without a specific reason, such as use of large binary values or many dateTime values.