Data compression techniques and trade-offs
OK, now that we have sort of agreed (http://blogs.msdn.com/sqlserverstorageengine/archive/2007/09/30/data-compression-why-do-we-need-it.aspx) that data compression is a good thing, you may wonder how SQL Server compresses the data and what this compression means for your data and your workload.
If you are a theory nut or crave mathematics, you can browse the web or read the relevant books and find many fine techniques for compressing data. Note, however, that not all compression techniques are applicable to databases. I finished my engineering degree almost 30 years ago and, thanks to the computer science profession, the hours spent coding, testing and developing products, and now my work as a program manager, I have lost touch (my excuse) with physics and mathematics, the subjects I loved most in college.
In general, there are two kinds of compression:
· Lossy: Some of the original data is lost and cannot be recovered when the data is uncompressed. This loss is accepted in exchange for faster or better compression. It is more suitable for images, music and video files, where some loss in fidelity is undetectable to the human eye or ear. One example is JPEG. Clearly, this is not suitable for documents or for databases.
· Lossless: This allows the original data to be reconstructed exactly from the compressed format. This is what we need for data in the database. There are two common techniques here. The first technique encodes the input data in such a way that more commonly occurring data is coded using fewer bits. One example of such a technique is Huffman coding. The second technique is dictionary based. One example in this category is the LZ (Lempel-Ziv) algorithm. There are many versions of it, but the central idea is to build a dictionary of commonly occurring symbols in a block and then reference them using a pointer. A commonly occurring symbol therefore needs to be stored only once, thereby saving space. SQL Server uses its own proprietary implementation of a dictionary-based algorithm.
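To make the dictionary idea a little more concrete, here is a minimal Python sketch. It is only a toy illustration of the general approach and not SQL Server's proprietary algorithm; the function names `dictionary_encode` and `dictionary_decode` and the sample block of city names are made up for this example.

```python
from collections import Counter


def dictionary_encode(values):
    """Store each distinct value once and replace every occurrence
    with a small integer reference into that dictionary."""
    # Most frequent values come first, so they get the smallest references.
    counts = Counter(values)
    dictionary = [value for value, _ in counts.most_common()]
    index_of = {value: i for i, value in enumerate(dictionary)}
    references = [index_of[value] for value in values]
    return dictionary, references


def dictionary_decode(dictionary, references):
    """Rebuild the original values exactly (lossless)."""
    return [dictionary[i] for i in references]


# Hypothetical block of column values with many repeats.
block = ["Seattle", "Redmond", "Seattle", "Seattle",
         "Bellevue", "Redmond", "Seattle"]

dictionary, references = dictionary_encode(block)
restored = dictionary_decode(dictionary, references)

print(dictionary)         # each distinct value is stored only once
print(references)         # small integers stand in for the repeated strings
print(restored == block)  # the round trip loses nothing
```

The more often a value repeats within a block, the more the one-time cost of the dictionary entry pays off. A real implementation also has to account for the space taken by the references and the dictionary itself.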
Let us first get two important points out of the way.
1. The goal of compression in a database is NOT to get the maximum compression possible but to find the right balance between the compression achieved and the cost of compressing and decompressing the data. It does you no good if SQL Server can compress the data by 90% but the CPU cost of SELECT and DML operations becomes unacceptably high. You often hear that one database system compresses the data by 60% while another compresses the same data by 40%. The important question to ask is: what is the impact on the workload (i.e. CPU)? If the impact on the workload is unacceptable, customers will not use data compression in spite of the better compression ratio. For this reason, the SQL Server team evaluated various compression techniques, estimated the compression we could achieve for different data warehouse (DW) databases, estimated the CPU impact of compression and decompression, and then zeroed in on the types of compression delivered in SQL Server 2008. This is not to say that we won’t enhance compression in the future. We will always look for opportunities to improve our data compression offering, so please feel free to contact us with your ideas and the kinds of data challenges you are facing.
2. The compression achieved depends on the data distribution and the schema. For example, if all the integer values that appear in a column fit in 1 byte but you have declared the column type as INT (4 bytes), you can potentially save 3 bytes per value (i.e. 75%), not counting any per-column overhead in the row. On the other hand, if you had declared the column as TINYINT (i.e. 1 byte), there are no savings from compression. So the same data with a different schema will give you a different perception of the compression. The same holds for the data distribution: the same schema with different values will compress differently.
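The arithmetic in the second point can be sketched in a few lines of Python. This is a rough estimate only: it assumes non-negative integer values, ignores any per-row or per-column overhead, and does not describe how SQL Server actually stores compressed values. The helper names `min_bytes` and `estimated_saving` are hypothetical.

```python
# Fixed storage sizes, in bytes, of the SQL Server integer types.
DECLARED_BYTES = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}


def min_bytes(values):
    """Bytes needed to hold the largest value in a column of
    non-negative integers (at least 1 byte)."""
    largest = max(values)
    return max(1, (largest.bit_length() + 7) // 8)


def estimated_saving(values, declared_type):
    """Fraction of the declared size that could be saved if every value
    were stored in the smallest sufficient number of bytes."""
    declared = DECLARED_BYTES[declared_type]
    needed = min(min_bytes(values), declared)
    return 1 - needed / declared


# Hypothetical column in which every value fits in 1 byte.
column = [3, 17, 42, 200, 255]

print(estimated_saving(column, "int"))      # 0.75: 3 of 4 bytes saved
print(estimated_saving(column, "tinyint"))  # 0.0: nothing left to save
```

The data is identical in both calls; only the declared type changes, which is why a compression percentage quoted without the schema tells you very little.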
In the next blog post, I will provide more details on the types of compression in SQL Server 2008.