Details on page compression (page-dictionary)

Article
01/18/2008

In the previous blog, I described column-prefix compression that is done as part of PAGE compression. The other component of PAGE compression is the page-dictionary. As the name indicates, this dictionary is specific to a page and the page can be data page or a leaf-index page. The non-leaf pages of an index are ROW compressed even if you choose to compress the index with PAGE compression option. It was a design decision to reduce the CPU impact (it is more expensive to decompress a row with PAGE compression compared to ROW compression) of index tree traversal but on the flip side the height of the index tree could be smaller depending on the data in the index key. Now back to page-dictionary, the topic of this blog.

The dictionary stores common values (same ‘byte’ pattern) across all columns and rows on the page and then each column refers to the dictionary for the value. The entries in page dictionary are stored as an array with 0 based index and can be referenced as 0, 1…and so on. If you have many occurrences of repeating values, you can get proportional compression savings. You may wonder how it is different than column-prefix compression. The key differences are

· Unlike column-prefix, a dictionary value can be referenced by any column of any row on the page. For example, if you have a byte pattern ‘0x11223344 in col1 of row-1 and col2 of row-2, they both can refer to the same dictionary value. You may recall that in column-prefix we look for common prefix in the same column across all rows on the same page.

· A dictionary entry is created only if the value is repeats in its entirety in two or more columns. You may wonder

o What happens if we have the repeating value in the same column across multiple rows on the page? Don’t we create a column-prefix for that? The answer is most likely ‘yes’ unless there is another column-prefix value that provides more space savings. The rule is that we first apply column-prefix compression on the page and then apply page-dictionary compression in that order.

o How do we cleanup dictionary when one of those rows gets deleted? Or when one or more entries in page-dictionary are no longer referenced? Hold that thought for a future blog.

Now let us build on the example that we discussed in the previous blog in column-prefix compression. Here is how the page looked after column-prefix compression.

__________________________________
| HEADER |

|___________________________________
|___________________________________|
| 2BBB | 3BC | <null> |

|_____________|_____________|_______|

| 3BC | 0BBBB | <null> |

|_____________|_____________|_______|

| <null> | <null> | 0BBBB |

|_____________|_____________|_______|

You will notice that the (row1, col2) and (row2, col1) have the same value 3BC. Similarly, (row2, col2) and (ro23, col3) have the same value ‘0BBBB’. These are the only repeating values on the page. Note, we are not considering <null> as they don’t take any storage. The page dictionary is created and stored along with the anchor record and is jointly referred to as ‘Compression Information’ or CI in short but not to be confused with clustered index.

So the modified page after applying page-dictionary compression will look as follows:

__________________________________
| HEADER |

|3BC|0BBBB | | <<-----dictionary

|___________________________________
| |

|___________________________________|
| 2BBB | 0 | <null> |

|_____________|_____________|_______|

|0 | 1 | <null> |

|_____________|_____________|________|

| <null> | <null> | 1 |

|_____________|_____________|________|

(row2, col1) and (row1, col2) have a value 0, the index into the page-dictionary entry with value 3BC and (row2, col2) and (row3, col3) have a value 1, the index into the page-dictionary entry with value 0BBBB. When the value (row2, col1) is to be retrieved as part of say SELECT statement, the SQL Server will perform the following steps

· Access the column value as it is stored in the row2. It retrieves the value 0.

· Looking at the metadata in the row, SQL Server knows that 0 does not mean that the column has a value 0 but that it needs to access first entry in the dictionary. So it retrieves 3BC.

· By looking at the row metadata, SQL Server knows that the value 3 represents the number of bytes to pick from the column prefix in the anchor row, so it computes the value AAABC and returns it.

As you can see that with PAGE compression, we are able to reduce the redundant information on the page significantly but at the cost of extra CPU cycles. Note that the space savings achieved will depend on the data. In fact, two different pages of the same table, you may see different space savings and can even be negative (i.e. the size actually increased after compression). This brings up some interesting questions as follows

· Should I enable PAGE compression? The answer is to use the stored procedure sp_estimate_data_compression_savings to estimate the space savings. Clearly, if there is no or insignificant space savings with PAGE compression, there is no reason to enable it.

· PAGE compression does provide significant space savings on the table but a subset of pages have no data redundancy that can be exploited by PAGE compression. What does SQL Server do in this case? The answer is that SQL Server evaluates space savings with PAGE compression on each page and only-if the space savings is greater than an internally defined threshold, it is compressed. So a likely case is that your table with PAGE compression option will have a subset of pages that are PAGE compressed and others not.

In summary of this blog, let me emphasize that the space savings is data dependent and that the space savings come at the cost of additional CPU incurred when retrieving or manipulating compressed data. In the next blog, I will describe common scenarios for enabling compression to achieve a good balance between space savings and CPU overhead. As before, please feel free to send me your feedback/comments

Details on page compression (page-dictionary)

Additional resources