Columnstore Index- Should I partition my columnstore Index?
Table partitioning is a perfect way to manage large tables especially in the context DataWarehouse (DW) which can be very large (think TBs). The table partitioning can help both in managing large amount of dara as well as improving the query performance by eliminating partitions that are not required. For example, if a FACT table stores SALES information for an organization, a common query pattern will be to look at SALES data from the last week or last month or last quarter and so on. In this case, partitioning on say weekly boundary could be helpful as SQL Server will eliminate other partitions during query execution.
The interesting thing with columnstore index is that it is implicitly partitioned by rowgroups, an internal physical organization. Assume that incoming data is in date/time order, then each compressed rowgroups are indirectly ordered by date/time. The columnstore index maintains Min/Max for most columns within a rowgroup. For the case here, the columnstore exactly knows the date/time range for all the rows in each rowgroup. Now, if you run a query to look at the sales data from the last quarter, SQL Server can find those rows efficiently by looking at the metadata for each rowgroup and eliminating the ones that are out of range. This begs the question if you should even consider partitioning a table with CCI as it is already implicitly partitioned. The short answer is 'Yes' as illustrated below
- If there are significant number of updates to compressed data, the implicit date/time ordering will begin to dilute as updated rows will be inserted into the delta store. For example, if you update a row from previous month, it will get deleted there and then inserted along with the latest inserts. So by partitioning the data, you can guarantee that all the data for a given week or month or quarter is in its respective partition.
- If you query the date/time data along with some other attribute. For example, if you have three product lines (P1, P2 and P3) and you are capturing SALES data based on date/time order for all three product lines. Now if you want to query the SALES of last quarter for product line P1, SQL Server will need to process/eliminate the data of P2 and P3. Normally, this is not an issue because columnstore can process the data really fast. However, if you very large number of rows for each product line, and your normal query pattern is to run analytics for a specific product line, then partitioning your columnstore index on Product Line can be beneficial
- Index maintenance can be targeted at partition level. Similarly, other benefits like marking FileGroups containing older partitions as Read Only allows you to control the size of your incremental backups.
- You can compress older partitions with COLUMNSTORE_ARCHIVE option that gives you an additional 30% in storage savings at the cost of slower query performance which may be ok as older partitions may not be queries as often.