Processing tabular models 101
With a new storage engine comes new processing options. Here are some fast facts about processing tabular models, in Q&A format.
What are my processing options?
The following table shows the available processing options for each object.
| Object | Available Processing Options |
| --- | --- |
| Database | Process Clear, Process Default, Process Defrag*, Process Full, Process Recalc |
| Table | Process Clear, Process Data, Process Default, Process Defrag, Process Full |
| Partition | Process Add, Process Clear, Process Data, Process Default, Process Full |

\*See the usage note in the Process Defrag section.
What do the processing options mean?
The following table describes each processing option.
| Processing Option | Description |
| --- | --- |
| Process Add | Adds new rows to a partition. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated. |
| Process Clear | Drops all the data in a database, table, or partition. |
| Process Data | Loads data into a partition or table. |
| Process Default | Loads data into unprocessed partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated. |
| Process Defrag | Optimizes the table dictionary (an internal engine structure) for a given table or for all tables in the database*. This operation removes all dictionary entries that no longer exist in the data set and rebuilds the partition data based on the new dictionaries. |
| Process Full | Loads data into all selected partitions or tables. Any affected calculated columns, relationships, user hierarchies, or internal engine structures (except table dictionaries) are recalculated. |
| Process Recalc | For all tables in the database, recalculates calculated columns, rebuilds relationships, rebuilds user hierarchies, and rebuilds other internal engine structures. Table dictionaries are not affected. |

\*See the usage note in the Process Defrag section.
When should I Process Recalc?
Any time after you do a Process Clear or Process Data, you must issue a Process Recalc. Also, after you merge partitions in script, you must perform a Process Recalc. Your model may be unqueryable if you skip the Process Recalc. Process Recalc is always the last thing you do, and you can batch up as many processing operations as you want before issuing that last Process Recalc.
You do not need to Process Recalc if you do a Process Add, Process Full, or Process Default. All of those commands recalculate the affected objects (and only the affected objects) for you. Also, you don’t need to issue a Process Recalc after you merge partitions through the SSMS UI, as the UI automatically executes a Process Recalc for you.
Keep in mind that you can only issue Process Recalc at the database level. This means that after a Process Data operation on a dimension table, you may be stuck recalculating a large fact table even though it was not impacted by the initial processing operation. If you want to limit the scope of recalculation, you might want to do a Process Full or Process Default instead of Process Data. However, there is a tradeoff: if you perform multiple Process Full or Process Default operations in a row, the engine performs the recalculation every time. So your choice is multiple small recalcs every time vs. a single large recalc at the end; choose wisely based on your specific processing scenario.
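In script, the "batch up, then one Process Recalc" pattern might look like the following XMLA sketch (for example, run from an SSMS XMLA query window). The database ID `SalesModel` and table ID `Customer` are hypothetical, and note that in Denali-era XMLA a tabular table is addressed as a dimension:

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Load data only; no recalculation happens yet -->
  <Process>
    <Type>ProcessData</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
      <DimensionID>Customer</DimensionID>
    </Object>
  </Process>
  <!-- One recalc at the end covers everything affected above -->
  <Process>
    <Type>ProcessRecalc</Type>
    <Object>
      <DatabaseID>SalesModel</DatabaseID>
    </Object>
  </Process>
</Batch>
```

You can put as many Process Clear / Process Data commands as you like ahead of the single ProcessRecalc at the end of the batch.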
When should I Process Defrag?
If you often add or remove partitions from a table, or if you often reprocess individual partitions in a table, you should Process Defrag regularly but judiciously. Process Defrag may be an expensive operation for a large table.
If you Process Data on a table level, it is never necessary to do a Process Defrag, as nice clean dictionaries are built when you process the table.
Here is an example where performing a Process Defrag is helpful. Imagine you have a table with 12 partitions, containing a rolling window of monthly data for a year. When January 2012 comes, you drop the January 2011 partition and add a new one for January 2012. However, deleting the January 2011 partition does not delete the column values in the dictionary for the table, even though there may not be any rows in the table that reference these values. The dictionary is now too large. Issuing a Process Defrag on this table gets rid of the useless entries, thus saving space and improving performance.
Usage Note: You can Process Defrag a database, even though that option is not exposed in the UI. This operation has a side effect in Denali – the data for unprocessed tables is loaded as well. You need to be aware of this side effect before issuing a Process Defrag on a database.
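A table-level defrag, as in the rolling-window scenario above, might look like this XMLA sketch (the database ID `SalesModel` and table ID `Sales` are hypothetical; the table is addressed as a dimension):

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Rebuild the dictionary for one table, dropping entries
       no longer referenced by any partition -->
  <Type>ProcessDefrag</Type>
  <Object>
    <DatabaseID>SalesModel</DatabaseID>
    <DimensionID>Sales</DimensionID>
  </Object>
</Process>
```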
How do I perform incremental processing?
There are two ways to incrementally add data to a table:
- Create a new partition and then use Process Data to populate it. This new partition can then be merged with an existing partition if you like.
- Use Process Add to add data to the existing partition.
If you are using Process Add, you should keep in mind that there is no UI for adding a specific query binding to use for the Process Add operation. So to make sure you’re not adding duplicate rows, you need to do one of the following:
- Change the partition definition to a query that returns only the set of rows you want added by the incremental processing operation. This change can be temporary; after you have finished processing, you can change the query back without marking the partition as unprocessed. A processed partition with a dirty partition query is never reprocessed by a Process Default operation on its parent table or database, though issuing a Process Default on that specific partition will trigger reprocessing.
- If your partition definition is based on a view, make sure that the view on the original data source now returns just the set of rows you want returned by the incremental processing operation.
If you incrementally process with Process Data, you must Process Recalc on the database afterwards before moving on with your life (Process Add performs the recalculation for you). If you do this enough, you’ll want to Process Defrag too.
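After changing the partition query per the first workaround above, the incremental load itself is a Process Add on the partition. A hedged XMLA sketch, with hypothetical IDs (`SalesModel`, `Model`, `Sales`, `Sales Current`) standing in for your database, cube, measure group, and partition:

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Appends the rows returned by the partition's (temporarily
       modified) query; existing rows in the partition are kept -->
  <Type>ProcessAdd</Type>
  <Object>
    <DatabaseID>SalesModel</DatabaseID>
    <CubeID>Model</CubeID>
    <MeasureGroupID>Sales</MeasureGroupID>
    <PartitionID>Sales Current</PartitionID>
  </Object>
</Process>
```

In Denali-era XMLA, a tabular partition is addressed through cube and measure group IDs even though the tabular designer never shows those objects.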
Is parallel processing supported for Denali?
Parallel processing is supported for different tables, but not for different partitions within a table. This restriction is enforced by the engine when scheduling the processing job. You don’t get any notification of this restriction.
To clarify with an example, say you send the engine a parallel batch processing command. You ask the engine to process Table A with Partitions 1, 2, and 3 and Table B with Partitions 4, 5, and 6. The engine then schedules your processing job. It is possible for Partition 1 (from Table A) and Partition 4 (from Table B) to be processed in parallel. However, because processing is scheduled sequentially within a table, you will never see Partitions 1 and 2, or Partitions 1 and 3, processed in parallel.
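A parallel batch like the one described might be sketched in XMLA as follows (database and table IDs are hypothetical; tables are addressed as dimensions). The engine may run the two tables concurrently, but within each table the partitions are still processed one at a time:

```xml
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <!-- Table A and Table B are eligible to process in parallel;
         partitions inside each table remain sequential -->
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <DimensionID>Table A</DimensionID>
      </Object>
    </Process>
    <Process>
      <Type>ProcessData</Type>
      <Object>
        <DatabaseID>SalesModel</DatabaseID>
        <DimensionID>Table B</DimensionID>
      </Object>
    </Process>
  </Parallel>
</Batch>
```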
What does a processing error like “A duplicate attribute key has been found when processing” mean?
This error means you are trying to load rows containing duplicate values into a column that the engine requires to be unique.
There are a few situations in which you must have a unique column in a table:
- The table is in a relationship and it is acting as a lookup table
- There is a column in the table marked as a “Row Identifier” column
- You have sorted a column by another column in a table
- You have marked a table as a date table
In all of the above cases, you are not allowed to have duplicate rows in a table. Processing will fail if you try.
How do I automate processing?
See my post on managing tabular models for some automation options.
Are there any aggregations done at processing time?
No. Everything is done on the fly at query time.
Thank you Akshai Mirchandani and Edward Melomed for providing source material via email for this post.
Edited 9/30 to add the aggregation question
Edited 1/23/2012 - Correcting an error. Process Add leaves the model in a queryable state and doesn't require a Process Recalc.