Dealing with large row sets in the tabular designer
Amir Netz’s two billion and seven rows demo at last year’s BI conference was not just demoware. Unlike PowerPivot, where the amount of data you can load is bound by the workbook size, the amount of data you can load into tabular models is limited only by server capacity. This is true at both design time and at deployment time.
If you have a workspace database server that’s got enough RAM to hold and manipulate your billions of rows, you really can sort, filter, add columns, etc on your tabular model from the designer today. This is one of the big advantages of the out of proc server architecture. You can use a standard developer desktop, connect to a beefy workspace database server, and still get decent responsiveness for a large row set. The size of the metadata (number of tables, columns, etc) has a much greater impact on the performance of the designer than the size of the data.
There are a couple of things to keep in mind when working with large row sets. First, you should ensure that your Data Backup property is set to “Do not backup to disk” before dealing with models this size. This is the default, so don’t worry too much here. Data backups are useful sometimes (I’ll explain later), but the gains to be had by backing up are lost by the longer save times and higher disk space requirements on developer workstations.
Also, the tabular designer works against the metadata and the data, and there is no way to separate the two. That means the mandatory first step for any tabular project is to import from a data source, which processes a bunch of data. Although working with the data is fast enough, processing the data still takes time.
If you would like to cut your data set down to size to reduce processing time, you can do one or more of the following:
- Option 1: Create a separate database that contains a subset of the production data. Use this subset of data in the development environment. When you deploy to production, use the deployment wizard or some other deployment script to change the connection strings before you go.
- Option 2: Create views that contain a subset of the production data. Import from this view. Again, when you deploy, adjust the connections used in your model
- Option 3: When using the import wizard, import only the data you want in the first partition for your model. To do this, adjust your SQL query to limit the number of rows returned or apply a filter on the Preview and Filter page. You can then add more partitions (either in BIDS before deployment or in SSMS after deployment) and load data only after you have deployed your model.
That last option, importing the first partition only and then adding more later, is a very simple and powerful way to create a subset of your data. I will go through this step-by-step in my next post. It is definitely not the only way to work with large data, but it is perhaps the most accessible way to get up and running quickly with a large row set.
[Edit 9/1 - the import subset using partitions post is here]