Building an Azure Analysis Services Model on Top of Azure Blob Storage—Part 3
Part 2 finished with the deployment of a scalable Tabular 1400 model in Azure Analysis Services on top of Azure Blob storage. Part 3 continues the story by attempting to load up to 1 TB of TPC-DS source data into the model--hosted on the largest server that Azure Analysis Services offered at the time of this writing (an S9 with a capacity of 640 Query Processing Units and 400 GB of Cache). Can Azure Analysis Services import 1 TB of source data? How long would processing take and could it be accelerated? Let's find out!
For this part of the story, I work primarily in SQL Server Management Studio (SSMS). SSMS can connect directly to an Azure Analysis Services server by using the fully qualified server name. For this purpose, it supports Active Directory Universal Authentication as well as Active Directory Password Authentication. SSMS enables you to perform administrative actions against a deployed model, such as incremental data loads and other direct operations, through the user interface as well as through scripts. The main tasks include editing data sources, adding partitions, and processing tables. Recent releases of SSMS also include a DAX Query Editor, as introduced in "SSMS Improvements for Analysis Services in the April 2017 Release,?" which is especially convenient if you want to double-check row counts after processing or run other queries. For example, the following screenshot shows a straightforward DAX query to count the rows for each TPC-DS table after a full import.
But before SSMS can show any impressive row counts, it is necessary to get the TPC-DS data into the Tabular model. Initially, I had planned to do this in increments, but I was anxious to see if the full TPC-DS data set could be processed at all, so I decided to go all in at once with an attempt to import the full 1 TB of source data. This required modifying the existing partitions of the large tables in the deployed model (catalog_returns, catalog_sales, inventory, store_returns, store_sales, web_returns, and web_sales) to pull in all 100 data files per table. Accordingly, the ReadBlobData line in the source queries had to be changed from ReadBlobData(Source, "<blob container>", 1, 1) to ReadBlobData(Source, "<blob container>", 1, 100) . By right-clicking on each large table, selecting Partitions, and then clicking on the Edit button in the Partitions dialog box, this task was quickly accomplished. Next, I ran a Tabular Model Scripting Language (TMSL) script to process these seven tables in full, as the following screenshot illustrates.
Processing took roughly 21 hours to complete (see the script execution time in the lower right corner of the SSMS query window above). Certainly, not an impressive processing performance, but it was exciting to see that an S9 Azure Analysis Services server could take a 1 TB TPC-DS data set. The server overallocated about 25 GB of memory (a total of 425 GB), but processing succeeded. After a manual server restart in the Azure Portal to free up any unused memory, the server reallocated approximately 390 GB to load the model. The following graph shows the memory allocation on the server prior and after the restart.
Note that memory allocation is not necessarily equivalent to model size. Especially, the Intel Threading Building Blocks (Intel TBB) allocator might proactively allocate more memory than is strictly needed. The Intel TBB allocator is enabled by default on large Azure Analysis Services servers for best performance and scalability.
Perhaps a more detailed view of the memory consumption is available through the DISCOVER_OBJECT_MEMORY_USAGE schema rowset. Again, the numbers are not always exact, but they do provide a sufficient estimate. Kasper de Jonge published a useful workbook called BISMServerMemoryReport.xlsx that relies on the DISCOVER_OBJECT_MEMORY_USAGE rowset to analyze the memory consumption on an Analysis Services server at any desired level of detail. And thanks to the full compatibility and rather seamless exchangeability of Azure Analysis Services with SQL Server Analysis Services, it is straightforward to use Kasper's workbook to analyze the size of the TPC-DS tables and their columns on an Azure Analysis Services server, as in the screenshot below.
So, 1 TB of TPC-DS source data fit into a 350 GB Tabular model. This is not a sensational compression ratio, but the TPC-DS tables are rather wide and not optimized for column-based compression. Still, smaller models are easier to handle, so I looked for low-hanging fruit to reduce the model size and optimize the data import.
In terms of model size, the first and foremost optimization step is to eliminate unnecessary table columns from the model. As far as TPC-DS is concerned, unnecessary columns are those columns that are not referenced in any of the TPC-DS benchmark queries. Why import columns that aren't participating in any queries? A quick analysis of the benchmark queries revealed that there are quite a few unused columns in the large TPC-DS tables. Furthermore, the BISMServerMemoryReport.xlsx workbook showed that these unused columns consume about 60 GB in the model (see the following spreadsheet). Eliminating these columns would yield nice savings in terms of model size and therefore memory capacity.
To remove these unnecessary columns, I switched back to SSDT Tabular, deleted the columns one at a time by using Tabular Model Explorer (TME), and then redeployed the model with Processing Options set to Full so that SSDT would fully reprocess the model after deployment. Following the deployment, I continued in SSMS as before to update the source queries of the large tables so that the ReadBlobData function would again include all data files for the large tables, and then ran my TMSL processing script one more time.
As anticipated, the resulting model was about 60 GB smaller than before and the server would allocate about 75 GB less memory, as shown below. Note, however, that the processing time did not decrease because the data transfer still included the full 1 TB of source data. This is because the data files first need to be transferred before file parsing can be performed locally on the Azure Analysis Services server. It is only that fewer parsed columns are mapped to table columns, resulting in a smaller model size. If Azure Blob Storage could filter out the unused columns right away, as more sophisticated data sources could, such as Azure SQL Data Warehouse, then the transfer of about 150 GB of raw data could have been avoided and processing time would have been improved as well. But this was not an option.
Given that the files needed to be read from Azure Blob Storage as before, it was not necessary to edit the source queries or modify the ReadBlobData function. As the following diagram illustrates based on the store_sales table, the ReadBlobData function still reads the contents of all the source files and continues to offer the full set of parsed columns to Azure Analysis Services for import. It's just that Azure Analysis Services ignores the ss_net_paid_inc_tax column because it was deleted from the store_sales table in the model.
If Azure Blob Storage does not offer an option to reduce the data volume at the source, then perhaps processing time can be improved by pulling more data into Azure Analysis Services in parallel. For the initial data import, I modified the existing partition on each of the seven large tables to import all 100 files per table. So, Azure Analysis Services processed seven partitions in parallel, which took more than 21 hours to complete. The next test would be to use two partitions per table, each importing 50 files.
In SSMS, connected to the Azure Analysis Services server, I performed the following steps (see also the screenshots below):
- Run Process Clear on all seven large tables to purge the existing data.
- Edit the exiting partition of each table and change the ReadBlobData(Source, " <container name> ", 1, 100) line to ReadBlobData(Source, " <container name> ", 1, 50).
- Create a copy of the partition to add a second partition to each table and change the ReadBlobData(Source, " <container name> ", 1, 50) line to ReadBlobData(Source, " <container name> ", 51, 50).
- Run Process Full on the large tables to import the data again.
As you can see in the screenshot below, using two partitions per table helped to reduce the processing time by roughly 7 hours.
If 2 partitions per table have such a positive effect, then 4 partitions might yield perhaps even more gains. There is, however, one more detail to consider: the maxConnections parameter on the data source. By default, maxConnections is not specified explicitly. The default value is 10. So, Analysis Services establishes a maximum of 10 concurrent connections to the data source by default. Yet, with 7 large tables in the model and 4 partitions each, Analysis Services would need to process 28 partitions in parallel. Hence, it is necessary to adjust the maxConnections setting, as in the screenshot below. Note that the user interface currently does not expose the maxConnections parameter for modern data sources. In the current tools, this parameter must be specified through TMSL or programmatically by using the Tabular Object Model. Note also that maxConnections should not exceed the number of processor cores on the server. With 28 partitions and maxConnections set to 28, the S9 Azure Analysis Services server was able to finish processing in 11 hours and 36 minutes.
Subsequent experiments with higher partition counts (up to 100 partitions per table -- one source file per partition) and additional storage accounts (up to seven -- one for each table) did not produce any further noteworthy gains. As mentioned earlier, processing time could be reduced by switching to a more sophisticated data source, such as Azure SQL Data Warehouse, and then excluding the unnecessary columns at the source. A corresponding test showed that it took no more than an amazing 2 hours and 30 minutes to load the entire data set into an Azure SQL Data Warehouse by using PolyBase, following the steps outlined in the tutorial "Load data with PolyBase in SQL Data Warehouse," and then processing times in Azure Analysis Services could be reduced to around 9 hours. But for the mere joy of processing 1 TB of raw blob data in Azure Analysis Services, 11 hours and 36 minutes was reasonably sufficient.
And that's it for this rather detailed journey about deploying a Tabular 1400 model in Azure Analysis Services on top of a 1 TB TPC-DS data set in Azure Blob Storage. Thanks to the modern Get Data experience, you can build a flexible data import pipeline directly in the model and process even very large data sets within a reasonable timespan. And as always, please deploy the latest monthly release of SSDT Tabular and SSMS and use these tools to take Tabular 1400 in Azure Analysis Services for a test drive. Send us your feedback and suggestions by using ProBIToolsFeedback or SSASPrev at Microsoft.com. Or use any other available communication channels such as UserVoice or MSDN forums. Influence the evolution of the Analysis Services connectivity stack to the benefit of all our customers!