This reference architecture implements an extract, load, and transform (ELT) pipeline that moves data from an on-premises SQL Server database into Azure Synapse and transforms the data for analysis.
A reference implementation for this architecture is available on GitHub.
Scenario: An organization has a large OLTP data set stored in a SQL Server database on premises. The organization wants to use Azure Synapse to perform analysis using Power BI.
This reference architecture is designed for one-time or on-demand jobs. If you need to move data on a continuing basis (hourly or daily), we recommend using Azure Data Factory to define an automated workflow. For a reference architecture that uses Data Factory, see Automated enterprise BI with Azure Synapse and Azure Data Factory.
The architecture consists of the following components.
SQL Server. The source data is located in a SQL Server database on premises. To simulate the on-premises environment, the deployment scripts for this architecture provision a VM in Azure with SQL Server installed. The Wide World Importers OLTP sample database is used as the source data.
Ingestion and data storage
Blob Storage. Blob storage is used as a staging area to copy the data before loading it into Azure Synapse.
Azure Synapse. Azure Synapse is a distributed system designed to perform analytics on large data. It supports massive parallel processing (MPP), which makes it suitable for running high-performance analytics.
Analysis and reporting
Azure Analysis Services. Analysis Services is a fully managed service that provides data modeling capabilities. Use Analysis Services to create a semantic model that users can query. Analysis Services is especially useful in a BI dashboard scenario. In this architecture, Analysis Services reads data from the data warehouse to process the semantic model, and efficiently serves dashboard queries. It also supports elastic concurrency, by scaling out replicas for faster query processing.
Currently, Azure Analysis Services supports tabular models but not multidimensional models. Tabular models use relational modeling constructs (tables and columns), whereas multidimensional models use OLAP modeling constructs (cubes, dimensions, and measures). If you require multidimensional models, use SQL Server Analysis Services (SSAS). For more information, see Comparing tabular and multidimensional solutions.
Power BI. Power BI is a suite of business analytics tools to analyze data for business insights. In this architecture, it queries the semantic model stored in Analysis Services.
Azure Active Directory (Azure AD) authenticates users who connect to the Analysis Services server through Power BI.
This reference architecture uses the WorldWideImporters sample database as a data source. The data pipeline has the following stages:
- Export the data from SQL Server to flat files (bcp utility).
- Copy the flat files to Azure Blob Storage (AzCopy).
- Load the data into Azure Synapse (PolyBase).
- Transform the data into a star schema (T-SQL).
- Load a semantic model into Analysis Services (SQL Server Data Tools).
For steps 1 – 3, consider using Redgate Data Platform Studio. Data Platform Studio applies the most appropriate compatibility fixes and optimizations, so it's the quickest way to get started with Azure Synapse. For more information, see Load data with Redgate Data Platform Studio.
The next sections describe these stages in more detail.
Export data from SQL Server
The bcp (bulk copy program) utility is a fast way to create flat text files from SQL tables. In this step, you select the columns that you want to export, but don't transform the data. Any data transformations should happen in Azure Synapse.
If possible, schedule data extraction during off-peak hours, to minimize resource contention in the production environment.
Avoid running bcp on the database server. Instead, run it from another machine. Write the files to a local drive. Ensure that you have sufficient I/O resources to handle the concurrent writes. For best performance, export the files to dedicated fast storage drives.
You can speed up the network transfer by saving the exported data in Gzip compressed format. However, loading compressed files into the warehouse is slower than loading uncompressed files, so there is a tradeoff between faster network transfer versus faster loading. If you decide to use Gzip compression, don't create a single Gzip file. Instead, split the data into multiple compressed files.
Copy flat files into blob storage
The AzCopy utility is designed for high-performance copying of data into Azure blob storage.
Create the storage account in a region near the location of the source data. Deploy the storage account and the Azure Synapse instance in the same region.
Don't run AzCopy on the same machine that runs your production workloads, because the CPU and I/O consumption can interfere with the production workload.
Test the upload first to see what the upload speed is like. You can use the /NC option in AzCopy to specify the number of concurrent copy operations. Start with the default value, then experiment with this setting to tune the performance. In a low-bandwidth environment, too many concurrent operations can overwhelm the network connection and prevent the operations from completing successfully.
AzCopy moves data to storage over the public internet. If this isn't fast enough, consider setting up an ExpressRoute circuit. ExpressRoute is a service that routes your data through a dedicated private connection to Azure. Another option, if your network connection is too slow, is to physically ship the data on disk to an Azure datacenter. For more information, see Transferring data to and from Azure.
During a copy operation, AzCopy creates a temporary journal file, which enables AzCopy to restart the operation if it gets interrupted (for example, due to a network error). Make sure there is enough disk space to store the journal files. You can use the /Z option to specify where the journal files are written.
Load data into Azure Synapse
Use PolyBase to load the files from blob storage into the data warehouse. PolyBase is designed to leverage the MPP (Massively Parallel Processing) architecture of Azure Synapse, which makes it the fastest way to load data into Azure Synapse.
Loading the data is a two-step process:
- Create a set of external tables for the data. An external table is a table definition that points to data stored outside of the warehouse — in this case, the flat files in blob storage. This step does not move any data into the warehouse.
- Create staging tables, and load the data into the staging tables. This step copies the data into the warehouse.
Consider Azure Synapse when you have large amounts of data (more than 1 TB) and are running an analytics workload that will benefit from parallelism. Azure Synapse is not a good fit for OLTP workloads or smaller data sets (less than 250 GB). For data sets less than 250 GB, consider Azure SQL Database or SQL Server. For more information, see Data warehousing.
Create the staging tables as heap tables, which are not indexed. The queries that create the production tables will result in a full table scan, so there is no reason to index the staging tables.
PolyBase automatically takes advantage of parallelism in the warehouse. The load performance scales as you increase DWUs. For best performance, use a single load operation. There is no performance benefit to breaking the input data into chunks and running multiple concurrent loads.
PolyBase can read Gzip compressed files. However, only a single reader is used per compressed file, because uncompressing the file is a single-threaded operation. Therefore, avoid loading a single large compressed file. Instead, split the data into multiple compressed files, in order to take advantage of parallelism.
Be aware of the following limitations:
PolyBase supports a maximum column size of
varbinary(8000). If you have data that exceeds these limits, one option is to break the data up into chunks when you export it, and then reassemble the chunks after import.
PolyBase uses a fixed row terminator of \n or newline. This can cause problems if newline characters appear in the source data.
Your source data schema might contain data types that are not supported in Azure Synapse.
To work around these limitations, you can create a stored procedure that performs the necessary conversions. Reference this stored procedure when you run bcp. Alternatively, Redgate Data Platform Studio automatically converts data types that aren't supported in Azure Synapse.
For more information, see the following articles:
- Best practices for loading data into Azure Synapse.
- Migrate your schemas to Azure Synapse
- Guidance for defining data types for tables in Azure Synapse
Transform the data
Transform the data and move it into production tables. In this step, the data is transformed into a star schema with dimension tables and fact tables, suitable for semantic modeling.
Create the production tables with clustered columnstore indexes, which offer the best overall query performance. Columnstore indexes are optimized for queries that scan many records. Columnstore indexes don't perform as well for singleton lookups (that is, looking up a single row). If you need to perform frequent singleton lookups, you can add a non-clustered index to a table. Singleton lookups can run significantly faster using a non-clustered index. However, singleton lookups are typically less common in data warehouse scenarios than OLTP workloads. For more information, see Indexing tables in Azure Synapse.
Clustered columnstore tables do not support
varbinary(max) data types. In that case, consider a heap or clustered index. You might put those columns into a separate table.
Because the sample database is not very large, we created replicated tables with no partitions. For production workloads, using distributed tables is likely to improve query performance. See Guidance for designing distributed tables in Azure Synapse. Our example scripts run the queries using a static resource class.
Load the semantic model
Load the data into a tabular model in Azure Analysis Services. In this step, you create a semantic data model by using SQL Server Data Tools (SSDT). You can also create a model by importing it from a Power BI Desktop file. Because Azure Synapse does not support foreign keys, you must add the relationships to the semantic model, so that you can join across tables.
Use Power BI to visualize the data
Power BI supports two options for connecting to Azure Analysis Services:
- Import. The data is imported into the Power BI model.
- Live Connection. Data is pulled directly from Analysis Services.
We recommend Live Connection because it doesn't require copying data into the Power BI model. Also, using DirectQuery ensures that results are always consistent with the latest source data. For more information, see Connect with Power BI.
Avoid running BI dashboard queries directly against the data warehouse. BI dashboards require very low response times, which direct queries against the warehouse may be unable to satisfy. Also, refreshing the dashboard will count against the number of concurrent queries, which could impact performance.
Azure Analysis Services is designed to handle the query requirements of a BI dashboard, so the recommended practice is to query Analysis Services from Power BI.
With Azure Synapse, you can scale out your compute resources on demand. The query engine optimizes queries for parallel processing based on the number of compute nodes, and moves data between nodes as necessary. For more information, see Manage compute in Azure Synapse.
For production workloads, we recommend the Standard Tier for Azure Analysis Services, because it supports partitioning and DirectQuery. Within a tier, the instance size determines the memory and processing power. Processing power is measured in Query Processing Units (QPUs). Monitor your QPU usage to select the appropriate size. For more information, see Monitor server metrics.
Under high load, query performance can become degraded due to query concurrency. You can scale out Analysis Services by creating a pool of replicas to process queries, so that more queries can be performed concurrently. The work of processing the data model always happens on the primary server. By default, the primary server also handles queries. Optionally, you can designate the primary server to run processing exclusively, so that the query pool handles all queries. If you have high processing requirements, you should separate the processing from the query pool. If you have high query loads, and relatively light processing, you can include the primary server in the query pool. For more information, see Azure Analysis Services scale-out.
To reduce the amount of unnecessary processing, consider using partitions to divide the tabular model into logical parts. Each partition can be processed separately. For more information, see Partitions.
IP allow list of Analysis Services clients
Consider using the Analysis Services firewall feature to allow list client IP addresses. If enabled, the firewall blocks all client connections other than those specified in the firewall rules. The default rules allow list the Power BI service, but you can disable this rule if desired. For more information, see Hardening Azure Analysis Services with the new firewall capability.
Azure Analysis Services uses Azure Active Directory (Azure AD) to authenticate users who connect to an Analysis Services server. You can restrict what data a particular user is able to view, by creating roles and then assigning Azure AD users or groups to those roles. For each role, you can:
- Protect tables or individual columns.
- Protect individual rows based on filter expressions.
For more information, see Manage database roles and users.
Create separate resource groups for production, development, and test environments. Separate resource groups make it easier to manage deployments, delete test deployments, and assign access rights.
Use the Azure Building blocks templates provided in this architecture or create Azure Resource Manager template to deploy the Azure resources following the infrastructure as Code (IaC) Process. With templates, automating deployments using Azure DevOps Services, or other CI/CD solutions is easier.
Put each workload in a separate deployment template and store the resources in source control systems. You can deploy the templates together or individually as part of a CI/CD process, making the automation process easier.
In this architecture, there are three main workloads:
- The data warehouse server, Analysis Services, and related resources.
- Azure Data Factory.
- An on-premises to cloud simulated scenario.
Each workload has its own deployment template.
The data warehouse server is set up and configured by using Azure CLI commands which follows the imperative approach of the IaC practice. Consider using deployment scripts and integrate them in the automation process.
Consider staging your workloads. Deploy to various stages and run validation checks at each stage before moving to the next stage. That way you can push updates to your production environments in a highly controlled way and minimize unanticipated deployment issues. Use Blue-green deployment and Canary releases strategies for updating live production environments.
Have a good rollback strategy for handling failed deployments. For example, you can automatically redeploy an earlier, successful deployment from your deployment history. See the --rollback-on-error flag parameter in Azure CLI.
Azure Monitor is the recommended option for analyzing the performance of your data warehouse and the entire Azure analytics platform for an integrated monitoring experience. Azure Synapse Analytics provides a monitoring experience within the Azure portal to show insights to your data warehouse workload. The Azure portal is the recommended tool when monitoring your data warehouse because it provides configurable retention periods, alerts, recommendations, and customizable charts and dashboards for metrics and logs.
For more information, see the DevOps section in Microsoft Azure Well-Architected Framework.
Choose Compute Optimized Gen1 for frequent scaling operations. This option is priced as pay-as-you-go, based on Data warehouse units consumption (DWU).
Choose Compute Optimized Gen2 for intensive workloads with higher query performance and compute scalability needs. You can choose the pay-as-you-go model or use reserved plans of one year (37% savings) or 3 years (65% savings).
Data storage is charged separately. Other services such as disaster recovery and threat detection are also charged separately.
For more information, see Azure Synapse Pricing.
Azure Analysis Services
Pricing for Azure Analysis Services depends on the tier. The reference implementation of this architecture uses the Developer tier, which is recommended for evaluation, development, and test scenarios. Other tiers include, the Basic tier, which is recommended for small production environment; the Standard tier for mission-critical production applications. For more information, see The right tier when you need it.
No charges apply when you pause your instance.
For more information, see Azure Analysis Services pricing.
Consider using the Azure Storage reserved capacity feature to lower cost on storage. With this model, you get a discount if you can commit to reservation for fixed storage capacity for one or three years. For more information, see Optimize costs for Blob storage with reserved capacity.
Power BI Embedded
Power BI Embedded is a Platform-as-a-Service (PaaS) solution that offers a set of APIs to enable the integration of Power BI content into custom apps and websites. Users who publish BI content need to be licensed with Power BI Pro. For information about pricing, see Power BI Embedded pricing.
For more information, see the Cost section in Microsoft Azure Well-Architected Framework.
Deploy the solution
To the deploy and run the reference implementation, follow the steps in the GitHub readme. It deploys the following resources:
- A Windows VM to simulate an on-premises database server. It includes SQL Server 2017 and related tools, along with Power BI Desktop.
- An Azure storage account that provides Blob storage to hold data exported from the SQL Server database.
- An Azure Synapse instance.
- An Azure Analysis Services instance.
- Use Azure Data Factory to automate the ELT pipeline. See Automated enterprise BI with Azure Synapse and Azure Data Factory.
You may want to review the following Azure example scenarios that demonstrate specific solutions using some of the same technologies: