Data warehousing and analytics

This example scenario demonstrates a data pipeline that integrates large amounts of data from multiple sources into a unified analytics platform in Azure. This specific scenario is based on a sales and marketing solution, but the design patterns are relevant for many industries that require advanced analytics of large datasets, such as e-commerce, retail, and healthcare.

Architecture

Figure: Architecture for a data warehousing and analytics scenario in Azure.

Download a Visio file of this architecture.

Dataflow

The data flows through the solution as follows:

  1. For each data source, any updates are exported periodically into a staging area in Azure Data Lake Storage (see the staging sketch after this list).
  2. Azure Data Factory incrementally loads the data from Azure Data Lake Storage into staging tables in Azure Synapse Analytics. The data is cleansed and transformed during this process. PolyBase can parallelize the process for large datasets.
  3. After loading a new batch of data into the warehouse, a previously created Azure Analysis Services tabular model is refreshed. This semantic model simplifies the analysis of business data and relationships.
  4. Business analysts use Microsoft Power BI to analyze warehoused data via the Analysis Services semantic model.
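
As a minimal illustration of step 1, the following Python sketch uploads a periodic export into the Data Lake staging area by using the azure-identity and azure-storage-file-datalake packages. The account URL, file system, and path names are placeholders invented for this example, not values from the scenario.

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder names for illustration only.
    ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"
    STAGING_FILESYSTEM = "staging"
    STAGING_PATH = "sales/orders/2024-01-31/orders.csv"

    def stage_extract(local_file: str) -> None:
        """Upload an exported extract from a source system into the staging area."""
        # DefaultAzureCredential picks up a managed identity, environment
        # variables, or a developer sign-in, depending on where it runs.
        credential = DefaultAzureCredential()
        service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=credential)
        file_system = service.get_file_system_client(STAGING_FILESYSTEM)
        file_client = file_system.get_file_client(STAGING_PATH)
        with open(local_file, "rb") as data:
            file_client.upload_data(data, overwrite=True)

    if __name__ == "__main__":
        stage_extract("orders.csv")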

Components

The company has data sources on many different platforms:

  • SQL Server on-premises
  • Oracle on-premises
  • Azure SQL Database
  • Azure Table Storage
  • Azure Cosmos DB

Data is loaded from these different data sources using several Azure components:

  • Azure Data Lake Storage is used to stage source data before it's loaded into Azure Synapse.
  • Data Factory orchestrates the transformation of staged data into a common structure in Azure Synapse. Data Factory uses PolyBase when loading data into Azure Synapse to maximize throughput.
  • Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of massively parallel processing (MPP) makes it suitable for running high-performance analytics. Azure Synapse can use PolyBase to rapidly load data from Azure Data Lake Storage.
  • Analysis Services provides a semantic model for your data. It can also speed up analysis by answering queries from its in-memory tabular model instead of sending every query to the data warehouse.
  • Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic model stored in Analysis Services, or it can query Azure Synapse directly.
  • Microsoft Entra ID authenticates users who connect to the Analysis Services server through Power BI. Data Factory can also use Microsoft Entra ID to authenticate to Azure Synapse via a service principal or a managed identity for Azure resources, as shown in the sketch after this list.
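
To make the orchestration and authentication pattern concrete, here's a minimal Python sketch that starts a Data Factory pipeline run with a Microsoft Entra ID credential, using the azure-identity and azure-mgmt-datafactory packages. The subscription, resource group, factory, and pipeline names are placeholders, not part of this scenario.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Placeholder names for illustration only.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "analytics-rg"
    FACTORY_NAME = "sales-adf"
    PIPELINE_NAME = "LoadSalesToSynapse"

    def trigger_incremental_load() -> str:
        """Start a Data Factory pipeline run, authenticating with Microsoft Entra ID."""
        # In production this resolves to a service principal or managed identity.
        credential = DefaultAzureCredential()
        client = DataFactoryManagementClient(credential=credential, subscription_id=SUBSCRIPTION_ID)
        run = client.pipelines.create_run(
            resource_group_name=RESOURCE_GROUP,
            factory_name=FACTORY_NAME,
            pipeline_name=PIPELINE_NAME,
        )
        return run.run_id

    if __name__ == "__main__":
        print(f"Pipeline run started: {trigger_incremental_load()}")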

Scenario details

This example demonstrates a sales and marketing company that creates incentive programs. These programs reward customers, suppliers, salespeople, and employees. Data is fundamental to these programs, and the company wants to improve the insights gained through data analytics using Azure.

The company needs a modern approach to analyzing data, so that decisions are made using the right data at the right time. The company's goals include:

  • Combining different kinds of data sources into a cloud-scale platform.
  • Transforming source data into a common taxonomy and structure, to make the data consistent and easily compared.
  • Loading data using a highly parallelized approach that can support thousands of incentive programs, without the high costs of deploying and maintaining on-premises infrastructure.
  • Greatly reducing the time needed to gather and transform data, so analysts can focus on analyzing the data.

Potential use cases

This approach can also be used to:

  • Establish a data warehouse to be a single source of truth for your data.
  • Integrate relational data sources with other unstructured datasets.
  • Use semantic modeling and powerful visualization tools for simpler data analysis.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

The technologies in this architecture were chosen because they met the company's requirements for scalability and availability, while helping the company control costs.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

Review a pricing sample for a data warehousing scenario via the Azure pricing calculator. Adjust the values to see how your requirements affect your costs.

  • Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are charged per hour, and you can scale or pause these resources on demand; see the sketch after this list. Storage resources are billed per terabyte, so your costs will increase as you ingest more data.
  • Data Factory costs are based on the number of read/write operations, monitoring operations, and orchestration activities performed in a workload. Your Data Factory costs will increase with each additional data stream and the amount of data processed by each one.
  • Analysis Services is available in developer, basic, and standard tiers. Instances are priced based on query processing units (QPUs) and available memory. To keep your costs lower, minimize the number of queries you run, the amount of data they process, and how often they run.
  • Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included in the pricing sample above.
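
As one way to act on the compute/storage split noted above, the following Python sketch pauses a Synapse dedicated SQL pool (for example, outside business hours) by using the azure-mgmt-synapse package, so compute charges stop while storage continues to be billed. The resource names are placeholders.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.synapse import SynapseManagementClient

    # Placeholder names for illustration only.
    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "analytics-rg"
    WORKSPACE_NAME = "sales-synapse"
    SQL_POOL_NAME = "salesdw"

    def pause_sql_pool() -> None:
        """Pause the dedicated SQL pool so compute is no longer billed; storage is still billed."""
        client = SynapseManagementClient(
            credential=DefaultAzureCredential(), subscription_id=SUBSCRIPTION_ID
        )
        poller = client.sql_pools.begin_pause(
            resource_group_name=RESOURCE_GROUP,
            workspace_name=WORKSPACE_NAME,
            sql_pool_name=SQL_POOL_NAME,
        )
        poller.result()  # wait for the pause operation to complete

    if __name__ == "__main__":
        pause_sql_pool()

The same operations group also exposes begin_resume, which can bring the pool back online before the next scheduled load.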

Contributors

This article is maintained by Microsoft.
