Analytics end-to-end with Azure Synapse

Synapse Analytics
Cosmos DB
Data Factory
Databricks
Event Hubs

This example scenario demonstrates how to use Azure Synapse Analytics with the extensive family of Azure Data Services to build a modern data platform that's capable of handling the most common data challenges in an organization.

The solution described in this article combines a range of Azure services that will ingest, store, process, enrich, and serve data and insights from different sources (structured, semi-structured, unstructured, and streaming).

Potential use cases

This approach can also be used to:

  • Establish a data product architecture, which consists of a data warehouse for structured data and a data lake for semi-structured and unstructured data. You can choose to deploy a single data product for centralized environments or multiple data products for distributed environments such as Data Mesh. See more information about Data Management and Data Landing Zones.
  • Integrate relational data sources with other unstructured datasets, with the use of big data processing technologies.
  • Use semantic modeling and powerful visualization tools for simpler data analysis.
  • Share datasets within the organization or with trusted external partners.
  • Implement knowledge mining solutions to extract valuable business information hidden in images, PDFs, documents, and so on.

Architecture

Architecture for a modern data platform using Azure data services

Download a Visio file of this architecture.

Note

  • The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features that are not covered by this design.
  • Specific business requirements for your analytics use case could require the use of different services or features that are not considered in this design.

Deploy the architecture

This deployment accelerator gives you the option to implement the entire reference architecture or choose which workloads you need for your analytics use case. You can also select whether services are accessible via public endpoints or only via private endpoints.

Use the following button to deploy the reference architecture using the Azure portal.

Deploy to Azure

For detailed information and additional deployment options, see the deployment accelerator GitHub repo with documentation and code used to define this solution.

Analytics use cases

The analytics use cases covered by the architecture are illustrated by the different data sources on the left-hand side of the diagram. Data flows through the solution from the bottom up as follows:

Azure data services, cloud native HTAP with Cosmos DB and Dataverse

Process

  1. Azure Synapse Link for Azure Cosmos DB and Azure Synapse Link for Dataverse enable you to run near real-time analytics over operational and business application data, by using the analytics engines that are available from your Azure Synapse workspace: SQL Serverless and Spark Pools.

  2. When using Azure Synapse Link for Cosmos DB, use either a SQL Serverless query or a Spark Pool notebook. You can access the Cosmos DB analytical store and then combine datasets from your near real-time operational data with data from your data lake or from your data warehouse.

  3. When using Azure Synapse Link for Dataverse, use either a SQL Serverless query or a Spark Pool notebook. You can access the selected Dataverse tables and then combine datasets from your near real-time business applications data with data from your data lake or from your data warehouse.
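
As a minimal sketch of step 2, a Spark Pool notebook can load a Cosmos DB container's analytical store into a DataFrame through the `cosmos.olap` format. The function name, the linked service name, and the container name are placeholders chosen for illustration; `spark` is assumed to be the session that a Synapse notebook provides.

```python
def read_analytical_store(spark, linked_service: str, container: str):
    """Load a Cosmos DB container's analytical store as a Spark DataFrame.

    `spark` is the SparkSession available in a Synapse notebook.
    `linked_service` and `container` are placeholders for your own names.
    """
    return (
        spark.read.format("cosmos.olap")
        # Name of the Synapse linked service that points at the Cosmos DB account.
        .option("spark.synapse.linkedService", linked_service)
        # Container whose analytical store is read.
        .option("spark.cosmos.container", container)
        .load()
    )
```

The returned DataFrame can then be joined with DataFrames read from the data lake or the data warehouse, as the steps above describe.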

Store

  1. The resulting datasets from your SQL Serverless queries can be persisted in your data lake. If you are using Spark notebooks, the resulting datasets can be persisted either in your data lake or data warehouse (SQL pool).

Serve

  1. Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization and exploration. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.

  2. Data can also be securely shared to other business units or external trusted partners using Azure Data Share. Data consumers have the freedom to choose what data format they want to use and also what compute engine is best to process the shared datasets.

  3. Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions and use AI to uncover valuable business insights across different document types and formats including from Office documents, PDFs, images, audio, forms, and web pages.

Relational databases

Ingest

  1. Use Azure Synapse pipelines to pull data from a wide variety of databases, both on-premises and in the cloud. Pipelines can be triggered based on a pre-defined schedule, in response to an event, or can be explicitly called via REST APIs.
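
To illustrate the REST option, the sketch below builds the URL for starting a pipeline run. It assumes the workspace development endpoint format `https://<workspace>.dev.azuresynapse.net` and the `createRun` operation; the API version shown is an assumption that should be checked against the current REST reference, and the workspace and pipeline names are hypothetical.

```python
from urllib.parse import quote, urlencode


def pipeline_run_url(workspace: str, pipeline: str,
                     api_version: str = "2020-12-01") -> str:
    """Build the URL used to start a Synapse pipeline run.

    The default api_version is an assumption; verify it before use.
    """
    base = f"https://{workspace}.dev.azuresynapse.net"
    # Percent-encode the pipeline name so spaces and slashes are safe in a path.
    path = f"/pipelines/{quote(pipeline, safe='')}/createRun"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


print(pipeline_run_url("contoso-synapse", "Copy Sales Data"))
# https://contoso-synapse.dev.azuresynapse.net/pipelines/Copy%20Sales%20Data/createRun?api-version=2020-12-01
```

The URL would be called with an HTTP POST that carries a valid bearer token; authentication is left out of this sketch.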

Store

  1. Organize your data lake following the best practices around which zones to create, what folder structures to use in each zone, and which file formats to use for each analytics scenario.

  2. From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the relational databases into the raw zone of your Azure Data Lake Storage Gen2 data lake. You can save the data in delimited text format or compressed as Parquet files.
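
One way to keep zones and folder structures consistent is to generate lake paths from a single helper. The sketch below is only an illustration: the zone names, the `zone/source/entity/yyyy/mm/dd` layout, and the account and container names are hypothetical choices, not a layout prescribed by this architecture. The `abfss://` URI form is the one used to address Data Lake Storage Gen2.

```python
from datetime import date

# Hypothetical zone names; use whatever zones your data lake design defines.
ZONES = ("raw", "enriched", "curated")


def lake_path(account: str, container: str, zone: str, source: str,
              entity: str, load_date: date, file_name: str) -> str:
    """Return an abfss:// path under zone/source/entity/yyyy/mm/dd/."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{zone}/{source}/{entity}/"
        f"{load_date:%Y/%m/%d}/{file_name}"
    )


print(lake_path("contosolake", "datalake", "raw", "sales_db",
                "orders", date(2024, 3, 5), "orders.parquet"))
# abfss://datalake@contosolake.dfs.core.windows.net/raw/sales_db/orders/2024/03/05/orders.parquet
```

Partitioning by load date, as assumed here, keeps each pipeline run's output in its own folder and makes reprocessing a single day straightforward.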

Process and enrich

  1. Use either data flows, SQL serverless queries, or Spark notebooks to validate, transform, and move the datasets into your Curated zone in your data lake.

    1. As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL, or from Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights. The models can be consumed from Azure Cognitive Services or be custom ML models from Azure ML.
  2. You can serve your final dataset directly from the data lake Curated zone, or you can use a Copy data activity to ingest the final dataset into your SQL pool tables using the COPY command for fast ingestion.
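
For reference, the COPY command mentioned above is a T-SQL statement that loads files from storage into a SQL pool table. The following sketch generates such a statement from Python; the helper name, the table name, and the storage URL are hypothetical, and authenticating with the workspace managed identity is an assumption about your setup.

```python
def copy_into_statement(table: str, source_url: str,
                        file_type: str = "PARQUET") -> str:
    """Build a T-SQL COPY INTO statement for a dedicated SQL pool table.

    Inputs are interpolated directly, so pass only trusted configuration
    values, never user-supplied text.
    """
    allowed = {"PARQUET", "CSV", "ORC"}
    if file_type not in allowed:
        raise ValueError(f"unsupported file type: {file_type}")
    return (
        f"COPY INTO {table}\n"
        f"FROM '{source_url}'\n"
        "WITH (\n"
        f"    FILE_TYPE = '{file_type}',\n"
        # Assumes the workspace managed identity can read the storage account.
        "    CREDENTIAL = (IDENTITY = 'Managed Identity')\n"
        ")"
    )


print(copy_into_statement(
    "dbo.Orders",
    "https://contosolake.dfs.core.windows.net/datalake/curated/orders/*.parquet",
))
```

When you use a Copy data activity, the pipeline can issue the command for you; the generated text only shows roughly what that statement looks like.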

Serve

  1. Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.

  2. Data can also be securely shared to other business units or external trusted partners using Azure Data Share. Data consumers have the freedom to choose what data format they want to use and also what compute engine is best to process the shared datasets.

  3. Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions and use AI to uncover valuable business insights across different document types and formats including from Office documents, PDFs, images, audio, forms, and web pages.

Semi-structured data sources

Ingest

  1. Use Azure Synapse pipelines to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud. For example:

    • Ingest data from file-based sources containing CSV or JSON files.
    • Connect to NoSQL databases such as Azure Cosmos DB or MongoDB.
    • Call REST APIs provided by SaaS applications that will function as your data source for the pipeline.

Store

  1. Organize your data lake following the best practices around which zones to create, what folder structures to use in each zone, and which file formats to use for each analytics scenario.

  2. From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the semi-structured data sources into the raw zone of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.

Process and enrich

  1. For batch/micro-batch pipelines, use either data flows, SQL serverless queries or Spark notebooks to validate, transform, and move your datasets into your Curated zone in your data lake. SQL Serverless queries expose underlying CSV, Parquet, or JSON files as external tables, so that they can be queried using T-SQL.

    1. As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL, or from Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights. The models can be consumed from Azure Cognitive Services or be custom ML models from Azure ML.
  2. You can serve your final dataset directly from the data lake Curated zone, or you can use a Copy data activity to ingest the final dataset into your SQL pool tables using the COPY command for fast ingestion.

  3. For near real-time telemetry and time-series analytics scenarios, use Data Explorer pools to easily ingest, consolidate, and correlate logs and IoT events data across multiple data sources. With Data Explorer pools, you can use Kusto queries (KQL) to perform time-series analysis, geospatial clustering, and machine learning enrichment.
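
Besides external tables, SQL Serverless can read lake files ad hoc with `OPENROWSET`, which is convenient while exploring data. The sketch below builds such a query as a string; the helper name and the storage URL are hypothetical, and only Parquet and CSV are handled to keep the example short.

```python
def openrowset_query(file_url: str, file_format: str = "PARQUET",
                     top: int = 10) -> str:
    """Build an ad hoc serverless SQL query over files in the data lake."""
    formats = {
        "PARQUET": "FORMAT = 'PARQUET'",
        # With parser version 2.0 the first row can supply the column names.
        "CSV": "FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE",
    }
    if file_format not in formats:
        raise ValueError(f"unsupported format: {file_format}")
    return (
        f"SELECT TOP {int(top)} *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{file_url}',\n"
        f"    {formats[file_format]}\n"
        ") AS [result]"
    )


print(openrowset_query(
    "https://contosolake.dfs.core.windows.net/datalake/raw/telemetry/*.csv",
    file_format="CSV",
))
```

Once a query shape is settled, it can be wrapped in a view or an external table so that downstream tools query it like any other T-SQL object.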

Serve

  1. Load relevant data from the Azure Synapse SQL pools, Data Explorer pools, or a data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.

  2. Data can also be securely shared to other business units or external trusted partners using Azure Data Share. Data consumers have the freedom to choose what data format they want to use and also what compute engine is best to process the shared datasets.

  3. Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions and use AI to uncover valuable business insights across different document types and formats including from Office documents, PDFs, images, audio, forms, and web pages.

Non-structured data sources

Ingest

  1. Use Azure Synapse pipelines to pull data from a wide variety of non-structured data sources, both on-premises and in the cloud. For example:

    • Ingest video, image, audio, or free text from file-based sources that contain the source files.
    • Call REST APIs provided by SaaS applications that will function as your data source for the pipeline.

Store

  1. Organize your data lake by following the best practices about which zones to create, what folder structures to use in each zone, and which file formats to use for each analytics scenario.

  2. From the Azure Synapse pipeline, use a Copy data activity to stage the data copied from the non-structured data sources into the raw zone of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.

Process and enrich

  1. Use Spark notebooks to validate, transform, enrich, and move your datasets into your Curated zone in your data lake.

    1. As part of your data transformations, you can invoke machine-learning models from your SQL pools by using standard T-SQL, or from Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights. The models can be consumed from Azure Cognitive Services or be custom ML models from Azure ML.
  2. You can serve your final dataset directly from the data lake Curated zone, or you can use a Copy data activity to ingest the final dataset into your data warehouse tables using the COPY command for fast ingestion.

Serve

  1. Load relevant data from the Azure Synapse SQL pool or data lake into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

  2. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.

  3. Data can also be securely shared to other business units or external trusted partners using Azure Data Share. Data consumers have the freedom to choose what data format they want to use and also what compute engine is best to process the shared datasets.

  4. Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions and use AI to uncover valuable business insights across different document types and formats including from Office documents, PDFs, images, audio, forms, and web pages.

Streaming

Ingest

  1. Use Azure Event Hubs or Azure IoT Hub to ingest data streams generated by client applications or IoT devices. Event Hubs or IoT Hub then ingests and stores the streaming data, preserving the sequence of events received. Consumers can then connect to Event Hubs or IoT Hub endpoints and retrieve messages for processing.

Store

  1. Organize your data lake following the best practices around which zones to create, what folder structures to use in each zone, and which file formats to use for each analytics scenario.

  2. Configure Event Hubs Capture or IoT Hub Storage Endpoints to save a copy of the events into the Raw zone of your Azure Data Lake Storage Gen2 data lake. This feature implements the "Cold Path" of the Lambda architecture pattern. It allows you to perform historical and trend analysis on the stream data saved in your data lake by using SQL Serverless queries or Spark notebooks, following the pattern for semi-structured data sources described above.
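
When querying the cold path, it helps to know where captured files land. The sketch below assumes Event Hubs Capture is left on its default naming convention, `{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}`, with Avro output; the function name and the sample values are hypothetical, and a customized capture format would need a different template.

```python
from datetime import datetime


def capture_blob_name(namespace: str, event_hub: str, partition_id: int,
                      window_start: datetime) -> str:
    """Return the blob name Capture would use under its default convention."""
    return (
        f"{namespace}/{event_hub}/{partition_id}/"
        f"{window_start:%Y/%m/%d/%H/%M/%S}.avro"
    )


print(capture_blob_name("mynamespace", "myeventhub", 0,
                        datetime(2017, 12, 8, 3, 3, 17)))
# mynamespace/myeventhub/0/2017/12/08/03/03/17.avro
```

Because the date parts are folder levels, a historical query can limit itself to one day or hour by using a path prefix instead of scanning every captured file.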

Process and enrich

  1. For real-time insights, use a Stream Analytics job to implement the "Hot Path" of the Lambda architecture pattern and derive insights from the stream data in transit. Define at least one input for the data stream coming from your Event Hubs or IoT Hub, one query to process the input data stream, and one Power BI output to which the query results are sent.

    1. As part of your data processing with Stream Analytics, you can invoke machine-learning models to enrich your stream datasets and drive business decisions based on the predictions generated. These machine-learning models can be consumed from Azure Cognitive Services or from custom ML models in Azure Machine Learning.
  2. Use other Stream Analytics job outputs to send processed events to Azure Synapse SQL pools or Data Explorer pools for further analytics use cases.

  3. For near real-time telemetry and time-series analytics scenarios, use Data Explorer pools to easily ingest IoT events directly from Event Hubs or IoT Hub. With Data Explorer pools, you can use Kusto queries (KQL) to perform time-series analysis, geospatial clustering, and machine learning enrichment.
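
A typical hot-path query groups events into fixed, non-overlapping time windows and emits one aggregate per window. The pure-Python sketch below illustrates that idea only; it is not how Stream Analytics is implemented, its start-inclusive window boundaries are a simplification, and the event shape `(timestamp, device_id, value)` is a hypothetical one.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def tumbling_window_avg(events, window_seconds: int) -> dict:
    """Average `value` per (window_start_epoch, device_id).

    `events` is an iterable of (timezone-aware datetime, device_id, value).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device, value in events:
        epoch = int(ts.timestamp())
        # Snap the timestamp down to the start of its window.
        window_start = epoch - (epoch % window_seconds)
        bucket = sums[(window_start, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


base = datetime(2024, 1, 1, tzinfo=timezone.utc)
readings = [
    (base + timedelta(seconds=5), "sensor-a", 10.0),
    (base + timedelta(seconds=50), "sensor-a", 20.0),
    (base + timedelta(seconds=65), "sensor-a", 30.0),
]
for (start, device), avg in tumbling_window_avg(readings, 60).items():
    print(datetime.fromtimestamp(start, tz=timezone.utc), device, avg)
```

The first two readings fall in the same 60-second window and are averaged together, while the third starts a new window.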

Serve

  1. Business analysts then use Power BI real-time datasets and dashboard capabilities to visualize the fast-changing insights generated by your Stream Analytics query.

  2. Data can also be securely shared to other business units or external trusted partners using Azure Data Share. Data consumers have the freedom to choose what data format they want to use and also what compute engine is best to process the shared datasets.

  3. Structured and unstructured data stored in your Synapse workspace can also be used to build knowledge mining solutions and use AI to uncover valuable business insights across different document types and formats including from Office documents, PDFs, images, audio, forms and web pages.

Discover and govern

Data governance is a common challenge in large enterprise environments. On one hand, business analysts need to be able to discover and understand data assets that can help them solve business problems. On the other hand, Chief Data Officers want insights on privacy and security of business data.

Microsoft Purview

  1. Use Microsoft Purview for data discovery and for insights on your data assets, data classification, and data sensitivity across the entire organizational data landscape.

  2. Microsoft Purview can help you maintain a business glossary with the specific business terminology required for users to understand the semantics of what datasets mean and how they are meant to be used across the organization.

  3. You can register all your data sources and organize them into Collections, which also serve as a security boundary for your metadata.

  4. Set up regular scans to automatically catalog and update relevant metadata about data assets in the organization. Microsoft Purview can also automatically add data lineage information based on information from Azure Data Factory or Azure Synapse pipelines.

  5. Data classification and data sensitivity labels can be added automatically to your data assets based on pre-configured or custom rules applied during the regular scans.

  6. Data governance professionals can use the reports and insights generated by Microsoft Purview to keep control over the entire data landscape and protect the organization against any security and privacy issues.
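
To give a feel for what a custom classification rule does during a scan, the sketch below labels a column when enough of its sampled values match a pattern. It is a conceptual illustration only, not Purview's scanning engine or API: the rule names, the patterns, the `EMP-` identifier format, and the 60% threshold are all hypothetical.

```python
import re

# Hypothetical rules: classification label -> pattern a value must fully match.
RULES = {
    "Email Address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "Employee ID": re.compile(r"EMP-\d{6}"),
}


def classify_column(values, rules=RULES, threshold: float = 0.6) -> list:
    """Return the labels whose pattern matches at least `threshold`
    of the non-empty sampled values."""
    sample = [v for v in values if v]
    if not sample:
        return []
    labels = []
    for label, pattern in rules.items():
        matches = sum(1 for v in sample if pattern.fullmatch(v))
        if matches / len(sample) >= threshold:
            labels.append(label)
    return labels


print(classify_column(["ana@contoso.com", "li@contoso.org", "n/a"]))
# ['Email Address']
```

Using a match threshold rather than requiring every value to match keeps a few dirty or placeholder values from hiding a sensitive column.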

Platform services

To improve the quality of your Azure solutions, follow the recommendations and guidelines defined in the five pillars of architecture excellence of the Azure Well-Architected Framework: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.

Following these recommendations, the services below should be considered as part of the design:

  1. Azure Active Directory: identity services, single sign-on and multi-factor authentication across Azure workloads.
  2. Azure Cost Management: financial governance over your Azure workloads.
  3. Azure Key Vault: secure credential and certificate management. For example, Azure Synapse Pipelines, Azure Synapse Spark Pools, and Azure ML can retrieve from Azure Key Vault the credentials and certificates that they use to securely access data stores.
  4. Azure Monitor: collect, analyze, and act on telemetry information of your Azure resources to proactively identify problems and maximize performance and reliability.
  5. Microsoft Defender for Cloud: strengthen and monitor the security posture of your Azure workloads.
  6. Azure DevOps & GitHub: implement DevOps practices to enforce automation and compliance in your workload development and deployment pipelines for Azure Synapse and Azure ML.
  7. Azure Policy: implement organizational standards and governance for resource consistency, regulatory compliance, security, cost, and management.
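
As an example of the Key Vault item above, code that runs outside a pipeline can fetch a credential at run time instead of storing it in configuration. This sketch assumes the `azure-identity` and `azure-keyvault-secrets` packages are installed and that the caller's identity has permission to read secrets; the helper names, the vault URL, and the secret name are hypothetical.

```python
def build_secret_client(vault_url: str):
    """Create a Key Vault client that authenticates with the ambient identity.

    The Azure SDK is imported lazily so that this module can be loaded
    even where the SDK packages are not installed.
    """
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    return SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())


def get_secret_value(client, name: str) -> str:
    """Return the current value of the named secret."""
    return client.get_secret(name).value


# Example usage (requires network access and permissions, so it is not run here):
# client = build_secret_client("https://contoso-vault.vault.azure.net")
# password = get_secret_value(client, "sql-password")
```

Passing the client in as a parameter keeps the lookup easy to test with a stand-in object and keeps credentials out of source code.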

Components

The following Azure services have been used in the architecture:

Alternatives

Considerations

The technologies in this architecture were chosen because each of them provides the necessary functionality to handle the most common data challenges in an organization. These services meet the requirements for scalability and availability, while helping you control costs. The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features not covered by this design.

Specific business requirements for your analytics use cases may also call for different services or features not considered in this design.

A similar architecture can also be implemented for pre-production environments where you can develop and test your workloads. Consider the specific requirements for your workloads and the capabilities of each service for a cost-effective pre-production environment.

Pricing

In general, use the Azure pricing calculator to estimate costs. The ideal pricing tier and the total cost of each service included in the architecture depend on the amount of data to be processed and stored and on the expected performance level. Use the guide below to learn more about how each service is priced:

  • Azure Synapse Analytics serverless architecture allows you to scale your compute and storage levels independently. Compute resources are charged based on usage, and you can scale or pause these resources on demand. Storage resources are billed per terabyte, so your costs will increase as you ingest more data.

  • Azure Data Lake Storage Gen2 is charged based on the amount of data stored and on the number of transactions to read and write data.

  • Azure Event Hubs and Azure IoT Hub are charged based on the amount of compute resources required to process your message streams.

  • Azure Machine Learning charges come from the amount of compute resources used to train and deploy your machine-learning models.

  • Cognitive Services is charged based on the number of calls you make to the service APIs.

  • Microsoft Purview is priced based on the number of data assets in the catalog and the amount of compute power required to scan them.

  • Azure Stream Analytics is charged based on the amount of compute power required to process your stream queries.

  • Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality inside your applications. A Power BI Embedded instance is included in the pricing sample above.

  • Azure Cosmos DB is priced based on the amount of storage and compute resources required by your databases.
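
Most of the items above reduce to a storage term plus a usage-based compute term, so a rough first estimate is simple arithmetic. The sketch below shows that calculation; the function name and every rate in the example are made-up placeholders, not Azure prices, and real figures should come from the pricing calculator.

```python
def estimate_monthly_cost(stored_tb: float, storage_rate_per_tb: float,
                          compute_hours: float,
                          compute_rate_per_hour: float) -> dict:
    """Split a monthly estimate into storage and compute components."""
    storage = stored_tb * storage_rate_per_tb
    compute = compute_hours * compute_rate_per_hour
    return {"storage": storage, "compute": compute, "total": storage + compute}


# Placeholder rates for illustration only.
print(estimate_monthly_cost(stored_tb=10, storage_rate_per_tb=20.0,
                            compute_hours=100, compute_rate_per_hour=1.5))
# {'storage': 200.0, 'compute': 150.0, 'total': 350.0}
```

Keeping the two terms separate makes it easy to see the effect of pausing or scaling down compute, which lowers only the compute component while storage charges continue.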

Next steps