Best practices for creating ETL pipelines in Azure

Ömer Faruk Özsakarya 81 Reputation points
2021-02-27T18:23:02.673+00:00

Hi all,

We are planning to migrate from Apache Airflow to Azure. There are two alternatives, Azure Data Factory and Azure Synapse Analytics (pipelines), and I am not sure which one to choose.

In brief, we will get data from on-premises databases and transfer it into Azure storage. There are two options here too: Azure Blob Storage and Azure Data Lake Storage. I am also unsure about this choice: when should we use Blob Storage, and when Data Lake Storage? After that, we need to ingest the data into SQL pools (formerly SQL Data Warehouse) and run stored procedures to load it into fact and dimension tables.
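To make the flow above concrete, here is a rough outline of it as a pipeline definition, written as a Python dictionary in the JSON shape that Data Factory and Synapse pipelines share. This is only a sketch under assumptions: the dataset names, activity names, and stored procedure name are hypothetical, and the source/sink settings and linked service references that a deployable pipeline needs are left out. The stored procedure activity type shown is the Data Factory one; Synapse pipelines may use a different type for a workspace SQL pool.

```python
import json


def build_pipeline(name: str) -> dict:
    """Outline: stage to storage, load the SQL pool, then run a stored procedure."""

    def dataset(ref: str) -> dict:
        # Dataset names passed in here are hypothetical placeholders.
        return {"referenceName": ref, "type": "DatasetReference"}

    def after(activity: str) -> list:
        # Run only if the named activity succeeded.
        return [{"activity": activity, "dependencyConditions": ["Succeeded"]}]

    stage = {
        "name": "StageToStorage",
        "type": "Copy",
        "inputs": [dataset("OnPremSqlTable")],
        "outputs": [dataset("StagingFiles")],
    }
    load = {
        "name": "LoadSqlPool",
        "type": "Copy",
        "dependsOn": after("StageToStorage"),
        "inputs": [dataset("StagingFiles")],
        "outputs": [dataset("SqlPoolStagingTable")],
    }
    transform = {
        "name": "LoadFactAndDimensions",
        "type": "SqlServerStoredProcedure",
        "dependsOn": after("LoadSqlPool"),
        # Hypothetical procedure name.
        "typeProperties": {"storedProcedureName": "dbo.usp_load_fact_and_dimensions"},
    }
    return {"name": name, "properties": {"activities": [stage, load, transform]}}


if __name__ == "__main__":
    print(json.dumps(build_pipeline("OnPremToSqlPool"), indent=2))
```

The three activities map one-to-one onto the steps described: copy to storage, copy into the SQL pool, then run the stored procedure, each step waiting for the previous one to succeed.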

To build this ETL flow, should we use Azure Data Factory or Azure Synapse Analytics (pipelines, etc.), and why?

Thanks

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Samara Soucy - MSFT 5,051 Reputation points
    2021-03-02T03:36:03.827+00:00

    Mark is correct that the two services are very similar, as you have noticed. If your data ends up in Synapse, then using Synapse pipelines will be easier unless you need one of the features that are only available in ADF. The differences are listed here: https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences

    For storage integration with Synapse, you can use either one; which is better depends on how you are going to access the data. From the information you've provided, I would choose Data Lake, especially if you think you'll ever need to access the data directly from either SQL or Spark pools within Synapse. Data Lake is built on top of Blob Storage, so there is no difference in the underlying infrastructure. The difference is that Blob Storage is essentially a general-purpose storage option, whereas Data Lake adds drivers optimized for big data and uses Hadoop-style permissions and access, which pairs better with Synapse pools.
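One way to see the "same infrastructure, different access" point, assuming Data Lake Storage Gen2: the same storage account is reachable through a Blob endpoint and through a `dfs` endpoint, and the latter is what the Hadoop-compatible ABFS driver uses. The short sketch below only builds the two address forms for the same file; the account, container, and path names are made up for illustration.

```python
def blob_url(account: str, container: str, path: str) -> str:
    """HTTPS address of an object via the general-purpose Blob endpoint."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


def adls_uri(account: str, container: str, path: str) -> str:
    """ABFS address of the same object via the Data Lake (dfs) endpoint."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


if __name__ == "__main__":
    # Hypothetical account, container, and file path.
    print(blob_url("mystorageacct", "staging", "sales/2021/02/orders.parquet"))
    print(adls_uri("mystorageacct", "staging", "sales/2021/02/orders.parquet"))
```

The `abfss://` form is the one typically used when reading the files from Spark, which is part of why Data Lake tends to pair better with Synapse pools.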
