Best practice for create ETL pipelines in Azure

Question

Hi all,

We are planning to migrate Apache Airflow to Azure. There are two alternatives: Azure Data Factory and Azure Synapse Analytics (pipelines) I am confused between these.

In brief, we will get data from on premise databases. Then we need to transfer this data into Azure storage. There are two options here two: Azure Blob Storage and Azure Data Lake Storage. I am also confused between these choices. When do we use Blob Storage and when do we use Data Lake Storage. Then we need to ingest the data into SQL Pools(former: Sql Data Warehouse) Then we will run stored procedures for loading data into fact and dimension tables.

In order to create this ETL flow, do we need to use Azure Data Factory or Azure Synapse Analytics (pipelines etc) and why?

Thanks

Accepted Answer

Mark is correct that the two services are very similar, just as you have noticed. If your output is within Synapse, then using Pipelines will be easier unless you need one of the features only available in ADF. The differences are listed here: https://learn.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences

For storage integration with Synapse, you can use either, and which one depends on how you are going to access the data. From the information you've provided, I would choose Data Lake, especially if you think you'll ever need to access the data directly from either SQL or Spark pools within Synapse. Data Lake is built on top of Blob Storage, so there is no difference in the underlying infrastructure. The difference is that Blob storage is essentially a general-purpose storage option, where as Data Lake adds big data optimized drivers, and use Hadoop permissions and access, which pairs better with Synapse pools.

Best practice for create ETL pipelines in Azure

0 additional answers