question

innotekmst-3741 avatar image
0 Votes"
innotekmst-3741 asked ·

Best practice for create ETL pipelines in Azure

Hi all,

We are planning to migrate Apache Airflow to Azure. There are two alternatives: Azure Data Factory and Azure Synapse Analytics (pipelines) I am confused between these.

In brief, we will get data from on premise databases. Then we need to transfer this data into Azure storage. There are two options here two: Azure Blob Storage and Azure Data Lake Storage. I am also confused between these choices. When do we use Blob Storage and when do we use Data Lake Storage. Then we need to ingest the data into SQL Pools(former: Sql Data Warehouse) Then we will run stored procedures for loading data into fact and dimension tables.

In order to create this ETL flow, do we need to use Azure Data Factory or Azure Synapse Analytics (pipelines etc) and why?

Thanks

azure-data-factoryazure-synapse-analyticsazure-data-lake-storage
· 1
10 |1000 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Since your final data destination is SQL Pools, I would choose Synapse for this. ADF pipelines and data flows are essentially the same in ADF & Synapse. One slight nitpick to your statement ... If you are going to run stored procs after landing the data in SQL Pools, that is actually ELT. In ADF, ETL would actually mean that you use dataflows for your dimension and fact table management.

1 Vote 1 ·

1 Answer

SamaraSoucy-MSFT avatar image
1 Vote"
SamaraSoucy-MSFT answered ·

Mark is correct that the two services are very similar, just as you have noticed. If your output is within Synapse, then using Pipelines will be easier unless you need one of the features only available in ADF. The differences are listed here: https://docs.microsoft.com/en-us/azure/synapse-analytics/data-integration/concepts-data-factory-differences

For storage integration with Synapse, you can use either, and which one depends on how you are going to access the data. From the information you've provided, I would choose Data Lake, especially if you think you'll ever need to access the data directly from either SQL or Spark pools within Synapse. Data Lake is built on top of Blob Storage, so there is no difference in the underlying infrastructure. The difference is that Blob storage is essentially a general-purpose storage option, where as Data Lake adds big data optimized drivers, and use Hadoop permissions and access, which pairs better with Synapse pools.


· 1 ·
10 |1000 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.