question

1 Vote
KranthiPakala-MSFT asked ·

Why do my dataflow pipelines spend 5 minutes in the acquiring compute state every time?

I have a number of pipelines that use dataflows for data transformations, each with different sources and sinks. Every pipeline takes exactly 5 minutes to spin up a cluster for its dataflows, even though they are all triggered at almost the same time. Is there a way to configure things so that a running pipeline reuses an existing cluster for its dataflows?

[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question] Source: MSDN


azure-data-factory

1 Answer

3 Votes
HimanshuSinhamfst-5269 answered ·

Welcome to the Microsoft Q&A (Preview) platform.

Happy to answer your query.


Dataflows run on Spark clusters that are managed by ADF. Clusters are created on demand from scratch and destroyed after the job is done. Each Dataflow activity runs on a separate cluster, so if 2 jobs run on the same IR, you are charged for 2 clusters because 2 clusters are spun up. Each job is isolated. That is why acquiring compute takes 5-6 minutes. Once compute is acquired, the job runs, and the cluster is killed after the job run completes.

There is an exception, though: you can set TimeToLive (TTL) on the Azure IR, which keeps the cluster alive for the next job, provided that job arrives within the TTL period. For example, if you set the TTL to 10 minutes, the cluster waits for another job on the same IR and, if one arrives, the cycle continues. If no job arrives within 10 minutes, the cluster is killed.
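As a rough illustration of where the TTL setting lives, here is a minimal Python sketch that builds an Azure IR definition with a 10-minute TTL and prints it as JSON. The field names (`computeProperties`, `dataFlowProperties`, `timeToLive`, and so on) reflect my understanding of the ADF integration runtime JSON and should be checked against the current documentation. The helper `azure_ir_with_ttl`, the IR name, and the core count are made up for this example.

```python
import json


def azure_ir_with_ttl(name: str, ttl_minutes: int, core_count: int = 8) -> dict:
    """Build a dict describing an Azure IR whose data flow compute has a TTL.

    Hypothetical helper: the layout is assumed to match the ADF
    integration runtime JSON; verify before using it in a deployment.
    """
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": "General",
                        "coreCount": core_count,
                        # Minutes the cluster stays alive after a job ends.
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


# Example: keep the cluster warm for 10 minutes, as in the answer above.
ir_definition = azure_ir_with_ttl("DataFlowIRWithTTL", ttl_minutes=10)
print(json.dumps(ir_definition, indent=2))
```

Every Dataflow activity that should share the warm cluster has to run on this same IR, since the TTL applies per IR.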

To reuse an already created cluster in subsequent jobs, see the following blog post from Mark Kromer.

https://techcommunity.microsoft.com/t5/Azure-Data-Factory/ADF-adds-TTL-to-Azure-IR-to-reduce-Data-Flow-activity-times/ba-p/878380

Additional info:

Related MSDN thread: https://social.msdn.microsoft.com/Forums/en-US/91d388d9-730f-4d53-93b2-2a8697513511/azure-dataflow-execution-behaviour?forum=AzureDataFactory

Hope this helps. Let us know if you have any further queries.


To add to the above explanation, TimeToLive works when activities are submitted sequentially. If multiple parallel runs are submitted, each run still takes 5 minutes to spin up its own cluster.
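The difference between sequential and parallel submissions can be sketched with a small simulation. This is a toy model, not ADF's actual scheduler. It assumes a fixed 5-minute cold start, treats a warm start as free (a simplification), and lets a job reuse a cluster only if that cluster is idle and still within its TTL. The names `Cluster` and `startup_delays` are made up for this example.

```python
from dataclasses import dataclass

COLD_START_MIN = 5.0  # cluster spin-up time reported in this thread


@dataclass
class Cluster:
    free_at: float  # minute at which the cluster's current job finishes


def startup_delays(jobs, ttl_min):
    """Return the startup delay (in minutes) each job pays.

    jobs: list of (submit_minute, run_minutes), sorted by submit time.
    ttl_min: how long an idle cluster stays alive after its job ends.
    """
    clusters = []
    delays = []
    for submit, run in jobs:
        # Reusable means: idle right now and TTL not yet expired.
        warm = next(
            (c for c in clusters if c.free_at <= submit <= c.free_at + ttl_min),
            None,
        )
        if warm is not None:
            delay = 0.0  # simplification: warm start modeled as free
            warm.free_at = submit + run
        else:
            delay = COLD_START_MIN
            clusters.append(Cluster(free_at=submit + delay + run))
        delays.append(delay)
    return delays


# Sequential jobs, each arriving within 10 minutes of the previous one ending.
sequential = [(0, 10), (16, 10), (27, 10)]
# Three jobs submitted at the same moment.
parallel = [(0, 10), (0, 10), (0, 10)]
# Second job arrives after the TTL has expired.
too_late = [(0, 10), (30, 10)]

print(startup_delays(sequential, ttl_min=10))  # [5.0, 0.0, 0.0]
print(startup_delays(parallel, ttl_min=10))    # [5.0, 5.0, 5.0]
print(startup_delays(too_late, ttl_min=10))    # [5.0, 5.0]
```

In this model only the sequential case benefits from the TTL: parallel runs find no idle cluster when they are submitted, so each pays the cold start.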


Hello,
How does this work when a pipeline has both parallel and sequential data flows? How should the TTL cluster be mapped to the data flow activities?
