Why do my dataflow pipelines spend 5 minutes in acquiring compute state every time?

KranthiPakala-MSFT 46,412 Reputation points Microsoft Employee
2020-05-08T22:50:52.553+00:00

I have a number of pipelines that use data flows for data transformations, each with different sources and sinks. Every pipeline takes exactly 5 minutes to spin up a cluster for its data flow, even though they are all triggered at almost the same time. Is there a way to configure the pipelines so that a running pipeline reuses an existing cluster for its data flows?

[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question] Source: MSDN

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. HimanshuSinha-msft 601 Reputation points
    2020-05-08T22:53:29.003+00:00

    Welcome to the Microsoft Q&A (Preview) platform.

    Happy to answer your query.

    Data flows run on Spark clusters that are managed by ADF. A cluster is created on demand from scratch and destroyed after the job is done. Each Data Flow activity runs on a separate cluster, so if 2 jobs run on the same IR, you are charged for 2 clusters because we spin up 2 clusters. Each job is isolated. That is why acquiring compute takes 5-6 minutes. Once compute is acquired, the job runs, and the cluster is killed after the job run completes.

    There is an exception, though: you can set TimeToLive (TTL) on the Azure IR, which keeps the cluster alive for the next job if that job arrives within the TTL period. For example, if you set the TTL to 10 minutes, the cluster waits to see whether another job for the same IR arrives, and the cycle continues. If no job arrives within 10 minutes, the cluster is killed.
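
    To make the TTL behaviour concrete, here is a rough Python sketch that counts how many jobs pay the cluster start-up cost for a given TTL. It is only a simplified model of what is described above, not how ADF actually schedules clusters. The `Job` class, the `count_cold_starts` function, and the sample timings are made up for illustration, and a job that arrives while the previous one is still running is assumed to get its own new cluster.

```python
from dataclasses import dataclass

# Approximate cluster acquisition time mentioned in this thread (minutes).
COLD_START_MIN = 5


@dataclass
class Job:
    arrival_min: float  # minute at which the Data Flow activity is triggered
    run_min: float      # minutes of actual data flow execution


def count_cold_starts(jobs, ttl_min):
    """Count jobs that need a fresh cluster under a simplified TTL model."""
    cold_starts = 0
    cluster_free_at = None  # minute at which the last cluster became idle
    for job in sorted(jobs, key=lambda j: j.arrival_min):
        warm = False
        if cluster_free_at is not None and ttl_min > 0:
            idle_gap = job.arrival_min - cluster_free_at
            # Warm only if the cluster is idle and still inside its TTL window.
            warm = 0 <= idle_gap <= ttl_min
        if warm:
            start = job.arrival_min
        else:
            cold_starts += 1
            start = job.arrival_min + COLD_START_MIN
        cluster_free_at = start + job.run_min
    return cold_starts


# Hypothetical schedule: three jobs triggered at minutes 0, 20 and 60.
jobs = [Job(0, 10), Job(20, 10), Job(60, 10)]
print(count_cold_starts(jobs, ttl_min=0))   # no TTL: every job starts cold
print(count_cold_starts(jobs, ttl_min=10))  # 10-minute TTL: second job reuses the cluster
```

    In this sample schedule the second job arrives 5 minutes after the first one finishes, so a 10-minute TTL saves one start-up. The third job arrives 30 minutes after the second one finishes, so it still starts cold.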

    To reuse an already created cluster in subsequent jobs, see the following blog post from Mark Kromer.

    https://techcommunity.microsoft.com/t5/Azure-Data-Factory/ADF-adds-TTL-to-Azure-IR-to-reduce-Data-Flow-activity-times/ba-p/878380
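
    For reference, a TTL setting on an Azure IR might look roughly like the definition built below. This is a sketch, not an official template: the helper `azure_ir_with_ttl` and the IR name `DataFlowIRWithTTL` are hypothetical, and the field names (`dataFlowProperties`, `timeToLive`, and so on) reflect my understanding of the integration runtime JSON, so please verify them against the current ADF documentation before use.

```python
import json


def azure_ir_with_ttl(name, ttl_minutes, core_count=8, compute_type="General"):
    """Build an Azure IR definition with a data flow TTL (field names assumed)."""
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        # Minutes to keep the cluster alive after a job finishes.
                        "timeToLive": ttl_minutes,
                    },
                }
            },
        },
    }


# Hypothetical IR name; a 10-minute TTL matches the example in this answer.
ir_definition = azure_ir_with_ttl("DataFlowIRWithTTL", ttl_minutes=10)
print(json.dumps(ir_definition, indent=2))
```

    The Data Flow activities then need to reference this IR so that subsequent jobs can land on the warm cluster.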

    Additional info:

    Related MSDN thread: https://social.msdn.microsoft.com/Forums/en-US/91d388d9-730f-4d53-93b2-2a8697513511/azure-dataflow-execution-behaviour?forum=AzureDataFactory

    Hope this helps. Let us know if you have any further queries.

    4 people found this answer helpful.
