Question

DB-9790 asked MarkKromer-MSFT commented

Azure - Reduce Data Flow Activity Time

Hi there

I need to find a way to reduce the time the pipeline is consuming. I have a simple pipeline with two data flow activities that is taking a long time to process only one record:

1) Source to Staging: Overall time 5m 10 sec - Processing time 11s 803ms

2) Staging to DWH: Overall time 5m 2 sec - Processing time 2s 757ms

I read that this is because every Data Flow requires between 5-7 minutes for cluster startup, and that it is necessary to modify the TTL of the Azure IR.
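Doing the subtraction on the timings above shows that almost all of the elapsed time is overhead (cluster acquisition) rather than actual processing. A quick sketch, with the millisecond figures rounded:

```python
# Overhead = overall activity time minus the reported processing time.
# Timings are taken from the two data flow activities above.

def overhead_seconds(overall_s, processing_s):
    """Seconds not spent processing (cluster startup, queuing, etc.)."""
    return overall_s - processing_s

source_to_staging = overhead_seconds(5 * 60 + 10, 12)  # 11s 803ms, rounded
staging_to_dwh = overhead_seconds(5 * 60 + 2, 3)       # 2s 757ms, rounded

print(source_to_staging)  # 298 -> roughly 5 minutes of overhead
print(staging_to_dwh)     # 299
```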

These are my questions. If the Azure IR is modified, how will this affect pipelines that only have one Data Flow activity - are they going to experience any decrease in execution time? In my example pipeline, it is possible to replace the second Data Flow with a stored procedure; if I do this, what should happen to the execution time of the entire pipeline? And finally, what is the cost of modifying the TTL?

I do not have permission to set these values or modify the ETL, which is why I'm asking; I would like to be sure about this before making any proposal.

Regards,





azure-data-factory

KranthiPakala-MSFT answered DanielH-3193 commented

Hi DB-9790,

Welcome to Microsoft Q&A platform and thanks for your query.

  1. If you leave the TTL at 0, ADF will always spawn a new Spark cluster environment for every Data Flow activity that executes. This means a cluster is provisioned each time and takes about 4-5 minutes to become available before executing your job.

  2. If you set a TTL, then the minimum billing time will be that amount of time. ADF will maintain that pool for the TTL time after the last data flow pipeline activity executes. Note that this will extend your billing period for a data flow to the extended time of your TTL.

  3. If you have a pipeline with a single data flow activity, then it is better to use an Azure IR without TTL, since you will be billed only for the time to acquire compute + job execution time. If you set a TTL and use it for only a single data flow activity, then billing = time to acquire the warm pool + job execution time + TTL time after the last data flow pipeline activity executes.

  4. The TTL setting is helpful when you have a pipeline with sequential data flow executions, as it allows you to stand up a pool of cluster compute resources for your factory. With this pool, you can sequentially submit data flow activities for execution. Once the pool is established (the initial set-up of the resource pool takes around ~5 minutes), each subsequent job will take 1-2 minutes for the on-demand Spark cluster to execute your job (i.e., ~5 min + 2 min + 2 min + 2 min + ...). If the TTL is not set, then each subsequent job will also take ~5 min (i.e., ~5 min + ~5 min + ~5 min + ...).

  5. I would recommend having two different Azure IRs (one with no TTL and the other with a TTL set):
    a. For pipelines with single data flow activity - Use Azure IR without TTL
    b. For pipelines with sequential data flow activities - Use Azure IR with TTL set.

  6. For Data flow execution pricing, please refer to the docs below:
    a. ADF Data flow execution pricing
    b. Understanding Data Factory pricing through examples
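The arithmetic in point 4 can be sketched as a small model. The ~5 minute cold-start and ~2 minute warm-start figures are the approximations from the answer above, not guarantees:

```python
# Illustrative startup-time model for N sequential data flow
# activities on the same Azure IR, with and without a TTL.
# Figures (~5 min cold start, ~2 min warm start) come from the
# answer above and are rough approximations.

COLD_START_MIN = 5  # first cluster acquisition (pool set-up)
WARM_START_MIN = 2  # subsequent jobs while the TTL keeps the pool warm

def startup_minutes(n_flows, ttl_set):
    if n_flows == 0:
        return 0
    if ttl_set:
        # one cold start, then each job reuses the warm pool
        return COLD_START_MIN + (n_flows - 1) * WARM_START_MIN
    # without a TTL, every activity pays the full cold start
    return n_flows * COLD_START_MIN

print(startup_minutes(1, ttl_set=True), startup_minutes(1, ttl_set=False))
# 5 5  -> a single data flow gains nothing from the TTL (point 3)
print(startup_minutes(4, ttl_set=True), startup_minutes(4, ttl_set=False))
# 11 20 -> sequential flows save one cold start per activity (point 4)
```

This is why the recommendation in point 5 splits the workloads: a single-activity pipeline pays the same cold start either way but would additionally be billed for TTL idle time.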


Hope this info helps. Do let us know if you have further queries.



Thank you





Hi @KranthiPakala-MSFT, thanks a lot for your answer, it was very clear.



@KranthiPakala-MSFT, taking a look at your fourth explanation: what happens to the TTL if I have different pipelines with multiple data flows sharing the same IR? Does each one of them need to acquire the resources as well, or if they are executed in parallel (waiting for the first data flow to finish), can they use the same cluster?



On our side, we have the very same question. What happens if the data flows reside in different, subsequent pipelines instead of a sequence of data flows inside a single pipeline?

Will every pipeline need to wait for the spin-up time, despite the TTL being set to 30 minutes as in our case?

DanielH-3193 commented:

^bump

I also require an answer to this exact question.

Kiran-MSFT answered DanielH-3193 commented

Whether it's the same pipeline or another one using the same IR, the data flow will reuse the cluster if it is NOT in use (idling on TTL). If the IR is already running another data flow, it will spin up another cluster.
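The reuse rule above can be sketched as a toy model: a sequential data flow finds an idle warm cluster and reuses it, while a parallel one finds it busy and triggers a fresh spin-up. The class and method names here are illustrative, not ADF internals:

```python
# Toy model of cluster reuse on a shared IR: reuse an idle warm
# cluster if one exists; otherwise spin up a new one.

class IntegrationRuntime:
    def __init__(self):
        self.idle_clusters = 0  # warm clusters idling on TTL
        self.spawned = 0        # total clusters ever created

    def start_dataflow(self):
        if self.idle_clusters > 0:
            self.idle_clusters -= 1  # reuse a warm cluster
            return "reused"
        self.spawned += 1            # none idle: spin up a new cluster
        return "new"

    def finish_dataflow(self):
        self.idle_clusters += 1      # cluster returns to the warm pool

ir = IntegrationRuntime()
print(ir.start_dataflow())  # "new"    - nothing warm yet
ir.finish_dataflow()
print(ir.start_dataflow())  # "reused" - sequential: the warm cluster is free
print(ir.start_dataflow())  # "new"    - parallel: the warm cluster is busy
```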


Thanks Kiran.

0 Votes 0 ·
Al-6819 answered MarkKromer-MSFT commented

Kiran/Kranthi,

Do you know anything about the 'Quick Re-use' functionality being available on Synapse? We still have to wait 1-2 minutes just for the next data flow to be kicked off, and that's WITH the TTL setting enabled. You could say it's an improvement over the approx. 5-6 minutes to spin up a new cluster, but it's still not the best outcome given we have hundreds of data flows that need to be executed in sequence.

Mark Kromer, in one of his posts (https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365), referred to this functionality in the ADF context. What's the latest and greatest on this, and is there a roadmap to make it available on Synapse?

Thanks
Alex


The TTL feature in Synapse data flows was released a few weeks back. You can now start subsequent data flow activities in under 1 minute. The quick re-use feature will be released to Synapse data flows in the next 1-2 months. At that time, startup time for subsequent data flows will be reduced to 3-4 seconds.
