atarantin asked · ShaikMaheer-MSFT commented

Dataflow takes long time

Hello,

I have a pipeline with one dataflow. It takes between 2 and 3 minutes to run, and I do not understand why it takes so long.

For instance, one run took 2 minutes 05 seconds.

The data flow inside it took 2 minutes 04 seconds.

There are 5 steps in my dataflow:
1) Take files from an Azure storage container
2) Flatten the files
3) Add a derived column
4) Set the upsert policy (alter row) for the database
5) Insert the data into Cosmos DB

If I look at the detailed metrics, each step of the dataflow took about 10 s (with a cluster startup time of 1 s 479 ms for the dataflow).

I am already using an integration runtime with a TTL.
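For context, the TTL is configured on the Azure integration runtime used by the data flow activity. A minimal sketch of such an IR definition might look like the following; the runtime name, core count, and TTL value here are illustrative, and as far as I can tell `"cleanup": false` corresponds to the "Quick re-use" option:

```json
{
  "name": "DataFlowRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 8,
          "timeToLive": 10,
          "cleanup": false
        }
      }
    }
  }
}
```

`timeToLive` is in minutes; with quick re-use the warm cluster is handed to the next data flow run instead of being torn down.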

So do you have an idea why my pipeline/dataflow takes so long while the detailed dataflow metrics are so short?

azure-data-factory

Look at the monitoring view of the data flow activity. You can click on each transformation to see how long each step took so you can identify your bottleneck. Since you are writing data to CosmosDB, I would focus on the Sink transformation and look to see how long the Sink took to serialize the documents. You may be getting throttled by the throughput setting on your target collection.
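If throttling is a suspicion, one way to inspect the provisioned throughput on the target collection is the Azure CLI (assuming a SQL-API Cosmos DB account; the resource names below are placeholders to fill in):

```shell
# Show the current provisioned RU/s on the target container
# (placeholders <rg>, <account>, <db>, <container> must be filled in)
az cosmosdb sql container throughput show \
  --resource-group <rg> \
  --account-name <account> \
  --database-name <db> \
  --name <container> \
  --query "resource.throughput"
```

If the sink's write rate exceeds this RU/s budget, Cosmos DB returns 429 (throttled) responses and the sink slows down accordingly.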


In the monitoring view of the data flow (when I click on the glasses icon), the "sink processing time" is 11 s.


Use the bottom panel of your monitoring view to sort by processing time and find the longest processing steps.


It is the sink processing time that is the highest one.


Hi @atarantin ,

You mentioned that "each step of the dataflow took 10s", and you have 5 steps in your data flow. So if we sum them, your execution time should be only 50 to 60 s, right?

Are you saying that the execution time of the data flow itself is actually shorter, but the data flow activity shows an overall duration of 2 min 5 s?

You mentioned that you already use a TTL. What value did you set there? Kindly check the link below for more details about TTL and its billing.
https://docs.microsoft.com/en-us/answers/questions/38550/azure-data-flow-activity-time.html

Did you check the Quick re-use box under the TTL settings?

By the way, click here for more details about monitoring data flows, and here for the mapping data flows performance and tuning guide.


Yes, exactly: the whole pipeline takes 2 min 5 s, while the dataflow itself takes only 50 to 60 s.

The TTL is set to 10 minutes because the dataflow should run every 5 minutes, and the "Quick re-use" box is checked.

I have made some modifications around the dataflow to pre-filter my data before processing, and now the dataflow itself takes 1-2 s while the whole pipeline takes 20 s.

Do you know why I have such a high overhead? (20 s vs 1 s)


Hi @atarantin ,

  • Are there any other activities in the pipeline, or only the Data flow activity?

  • Is it a debug run or a trigger run?

  • If it is a trigger run, please mention the type of trigger as well.

  • It would be helpful if you could share screenshots of your pipeline run and activity run executions so we can understand the issue better. Thank you.


0 Answers