DWU Consumption sink with staging vs. copy data with polybase vs. copy data with copy command

Paul Hernandez 631 Reputation points Microsoft Employee
2021-04-27T08:33:20.59+00:00

Dear experts,

I am analysing the different options to load data into azure synapses dedicated sql pool and found relevant differences in the DWU consumption.

I tried 3 different methods:

  1. Transforming the data in a data flow and then insert it with a sink with staging enable (is this using polybase in the background?)
  2. Transforming the data in a data flow, copy it to an ADLS folder in parquet format, then moving it to the database using a copy activity with Polybase.
  3. Transforming the data in a data flow, copy it to an ADLS folder in parquet format, then moving it to the database using a copy activity with the copy command.

After many tests I realized the DWU usage for the first method, the insert using a sink in a data flow, consumes around 4 times more DWUs:

91692-image.png

Could someone explain this behavior?

I repeated the tests several time, so this is not occasionally happens.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,422 questions
{count} votes

2 answers

Sort by: Most helpful
  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2021-05-03T21:59:36.067+00:00

    Hello @Paul Hernandez ,

    Aplozoges for the delay in response from my side . In data flow when we have sink as Synapse with staging we fist copy the records to from the blob to a temp table and then from temp table to the target table . tghis may be the reason you are seeing a spike on the DWU consumption .

    Thanks
    Himanshu

    0 comments No comments

  2. Paul Hernandez 631 Reputation points Microsoft Employee
    2021-05-12T15:46:09.677+00:00

    Hi @HimanshuSinha-msft ,

    thanks for asking.
    According to the Microsoft support, when you use a Sink in a data flow with the staging option enabled, the component makes use of the copy command but in a different way as in the copy activity. The data flows are optimized for upserts scenarios and it worths to mention that we are doing full loads of the tables everytime, which is of course impacting the performance.
    If I got ir correctly, the sink created a copy of the data in the data lake (staging) and the created temporary tables in the synapse SQL database and finally copying the values to the target table. It seems that for the copy activity this is not exactly the case and the step with the temporary tables is not required.

    This would explain the different in the DWUcs consumption.

    B.R.
    Paul

    0 comments No comments