Transient error

MRD 40 Reputation points
2024-05-14T13:57:33.25+00:00

I had a data flow in ADF that failed due to a transient error in the middle of execution. The data flow performed insert, update, and delete operations based on an SCD2 scenario. I applied the retry option to avoid transient failures, but by the second attempt some data had already been loaded into the sink table during the first attempt, causing duplicate records in the table for the insert scenario. I did not implement CDC at the source level, and I did not see any checkpoint option in the sink settings to avoid duplicate data. Is there any option to restart the job from the point of failure during the retry?

Because CDC is not implemented at the source level, no checkpoint option is available in the sink settings to avoid duplicate data.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
    Harishga 4,325 Reputation points Microsoft Vendor
    2024-05-14T15:00:13.21+00:00

    Hi @MRD
    Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    To avoid duplicate inserts in ADF data flows with SCD2 scenarios during retries, there are a few approaches you can take:

    Firstly, you can utilize ADF's built-in retry functionality for the data flow activity within your pipeline. This involves defining the number of retries and the wait interval between attempts. You can also enable state persistence on your Integration Runtime to store the data flow's execution state between retries. However, state persistence only works for certain data flow types and might not guarantee complete avoidance of duplicates for SCD2 inserts in all cases, especially if the failure happened after some inserts.
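
    For illustration, the retry settings live on the activity's policy in the pipeline definition. A minimal JSON sketch (the activity and data flow names here are placeholders):

        {
            "name": "Scd2DataFlow",
            "type": "ExecuteDataFlow",
            "policy": {
                "retry": 2,
                "retryIntervalInSeconds": 120
            },
            "typeProperties": {
                "dataflow": {
                    "referenceName": "scd2_dataflow",
                    "type": "DataFlowReference"
                }
            }
        }

    Note that this only re-runs the whole activity on failure; it does not by itself make the insert path idempotent.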

    Secondly, you can implement a custom retry logic with a high water mark table. This involves creating a separate table to store the "high water mark" of the data flow execution. This table would have a single column that keeps track of the last successfully processed record identifier. In your data flow, before performing the insert operation, you would query the high water mark table and filter the incoming data to exclude records processed before the last successful run. You would then update the high water mark table after a successful insert batch. This approach offers more control over retry behavior and avoids duplicate inserts for SCD2 scenarios.
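
    As a concrete sketch, the high water mark table and its use could look like the following T-SQL (all table and column names here are hypothetical):

        -- One row per data flow, tracking the last key that was loaded successfully.
        CREATE TABLE dbo.etl_watermark
        (
            flow_name      VARCHAR(100) NOT NULL PRIMARY KEY,
            last_loaded_id BIGINT       NOT NULL
        );

        -- Before inserting, exclude rows already processed in a previous attempt.
        SELECT s.*
        FROM dbo.source_table AS s
        WHERE s.id > (SELECT last_loaded_id
                      FROM dbo.etl_watermark
                      WHERE flow_name = 'scd2_flow');

        -- After a successful batch, advance the watermark.
        UPDATE dbo.etl_watermark
        SET last_loaded_id = (SELECT MAX(id) FROM dbo.source_table)
        WHERE flow_name = 'scd2_flow';

    On a retry, the filter naturally skips everything inserted before the failure, so the insert path becomes idempotent.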

    Thirdly, you could consider implementing CDC at the source level if feasible. This allows ADF to capture only the changes in the source data, eliminating the need for full data transfers and reducing the risk of duplicates during retries. Alternatively, you could evaluate whether a different data integration tool might be a viable option for your specific needs.
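
    If your source is SQL Server, for example, enabling CDC is a matter of two system procedures (the schema and table names below are assumptions):

        -- Enable CDC at the database level, then on the individual source table.
        EXEC sys.sp_cdc_enable_db;

        EXEC sys.sp_cdc_enable_table
            @source_schema = N'dbo',
            @source_name   = N'source_table',
            @role_name     = NULL;

    ADF can then read only the changed rows rather than reloading the full table on every run.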

    Choosing the right approach depends on your specific data flow type, the nature of transient errors, and the complexity of your SCD2 logic. It's important to consider the trade-offs between development effort, performance overhead, and the level of control you require over retry behavior.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful?". And if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
