Extract Delta Changes on big CSV files

Lintao Yu 1 Reputation point
2021-05-20T03:46:23.307+00:00

We are exporting data from Microsoft Dataverse (Dynamics 365) into Azure Data Lake. The files are saved in CSV format and partitioned into yearly files based on the Modified On date. For some frequently used tables, a year's file can grow quite large. How can we grab only the delta changes every few minutes?

I know you can do incremental loads from SQL tables using date watermarks, but that requires a SQL query. Is there a way to quickly query a CSV file and extract only the data within that watermark window?
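For context, the watermark-window filter described above can be sketched in plain Python. This is a minimal illustration, not a Dataverse-specific solution; the `modifiedon` column name and ISO-8601 timestamps are assumptions about the exported CSV layout:

```python
import csv
import io
from datetime import datetime

def extract_delta(csv_text, low_watermark, high_watermark, ts_column="modifiedon"):
    """Return rows whose timestamp falls in the (low, high] watermark window.

    Assumes a header row and an ISO-8601 timestamp column named `ts_column`.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    delta = []
    for row in reader:
        ts = datetime.fromisoformat(row[ts_column])
        if low_watermark < ts <= high_watermark:
            delta.append(row)
    return delta

# Example: two rows, only the second falls inside the window.
sample = (
    "id,modifiedon,name\n"
    "1,2021-05-19T10:00:00,alpha\n"
    "2,2021-05-20T03:00:00,beta\n"
)
rows = extract_delta(
    sample,
    low_watermark=datetime(2021, 5, 20),
    high_watermark=datetime(2021, 5, 21),
)
```

At scale this full-file scan is exactly the problem: every run re-reads the whole yearly file, which is why the answer below pushes the delta split upstream instead.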

Any thoughts would be appreciated.

Tags:
Azure Databricks: An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory: An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. OmarSiado-MSFT 156 Reputation points Microsoft Employee
    2021-05-20T18:28:57.363+00:00

    Hi @Lintao Yu ,

    I think you could use SQL to create separate files containing only the deltas. You will need to define a pattern for file naming; then you can use the Get Metadata activity to pick up only the files created since a given time, the last 4 hours for example. Another option worth a look is an event-based or tumbling window trigger that fires every time a file is created in the container.
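    A minimal local sketch of the "files created since x time" idea, using file modification times (in Data Factory you would use the Get Metadata activity's `lastModified` field instead; the folder layout and file names here are illustrative):

    ```python
    import os
    import time
    import tempfile
    from pathlib import Path

    def files_modified_since(folder, hours):
        """Return *.csv paths under `folder` modified within the last `hours` hours."""
        cutoff = time.time() - hours * 3600
        return [p for p in Path(folder).glob("*.csv") if p.stat().st_mtime >= cutoff]

    # Example: one fresh delta file and one stale file backdated 10 hours.
    tmp = tempfile.mkdtemp()
    fresh = Path(tmp, "delta_2021-05-20.csv")
    fresh.write_text("id,modifiedon\n")
    stale = Path(tmp, "delta_2020-01-01.csv")
    stale.write_text("id,modifiedon\n")
    old = time.time() - 10 * 3600
    os.utime(stale, (old, old))

    recent = files_modified_since(tmp, hours=4)
    ```

    With a timestamped naming pattern like `delta_<date>.csv`, a copy activity only needs to touch the small delta files rather than the large yearly CSVs.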

    Here are links to some related documentation:
    https://learn.microsoft.com/en-us/azure/data-factory/solution-template-copy-new-files-lastmodifieddate

    https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger

    https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger

    https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity

    Please let us know and/or vote in case the suggestion works for you!
