question

SubinPius-5180 asked · PRADEEPCHEEKATLA-MSFT commented

Incrementally Load Data to Azure Data Lake

Hi,

I understand the concept of incremental load to a data lake, with each day's data stored as a separate file in Data Lake Storage.

My question is how to handle source records that were updated, rather than inserted, during an incremental load to Data Lake Storage.

For example, say I have a record in the requests table of an on-premises SQL Server database with the status open.
When the ADF pipeline runs today, this record is stored in a CSV file in Data Lake Storage.
Tomorrow the status of the record changes to pending, and when the ADF pipeline runs again, the modified record is read and copied to a new file in Data Lake Storage with the status pending.

Now Data Lake Storage holds 2 files containing the same request record but with different statuses.

How can I handle such scenarios in Data Lake Storage so that I end up with a single record and no duplicates?

azure-data-factory azure-data-lake-storage

Hello @SubinPius-5180,

Just checking in to see if the below answer provided by @nasreen-akter helped. If this answers your query, do click Accept Answer and Up-Vote for the same. And, if you have any further query do let us know.


Hello @SubinPius-5180,

Following up to see if the below suggestion was helpful. And, if you have any further query do let us know.


1 Answer

nasreen-akter answered · nasreen-akter edited

Hi @SubinPius-5180,

Thank you for using MS Q&A.

I think you can do the following:

Option #1: Keep an updated_time column on each record. When the consumer process picks up the data from the data lake, it sorts the rows by updated_time for each recordId and processes only the latest row for that recordId.

Option #2: If it's a full load each time, you can put a timestamp in the file name, e.g., 20210607 (yyyyMMdd), or maintain a date-based folder hierarchy for the CSV files, and then let the consumer process pick up only the latest file.
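As a rough illustration of both options, here is a minimal Python sketch of what the consumer side might look like. It is only a sketch under assumptions: the column names recordId, status and updated_time follow the names used in this thread, the ISO-8601 timestamp format and the requests_yyyyMMdd.csv file names are hypothetical, and the two daily files are simulated as in-memory strings rather than read from Data Lake Storage.

```python
import csv
import io
import re
from datetime import datetime


def latest_per_record(rows, key="recordId", ts="updated_time"):
    """Option 1: keep only the most recently updated row per record id."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace the stored row only when this one carries a newer timestamp.
        if current is None or (
            datetime.fromisoformat(row[ts]) > datetime.fromisoformat(current[ts])
        ):
            latest[row[key]] = row
    return list(latest.values())


def latest_file(names):
    """Option 2: return the file name with the most recent yyyyMMdd stamp."""
    dated = []
    for name in names:
        match = re.search(r"(\d{8})", name)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d")
            dated.append((stamp, name))
    return max(dated)[1] if dated else None


# Two simulated daily extracts (hypothetical contents and timestamps).
day1 = (
    "recordId,status,updated_time\n"
    "101,open,2021-06-07T09:00:00\n"
    "102,open,2021-06-07T09:30:00\n"
)
day2 = (
    "recordId,status,updated_time\n"
    "101,pending,2021-06-08T11:00:00\n"
)

# Read every daily file, then collapse to one row per recordId.
all_rows = [
    row
    for text in (day1, day2)
    for row in csv.DictReader(io.StringIO(text))
]
current_rows = latest_per_record(all_rows)

# For full loads, the consumer would read only the newest file instead.
newest = latest_file(["requests_20210607.csv", "requests_20210608.csv"])
```

With option 1 the files in the lake stay append-only and the deduplication happens at read time, so record 101 appears once with its latest status. Option 2 only makes sense when each file is a complete snapshot.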

Hope this will help. Thanks!

--Nasreen
