question

TinasheChinyati-7683 avatar image
0 Votes"
TinasheChinyati-7683 asked TinasheChinyati-7683 commented

How best to copy files from onprem to azure using the file content to create partitions using ADF

Hi, I have a folder with CSV files. I want to copy and create folder partitions in Azure datalake using the file content timestamp. How best should I approach this? I tried the lookup activity but the limitation of 5000 rows blocked me. Thanks for assisting

azure-data-factory
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hello @TinasheChinyati-7683,

Thank you for Posting your query.

Could you please clarify on my understanding for the above ask?


I want to copy and create folder partitions in Azure datalake using the file content timestamp

Do you mean, if we have a file name as "abcd.csv" with timestamp column in it with value as "2021-06-14"(all rows will have same value) then in data lake you want to create below folder hierarchy.

105286-image.png

I tried the lookup activity but the limitation of 5000 rows blocked me

Do you mean, your csv files may have rows with more than 5000 and we would like to know how to process them?


0 Votes 0 ·
image.png (7.2 KiB)

@ShaikMaheerMSFT-7578 I have checked file content its less than 5000 rows. So what you said is exactly what i need, creating partitions based on timestamp. column1: 2021-06-14 column2: test1 column3:24 so test1 with value 24 must be copied to the datalake under that date partition

0 Votes 0 ·

hi @ShaikMaheerMSFT-7578, so i have a date partitions (day 02) on prem and inside that date I have csv files (max count: 2000) but I want to copy these files to the datalake using the timestamp column inside each csv file.
So I need a hierarchy as 2021 (Year)-06 (month)-14(day)- RECORD THAT CORRESPONDS TO THAT PARTICULAR DATE

[1]: /answers/storage/attachments/105302-copy.png

0 Votes 0 ·
copy.png (5.0 KiB)
ShaikMaheer-MSFT avatar image
0 Votes"
ShaikMaheer-MSFT answered ShaikMaheer-MSFT commented

Hello @TinasheChinyati-7683,

Thank you for Clarifications on Query.

Please check out the below detailed example, which implements dates hierarchy path.

For example, let's assume I have two files in folder as below.

105310-image.png

Step 1: GetMetaData activity to get files from folder

105280-getmetadataactivity.gif

Step 2: Pass GetMetaDataAcitivity outout childItems to ForEach activity

105345-foreachactivity.gif

Step 3: Create date, month & date variables

105372-createvariables.gif

Step 4: Lookup activity to get first row from file

105346-lookupactivity.gif

Step 5: Set values in to 3 variables

105347-setvariablevalues.gif

Step 6: Copy activity to copy file to destination by creating path with date hierarchy

105348-copy.gif

Step 7: Final results

105373-image.png


Hope this helps. Do let us know if you any further queries.


  • Please accept an answer if correct. Original posters help the community find answers faster by identifying the correct answer. Here is how.

  • Want a reminder to come back and check responses? Here is how to subscribe to a notification.


image.png (108.3 KiB)
foreachactivity.gif (456.2 KiB)
createvariables.gif (137.9 KiB)
lookupactivity.gif (712.6 KiB)
copy.gif (2.0 MiB)
image.png (25.7 KiB)
· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thanks @ShaikMaheer-MSFT , it works. How about if the csv files do not have a column header?

0 Votes 0 ·

Hi @TinasheChinyati-7683

Glad to know that proposed solution works for you.

If your csv file doesn’t have column names defined in it then un-select First row as header inside dataset. ADF will add default schema to it as shown below.
So, on top of that default schema you can go head and implement above explained solution.

106132-csvfilewithoutheader.png
106117-datasetwithoutheadercsv.gif

So, on top of that default schema you can go head and implement above explained solution. Hope this helps. Thank you.

Hope this helps. Thank you. Do let us know if you any further queries.


2 Votes 2 ·
BediBhawna-1495 avatar image
0 Votes"
BediBhawna-1495 answered ShaikMaheer-MSFT commented

Well explained!!! @ShaikMaheer-MSFT thankyou

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

TinasheChinyati-7683 avatar image
0 Votes"
TinasheChinyati-7683 answered TinasheChinyati-7683 commented

@ShaikMaheer-MSFT Thank you once again for the clear explain. following up on this discussion, we are able to create a date hierarchy using the file content itself, but it then copies the original file into that particular partition. How then do we ensure that we copy the corresponding details to that date. For example if we use Prop_0 column and partition using the date. I need to see test1 and 24 and all other details that fall within that date inside the partition. Another partition for 2021-07-02 will be created and Test 4 and 70 will be inside that partition and so on. Can you please assist with that. Thanks once again107283-test-case.png



test-case.png (38.1 KiB)
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

You should use dataflow to solve involved use-cases with data. Its partition support can slice data across multiple files and directories.

0 Votes 0 ·

Hi @Kiran-MSFT , thank you for your reply. I have tried the dataflow activity but it doesn't support connection to a file system on prem. So if it is possible to use the file content partition as we are now and then copy corresponding data under that date in the respective folder partition.

0 Votes 0 ·

IHi @Kiran-MSFT thanks for that input, I think i have found a workaround that to include the dataflow activity. I will post my findings when done. Thanks

0 Votes 0 ·