pankajchaturvedi-6530 asked MartinJaffer-MSFT commented

Extract files in batches from data lake in Azure Data factory

Hi Team,

My data lake storage has multiple files (almost 100) containing around 27M records in total, which I load on a daily basis (delta load) into an Azure SQL DB stage table using a Copy activity; this is causing a performance issue. After the stage table, I use a Stored Procedure activity to load data from the stage table into the target table.

Currently the pipeline takes almost 12-15 hours to complete.
The file type is CSV, but let me know if it is possible to implement the same logic for snappy.parquet files.

COPY ACTIVITY (Data Lake to STG table) ---> STORED PROCEDURE ACTIVITY (STG to Target)

Is there any way to load the files in batches?

STEP 1: first copy only the first 10 files, load them into the stage table, load the stage table into the target table, then truncate the stage table.

STEP 2: then copy files 10-20, load them into the stage table, load the stage table into the target table, then truncate the stage table.

Could you please help me here?

I am looking forward to your response.


Thanks,
Pankaj

azure-data-factory · azure-sql-database · azure-data-lake-storage

Hi Team,

I couldn't get any response. Can someone please help me here?

Thanks,
Pankaj


Hello @pankajchaturvedi-6530 and welcome to Microsoft Q&A.

If I understand correctly, the entire process is taking 12-15 hours. Have you looked into which step is the bottleneck? I am wondering if it is the stored procedure, or perhaps the number of parallel writers.
Troubleshoot Copy Activity Performance

Is your SQL DB on-premises or in the cloud? Is it the Azure SQL DB service, or one hosted on your own VM?

Data Factory does have support for Parquet. There is also an option for Snappy type compression. I'm not sure what "Snappy.parquet" is.

You said you are already using delta / incremental load. Beyond being incremental, what did you have in mind for batches? More frequent incremental loads, or perhaps partitioning the data, or something else?



Hi Martin,

Thank you for your reply.

Is it Azure SQL DB service, or one hosted on your VM? --- Yes, Azure SQL DB hosted on a VM.

Beyond being incremental, what did you have in mind for batches? --- Yes, even for the delta load I am loading 27M records from Azure Data Lake Store to the Azure SQL DB table.
So we have a set of files (around 100) in the data lake which contains 27M records. Can we load the files in groups? For example:
step 1: first extract the initial 10 files from the storage account, load them into the SQL DB table, then truncate the stage table.
step 2: then extract files 10-20 from the storage account and load them into the SQL DB table... etc.

I am looking forward to your response. Thanks!

@pankajchaturvedi-6530
Okay. Once we have decided how to break up the files into groups, we can use ForEach to iterate over each group.

If your Data Lake is Gen1, you can use the "Name Range" option for a few files at a time.
If your Data Lake is Gen2, you can use the "Wildcard File Path" if your files are already batched into folders and you want to do a folder at a time.
If your Data Lake is blob, you can use the blob prefix option.

To do only 1 file at a time, use the "File path in Dataset". This works for all data lakes. You will need to parameterize the dataset. Pictures below.
93745-image.png
93576-image.png
When using a parameterized dataset in the Copy activity, you will need to pass in values. @item() points to the current item in the ForEach loop.
93728-image.png
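As a rough sketch of what the screenshots show (the dataset name "SourceFileDataset" and parameter name "fileName" here are placeholders, not taken from the thread), the Copy activity's source dataset reference inside the ForEach might look like:

```json
{
    "referenceName": "SourceFileDataset",
    "type": "DatasetReference",
    "parameters": {
        "fileName": "@item().name"
    }
}
```

Here @item().name resolves to the file name of the current loop iteration, which the dataset then uses in its file path.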



Hi Martin,

Thanks for your quick response.

If your Data Lake is Gen2, you can use the "Wildcard File Path" if your files are already batched into folders, and you want to do a folder at a time. ---> Yes, my Data Lake is Gen2, but my files are not batched into folders; they are all in a single folder (100 files), so I want to extract the files in batches from that single folder.

step 1: first extract the initial 10 files from the storage account folder, load them into the SQL DB table, then truncate the stage table.
step 2: then extract files 10-20 from the storage account and load them into the SQL DB table... etc.

I am looking forward to your response. Thanks!

Hi Team, could you please respond here? Thanks!

pankajchaturvedi-6530 answered

Hi Team, could you please help me here? Thanks!


MartinJaffer-MSFT answered MartinJaffer-MSFT commented

@pankajchaturvedi-6530

Three-step process:
1. Get the list of files in the folder.
2. Exclude anything that is not a file (subfolders).
3. ForEach loop over the list of files, passing each file name to the Copy activity.

Set up the dataset and a Get Metadata activity to fetch the files in the folder:
94327-image.png
94439-image.png

Filter activity:
Items: @activity('Get List of Files').output.childItems
Condition: @equals(item().type,'File')

Set up ForEach loop
94470-image.png
Batch count: 10, or however many you want processed at a time
Items: @activity('Exclude Folders').output.Value

Parameterize the dataset for copy activity.
94409-image.png
94515-image.png

Pass file name to copy activity
94410-image.png
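Putting the settings above together, the three activities can be sketched in pipeline JSON (the activity names match the settings shown above; the dataset reference name "SourceFolderDataset" is a placeholder, and the parameterized Copy activity would go inside the ForEach's "activities" array):

```json
{
  "activities": [
    {
      "name": "Get List of Files",
      "type": "GetMetadata",
      "typeProperties": {
        "dataset": { "referenceName": "SourceFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
      }
    },
    {
      "name": "Exclude Folders",
      "type": "Filter",
      "dependsOn": [ { "activity": "Get List of Files", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('Get List of Files').output.childItems", "type": "Expression" },
        "condition": { "value": "@equals(item().type,'File')", "type": "Expression" }
      }
    },
    {
      "name": "ForEach File",
      "type": "ForEach",
      "dependsOn": [ { "activity": "Exclude Folders", "dependencyConditions": [ "Succeeded" ] } ],
      "typeProperties": {
        "items": { "value": "@activity('Exclude Folders').output.value", "type": "Expression" },
        "batchCount": 10,
        "activities": []
      }
    }
  ]
}
```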




Hi Martin, thank you so much for your effort and for providing the solution.

I will implement the same and get back to you.

Could you please explain why we are using the Filter (Exclude Folders) activity, given that the files are in a single folder?

Thanks!


@pankajchaturvedi-6530 I included the filter just in case there were sub-folders. I tried to think of anything that could go wrong.

You are correct that it is not necessary. If you are confident there are no sub-folders, only files, you can remove that step.

When removing the filter, the ForEach items will need to be adjusted to point to the Get Metadata output.
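Using the activity name from the answer above, the ForEach Items setting would then point directly at the Get Metadata output:

```
Items: @activity('Get List of Files').output.childItems
```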


@pankajchaturvedi-6530 how is it going? Were you able to implement the solution?
