PavanKumarChakilam-4986 asked:

ADF : processing large volume of data with API calls in Azure datafactory pipeline

Hi all, can someone suggest the best possible design for the problem statement below?

Problem statement:
We receive files containing anywhere from 3k to 700k records, typically more than 5k.

The ADF pipeline needs to call an API to process each record.

Our current design:
The ADF pipeline uses a Copy activity to load the records into a staging table, then a Lookup activity (which pulls only 5k records) and a ForEach activity that calls the API once per record.

The ForEach activity also has a batch-count limitation for processing the records: a minimum of 20 and a maximum of 50.

Because of the 5k limit on the Lookup activity, we cannot process more than 5k records. Even those 5k records take a long time to process, since we have to call the API for each record.

Solutions tried so far:
We asked the source team to split the files. For example, if they are going to send 20k records, we asked them to send 4 files (test_1_ddmmyyyy.txt, test_2_ddmmyyyy.txt, etc.), each containing 5k records.

We created 4 triggers and, based on the file name, process the files in parallel. So if we receive 4 files, 4 pipeline instances are triggered and process the records in parallel. But when we checked the execution time, each pipeline took 3 hours.

The upstream team said the maximum record count is 65k, so we decided to create 13 triggers to process those files.

Is there a better way to design this solution? In some cases we will receive a file with a volume of 700k records, which means the source team would need to split it into files of 5k each, for a total of 140 files and 140 triggers. This seems like a crazy volume and design, but we don't see any other solution, since our pipelines make a web API call for each record.

The API does not accept multiple records at a time, so we don't have any other approach.

Kindly review the problem and provide your valuable suggestions.

azure-data-factory


1 Answer

MartinJaffer-MSFT answered:

Hello and welcome to Microsoft Q&A @PavanKumarChakilam-4986 . Thank you for laying out the requirements and challenges so methodically.

Before trying to create a solution, I would like to question an underlying idea:
Given all the API requests, is Azure Data Factory the best tool for the job? Maybe a different service would be more appropriate?
It is unfortunate that the API does not offer a bulk operation.

I could imagine a homebrew solution handling 3k-5k records with less overhead than Data Factory, but 700k is definitely parallel-computing territory if you want it done fast.
So, in a general sense, I imagine one optimal solution being:

  1. Check the incoming file size and, based on that, create a pool of W worker nodes.

  2. Give each worker node a chunk of R records.

  3. The workers proceed at their own speed, sending rows to the API. When a worker finishes its chunk, give it another chunk of records.

  4. Continue until there are no more chunks to give.

This gives us two variables to fine-tune: W, the number of workers per file size, and R, the size of each chunk of records.
If there are too many workers, the API might throttle or shut down.
If the chunk size is too small, time is wasted waiting for or passing data; if it is too large, each worker needs more memory.
I chose to give workers multiple chunks, instead of simply dividing the file into W pieces, to reduce memory requirements.
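The four steps above can be sketched in Python with a thread pool. This is a minimal sketch, not a definitive implementation: `call_api` is a hypothetical stand-in for the real per-record endpoint, and the default `workers` and `chunk_size` values are placeholders to tune.

```python
from concurrent.futures import ThreadPoolExecutor

def call_api(record):
    # Hypothetical stand-in for the real per-record API call.
    return f"processed:{record}"

def chunks(records, size):
    # Split the record list into chunks of R = `size` records (step 2).
    for i in range(0, len(records), size):
        yield records[i:i + size]

def process_chunk(chunk):
    # A worker handles one chunk, calling the API once per record (step 3).
    return [call_api(r) for r in chunk]

def process_file(records, workers=8, chunk_size=500):
    # A pool of W = `workers` threads pulls chunks until none remain
    # (steps 1 and 4); a fast worker naturally takes on more chunks.
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk_result in pool.map(process_chunk, chunks(records, chunk_size)):
            results.extend(chunk_result)
    return results
```

Raising `workers` (W) trades throughput against API throttling, and raising `chunk_size` (R) trades scheduling overhead against per-worker memory.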


Thanks for your response @MartinJaffer-MSFT .

Given all the API requests, is Azure Data Factory the best tool for the job? Maybe a different service would be appropriate?
You are correct. I dislike the decision to architect this so that APIs are called from ADF. Yes, maybe a different service would be the right solution.

Since you suggested a worker-based solution that scales with the file size, could you please give me detailed steps and an example of how we can achieve it?

Appreciate your help!

MartinJaffer-MSFT replied to PavanKumarChakilam-4986:

Sorry for the delay @PavanKumarChakilam-4986 .

Might Data Flow's external call work better? Data Flow can read the whole file, and the External Call transformation works row by row, effectively doing what you are trying to do. This slipped my mind earlier. Try this first, as it may require less learning than the discussion below.

Many services, including ADF, try to do this parallel architecture for you behind the scenes. Most of the ones I recommend will require some level of coding ability. The more technically capable you are, the more options are available to you. My list is not exhaustive.

If you use Azure Synapse, the Spark pool with notebooks is the way to go. Synapse is Data Factory plus additional database, compute, and analytics features.
Similar to Spark in Synapse is Azure Databricks.
HDInsight is quite heavy, with a steep learning curve; possibly overkill for this use case.
You could go low-level and manage Azure virtual machines and virtual networks yourself.
You could write a parallel program on a slightly larger virtual machine.

I would recommend Azure Synapse, as it manages the pool for you, offers more languages to choose from, and has many integration options.
Reading data into spark
(I'm having trouble finding a certain article on notebook calling other notebooks.)
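If the Spark route is taken, the per-record API call maps naturally onto `mapPartitions`. A minimal sketch under assumptions: `call_api` is a hypothetical stand-in for the real HTTP call, and the storage path in the comments is illustrative. The partition handler itself is plain Python, so it can be developed and tested outside the notebook.

```python
def call_api(record):
    # Hypothetical stand-in for the real HTTP call to the API.
    return {"id": record["id"], "status": "ok"}

def process_partition(rows):
    # Spark hands each executor an iterator over one partition's rows;
    # we call the API once per row and yield the responses.
    for row in rows:
        yield call_api(row)

# Inside a Synapse or Databricks notebook this handler would be driven
# by Spark, roughly like this (path and options are assumptions):
#
#   df = spark.read.option("header", True).csv("abfss://<container>@<account>.dfs.core.windows.net/test_1_ddmmyyyy.txt")
#   results = (df.rdd.map(lambda r: r.asDict())
#                .mapPartitions(process_partition)
#                .collect())
```

Because each partition is processed by one executor, the number of partitions plays the role of W and the partition size plays the role of R from the earlier discussion.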


Thanks for your response @MartinJaffer-MSFT .

We have already tried Data Flows with external calls, but unfortunately ran into timeout issues. Our current environment does not give us much ability to make use of Synapse or Databricks.
