question

0 Votes"
RamanathanDakshinaMurthy-0993 asked SathyamoorthyVijayakumar-MSFT commented

How to model thousands of files from Azure Data Lake Gen 2 to Single dataset for analysis?

Hi,

I have an initial load of thousands of delimited files in an Azure Data Lake Gen 2 storage account. I need to read all of these files and combine them into a single dataset for analysis. This dataset must be preserved so that future files can be added to it. After the initial files are processed, only a few new files will arrive each day; these need to be read and their data appended to the existing dataset. Business users will then use this single modeled dataset for analysis.


Currently it takes a lot of time to query all these files. We want a faster and more cost-effective approach.

azure-data-factory · azure-synapse-analytics · azure-databricks · azure-data-lake-storage · azure-hdinsight

@RamanathanDakshinaMurthy-0993 - Just checking in to see whether the answer below from @ThomasBoersma resolves your query. If it does, please click Accept Answer and Up-Vote it. And if you have any further questions, do let us know.


1 Answer

1 Vote"
ThomasBoersma answered ThomasBoersma commented

Hi @RamanathanDakshinaMurthy-0993 ,

A possible solution is to create a staging folder inside your Azure Data Lake Gen2 (ADLS Gen2) account. In this staging folder you place the delimited files that need to be processed / transformed. Processing can be done with a Data Flow in Azure Synapse or Azure Data Factory; if the transformations are too complex, you could instead use a notebook in Databricks or the integrated Synapse notebooks. The source of the process should import all the delimited files at once, for example with a wildcard: staging_folder\*.csv.
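For illustration, here is a minimal PySpark sketch of that wildcard import, assuming a Databricks or Synapse Spark notebook; the storage account, container, and folder names are placeholders:

```python
# Minimal sketch: read every delimited file in the staging folder in one pass.
# The abfss path below is a placeholder - replace the container/account/folder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

staging_path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/staging/*.csv"

df = (
    spark.read
    .option("header", "true")       # assumes the files contain a header row
    .option("inferSchema", "true")  # or pass an explicit schema for faster reads
    .csv(staging_path)
)

# df now holds the combined contents of all staged files and can be transformed.
```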
Once the data is loaded into the data flow or data frame, you can process it. After the processing is done, you can store the data as a Delta Lake table within your ADLS Gen2. The delimited files should be deleted from the staging folder at that point, because you don't want them to be processed a second time. Finally, you can create a Synapse database view (using the Synapse serverless pool) on top of the Delta Lake table in your ADLS Gen2 for retrieving the data. I added an image that shows the steps I described in an abstract way.
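A minimal sketch of the write and clean-up step, continuing from the data frame above (the paths are placeholders; `dbutils` is the Databricks file utility and `mssparkutils` its Synapse equivalent):

```python
# Persist the processed data as a Delta Lake table in ADLS Gen2.
delta_path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/curated/my_dataset"

(
    df.write
    .format("delta")
    .mode("overwrite")   # initial bulk load; daily loads would use "append" instead
    .save(delta_path)
)

# Remove the staged files so they are not processed a second time.
# In Databricks:
dbutils.fs.rm("abfss://<container>@<storageaccount>.dfs.core.windows.net/staging/", True)
# In a Synapse notebook the equivalent would be:
# mssparkutils.fs.rm("abfss://<container>@<storageaccount>.dfs.core.windows.net/staging/", True)
```

The Synapse serverless view on top of the Delta folder is then created separately in T-SQL, using OPENROWSET with the DELTA format.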
A benefit of Delta Lake is that you can partition your data, which makes it more performant and cost effective, provided you choose the partitions well. It also stores the files as Parquet, which is much better compressed than CSV. This works with hundreds of millions of rows, but if your dataset grows beyond a few billion rows I would recommend replacing the Delta Lake table with a Synapse dedicated SQL pool. A dedicated pool is very expensive compared with a serverless pool, but a lot faster.
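As a hedged variant of the write above, partitioning the Delta table by a column your users filter on (the `load_date` column here is just an example) keeps the daily appends cheap and lets queries skip irrelevant files:

```python
# Partitioned, incremental write: daily files are appended to the existing table,
# and queries that filter on load_date only read the matching partitions.
(
    df.write
    .format("delta")
    .mode("append")
    .partitionBy("load_date")  # placeholder column - pick one with sensible cardinality
    .save(delta_path)
)
```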

Hope this helps.


[Attachment: 134209-image.png (681.6 KiB) - diagram of the described steps]

Hi,

Thanks for your reply. I still have a few questions, though. Reading those files in Azure Data Factory could create a performance issue, because billions of rows might end up in memory. We would like to take advantage of clusters, compute options, and parallel processing to read those files much faster, process them, and query them for analytics. Any suggestions?

ThomasBoersma replied to RamanathanDakshinaMurthy-0993:

Hi,

You can adjust the cluster size of a data flow according to your needs (the documentation explains how). In the end, the resulting data flows are executed as activities within ADF/Synapse pipelines that run on scaled-out Apache Spark clusters, and ADF/Synapse handles all the code translation, path optimization, and execution of your data flow jobs.

But if you want more control over settings such as CPU/RAM, I would suggest using Databricks; see the Databricks documentation on configuring clusters for the possibilities. With Databricks you will always have more settings to tweak, but you need to write your own code, which sometimes takes more time.
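For example, here is a hedged sketch of creating an autoscaling cluster through the Databricks Clusters REST API; the workspace URL, token, node type, and runtime version are placeholders, and the same settings can just as well be chosen in the cluster UI:

```python
# Sketch: create an autoscaling Databricks cluster via the Clusters API 2.0.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                                # placeholder

cluster_spec = {
    "cluster_name": "bulk-csv-ingest",
    "spark_version": "13.3.x-scala2.12",        # pick a current LTS runtime
    "node_type_id": "Standard_DS4_v2",          # controls CPU/RAM per worker
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # the response contains the new cluster_id
```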

Another option is to create durable Azure Functions with a fan out/fan in pattern.
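A minimal sketch of that pattern with Python Durable Functions; the activity names (`list_staged_files`, `process_file`) are hypothetical and would each be a separate activity function in the same function app:

```python
# Fan out/fan in orchestrator: one activity per staged file, run in parallel.
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # Fan out: list the staged files, then start one activity per file.
    files = yield context.call_activity("list_staged_files", None)
    tasks = [context.call_activity("process_file", f) for f in files]

    # Fan in: wait until every activity has finished, then aggregate.
    results = yield context.task_all(tasks)
    return {"files_processed": len(results)}


main = df.Orchestrator.create(orchestrator_function)
```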

