How to query 3rd party Azure DataLake Gen2 and only store the results

JasonW-5564 161 Reputation points
2020-12-15T16:14:02.687+00:00

First, what I am trying to do: I want to query and aggregate raw JSON files stored in a 3rd party's Azure Data Lake (Gen2) and store those aggregates in my own data lake or relational DB. I do not want to physically copy all of those raw JSON files, both because of the data volume and velocity and because copying would add storage cost and introduce unnecessary latency. I am looking for how to do that and what the best tool set is for this (a minimal sketch of the pattern I have in mind follows the details below).

A bit more detail:

  • The data resides in the 3rd party's Azure Data Lake Gen2.
  • I have read-only access to that data lake, currently via a SAS token; that can change if SAS is not supported.
  • The data files in the lake are stored in a yyyy/mm/dd/hh folder structure, with thousands of JSON files in each hh folder.
  • Files are added to the data lake every minute of every day, always into the most recent hh folder.
  • Files are only ever added and never change once added, so once I query a folder (other than the most current one), it never changes and never needs to be re-queried.
  • I want to be able to query the files as soon as they are posted to the 3rd party's data lake.
  • Once I have queried the files, I have no further need for them and do not need to import or keep them.
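
For illustration, here is a minimal sketch of the pattern I have in mind, assuming Spark on Azure Databricks with an ABFS driver that supports fixed SAS tokens. The storage accounts, container, paths, token, and column names below are placeholders, not real values:

```python
# Minimal PySpark sketch: query the 3rd party's lake in place, keep only aggregates.
# All names (accounts, container, paths, token, columns) are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("third-party-lake-aggregates").getOrCreate()

# Authenticate to the 3rd party's ADLS Gen2 account with the read-only SAS token.
acct = "thirdparty.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{acct}", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{acct}",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{acct}", "<read-only-sas-token>")

# Read one completed hh folder in place -- the raw JSON is never copied to my storage.
raw = spark.read.json(
    "abfss://container@thirdparty.dfs.core.windows.net/2020/12/15/13/*.json")

# Aggregate, then persist only the (much smaller) result to my own lake
# (authentication to my own account is configured separately).
agg = raw.groupBy("deviceId").agg(F.count("*").alias("events"),
                                  F.sum("value").alias("total"))
agg.write.mode("append").parquet(
    "abfss://results@mylake.dfs.core.windows.net/aggregates/2020/12/15/13/")
```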
Tags: Azure SQL Database, Azure Data Lake Storage, Azure Synapse Analytics, Azure Databricks, Azure Data Factory

1 answer

  1. HarithaMaddi-MSFT 10,136 Reputation points
    2020-12-16T12:38:55.137+00:00

    Hi @JasonW-5564 ,

    Welcome to Microsoft Q&A Platform. Thanks for posting the query.

    Data Flows in Azure Data Factory are a suitable cloud ETL tool for such requirements; they offer many transformation activities that help you work with (e.g., aggregate) the data before it is loaded from source to sink.

    Event triggers in Azure Data Factory are useful for running the pipeline as soon as files are added to ADLS, and an incremental-load approach based on the last-modified date is useful for restricting each run to files uploaded within a certain time window, as in the sketch below.
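
    As a rough illustration of that incremental approach, here is a minimal Python sketch using the azure-storage-file-datalake SDK. The account, container, folder, token, and watermark values are hypothetical placeholders; in practice the watermark would be persisted between pipeline runs:

    ```python
    # Sketch: pick up only files modified since the last run, using last-modified dates.
    # Account, container, folder, token, and watermark values are hypothetical placeholders.
    from datetime import datetime, timezone
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://thirdparty.dfs.core.windows.net",
        credential="<read-only-sas-token>")
    fs = service.get_file_system_client("container")

    # Watermark from the previous run (persisted elsewhere, e.g., in my own database).
    last_run = datetime(2020, 12, 15, 13, 0, tzinfo=timezone.utc)

    # Only the most recent hh folder can still receive files, so list just that folder
    # and keep the paths added since the watermark.
    new_files = [p.name for p in fs.get_paths(path="2020/12/15/14")
                 if not p.is_directory and p.last_modified > last_run]
    # new_files now drives the query/aggregation step; the raw JSON is never copied.
    ```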

    Please reach out with further queries and we will be glad to assist.
