Extract, transform, and load (ETL) using HDInsight

Data Factory
Data Lake Storage Gen2
HDInsight

Solution Idea

Extract, transform, and load big data on demand by using Hadoop MapReduce and Apache Spark on HDInsight clusters.

Potential use cases

Azure HDInsight supports a variety of big data processing scenarios. The data can be historical (already collected and stored) or real-time (streamed directly from the source). These scenarios are summarized in Scenarios for using HDInsight. This solution idea covers the data flow for an ETL use case.

Architecture

Architecture diagram

Data flow

The data flows through the architecture as follows:

  1. In Azure Data Factory, create linked services for the source systems and data stores. Azure Data Factory pipelines support more than 90 connectors, including generic protocol connectors for data sources that have no native connector.

  2. Load data from the source systems into Azure Data Lake Storage by using the Copy Data tool.

  3. Azure Data Factory can create an on-demand HDInsight cluster. First, create an on-demand HDInsight linked service. Next, create a pipeline and add the HDInsight activity that matches the Hadoop framework in use (for example, Hive, MapReduce, or Spark).

  4. Trigger the pipeline in Azure Data Factory. The architecture assumes that Azure Data Lake Storage serves as the file system for the Hadoop script that the HDInsight activity from step 3 runs. The on-demand HDInsight cluster runs the script and writes the results to a curated area of the data lake.
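
Steps 3 and 4 can be sketched as the JSON definitions that Data Factory stores for a linked service and a pipeline. The Python below only assembles and prints those definitions; it does not call Azure. The type names `HDInsightOnDemand` and `HDInsightHive` are the Data Factory identifiers for an on-demand HDInsight linked service and a Hive activity. Everything else (the names `OnDemandHDInsight`, `DataLakeStorage`, `EtlWithHive`, the account `examplelake`, the `raw` and `curated` containers, the script path, the cluster size, and the time-to-live) is a placeholder assumed for illustration. The linked service is abridged, so check the Data Factory documentation for the required properties and the supported storage types before using it.

```python
import json


def abfss_uri(container: str, account: str, path: str) -> str:
    """Build an ABFS URI for a path in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def reference(name: str) -> dict:
    """Reference another linked service by name."""
    return {"referenceName": name, "type": "LinkedServiceReference"}


# Step 3, first half: an on-demand HDInsight linked service.
# Abridged: a real definition also needs subscription, resource group,
# and credential properties.
linked_service = {
    "name": "OnDemandHDInsight",  # hypothetical name
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,          # assumed number of worker nodes
            "timeToLive": "00:15:00",  # assumed idle time before deletion
            "linkedServiceName": reference("DataLakeStorage"),  # hypothetical
        },
    },
}

# Step 3, second half: a pipeline with one Hive activity that runs on the
# on-demand cluster. The script reads the raw area and writes the curated area.
pipeline = {
    "name": "EtlWithHive",  # hypothetical name
    "properties": {
        "activities": [
            {
                "name": "TransformRawToCurated",
                "type": "HDInsightHive",
                "linkedServiceName": reference(linked_service["name"]),
                "typeProperties": {
                    "scriptPath": "scripts/transform.hql",  # hypothetical script
                    "scriptLinkedService": reference("DataLakeStorage"),
                    # Passed to the Hive script as variables.
                    "defines": {
                        "InputPath": abfss_uri("raw", "examplelake", "/sales/"),
                        "OutputPath": abfss_uri("curated", "examplelake", "/sales/"),
                    },
                },
            }
        ]
    },
}

print(json.dumps(linked_service, indent=2))
print(json.dumps(pipeline, indent=2))
```

Swapping the Hive activity for a MapReduce or Spark activity changes the activity type and its type properties, while the reference to the on-demand linked service stays the same.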

Components

  • Azure Data Factory - Cloud-scale data integration service for orchestrating data flow.
  • Azure Data Lake Storage - Scalable and cost-effective cloud storage for big data processing.
  • Apache Hadoop - Big data distributed processing framework.
  • Apache Spark - Big data distributed processing framework that supports in-memory processing to boost performance for big data applications.
  • Azure HDInsight - Cloud distribution of Hadoop components.

Next steps

Learn more about the component technologies:

Explore related architectures: