Solution Idea
Extract, transform, and load (ETL) big data on demand by using Apache Hadoop MapReduce and Apache Spark clusters in Azure HDInsight.
Potential use cases
Azure HDInsight supports a variety of big data processing scenarios. The data can be historical (already collected and stored) or real-time (streamed directly from the source). These scenarios are summarized in Scenarios for using HDInsight. This solution idea covers the data flow for an ETL use case.
Architecture

Data flow
The data flows through the architecture as follows:
1. Using Azure Data Factory, establish linked services to source systems and data stores. Azure Data Factory pipelines support more than 90 connectors, including generic protocols for data sources that lack a native connector.
2. Load data from source systems into Azure Data Lake Storage with the Copy Data tool.
3. Azure Data Factory can create an on-demand HDInsight cluster. Start by creating an on-demand HDInsight linked service. Next, create a pipeline and use the HDInsight activity that matches the Hadoop framework in use (for example, Hive, MapReduce, or Spark).
4. Trigger the pipeline in Azure Data Factory. The architecture assumes that Azure Data Lake Storage is the file system used by the Hadoop script that the HDInsight activity created in step 3 runs. The on-demand HDInsight cluster runs the script and writes the data to a curated area of the data lake.
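The on-demand linked service and HDInsight activity described in the data flow above can be sketched as Data Factory JSON definitions. The following Python is a minimal illustration that builds those definitions as plain dictionaries; it does not call Azure. The property names follow the Data Factory JSON format as best understood here and should be checked against the current reference. All resource names, paths, sizes, and angle-bracket placeholders (for example `OnDemandHDInsight`, `DataLakeStorage`, `etl_job.py`) are hypothetical.

```python
import json

# Data Factory activity type for each Hadoop framework named in the data flow.
HDINSIGHT_ACTIVITY_TYPES = {
    "hive": "HDInsightHive",
    "mapreduce": "HDInsightMapReduce",
    "spark": "HDInsightSpark",
}


def linked_service_ref(name):
    """Return a reference to a linked service by name."""
    return {"referenceName": name, "type": "LinkedServiceReference"}


def on_demand_linked_service(name, storage_linked_service):
    """Sketch of an on-demand HDInsight linked service definition."""
    return {
        "name": name,
        "properties": {
            "type": "HDInsightOnDemand",
            "typeProperties": {
                "clusterType": "spark",
                "clusterSize": 4,          # illustrative worker node count
                "timeToLive": "00:15:00",  # idle time before the cluster is deleted
                "version": "<hdinsight-version>",
                "hostSubscriptionId": "<subscription-id>",
                "clusterResourceGroup": "<resource-group>",
                "tenant": "<tenant-id>",
                "servicePrincipalId": "<service-principal-id>",
                "servicePrincipalKey": {
                    "type": "SecureString",
                    "value": "<service-principal-key>",
                },
                # Storage that the on-demand cluster uses (hypothetical name).
                "linkedServiceName": linked_service_ref(storage_linked_service),
            },
        },
    }


def spark_pipeline(name, cluster_linked_service, storage_linked_service):
    """Sketch of a pipeline with one Spark activity on the on-demand cluster."""
    activity = {
        "name": "RunSparkEtl",  # hypothetical activity name
        "type": HDINSIGHT_ACTIVITY_TYPES["spark"],
        "linkedServiceName": linked_service_ref(cluster_linked_service),
        "typeProperties": {
            "rootPath": "scripts/spark",    # hypothetical folder holding the job
            "entryFilePath": "etl_job.py",  # hypothetical script, relative to rootPath
            "sparkJobLinkedService": linked_service_ref(storage_linked_service),
        },
    }
    return {"name": name, "properties": {"activities": [activity]}}


if __name__ == "__main__":
    cluster = on_demand_linked_service("OnDemandHDInsight", "DataLakeStorage")
    pipeline = spark_pipeline("EtlPipeline", "OnDemandHDInsight", "DataLakeStorage")
    print(json.dumps(cluster, indent=2))
    print(json.dumps(pipeline, indent=2))
```

Keeping the definitions as data makes it easy to swap the activity type for Hive or MapReduce. Only the `type` and the framework-specific `typeProperties` change, while the on-demand cluster reference stays the same.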
Components
- Azure Data Factory - Cloud-scale data integration service for orchestrating data flow.
- Azure Data Lake Storage - Scalable and cost-effective cloud storage for big data processing.
- Apache Hadoop - Big data distributed processing framework.
- Apache Spark - Big data distributed processing framework that supports in-memory processing to boost performance for big data applications.
- Azure HDInsight - Cloud distribution of Hadoop components.
Next steps
Learn more about the component technologies:
- Tutorial: Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory
- Introduction to Azure Data Factory
- Introduction to Azure Data Lake Storage Gen2
- Load data into Azure Data Lake Storage Gen2 with Azure Data Factory
- What is Apache Hadoop in Azure HDInsight?
- Invoke MapReduce Programs from Data Factory
- Use MapReduce in Apache Hadoop on HDInsight
- What is Apache Spark in Azure HDInsight
Related resources
Explore related architectures: