Using Azure Data Lake Storage Gen2 for big data requirements

There are four key stages in big data processing:

  • Ingesting large amounts of data into a data store, in real time or in batches
  • Processing the data
  • Downloading the data
  • Visualizing the data

This article highlights the options and tools for each of these stages.

For a complete list of Azure services that you can use with Azure Data Lake Storage Gen2, see Integrate Azure Data Lake Storage with Azure services.

Ingest the data into Data Lake Storage Gen2

This section highlights the different sources of data and the different ways in which that data can be ingested into a Data Lake Storage Gen2 account.

Ad hoc data

This represents smaller data sets that are used for prototyping a big data application. There are different ways of ingesting ad hoc data depending on the source of the data.

Here's a list of tools that you can use to ingest ad hoc data.

  • From a local computer: Azure PowerShell, Azure CLI, Storage Explorer, or the AzCopy tool
  • From an Azure Storage blob: Azure Data Factory, the AzCopy tool, or DistCp running on an HDInsight cluster
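
If you prefer to script the upload yourself, the sketch below shows the general pattern with the Azure Data Lake Storage Gen2 Python SDK (azure-storage-file-datalake). This is a minimal illustration rather than one of the tools listed above; the account, file system, and file names are placeholders, and it assumes a credential that DefaultAzureCredential can resolve.

    # Minimal sketch: upload a small local file to Data Lake Storage Gen2.
    # Requires the azure-storage-file-datalake and azure-identity packages.
    # The account, file system, and file names below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("<file-system>")
    file_client = file_system.get_file_client("raw/adhoc/sample.csv")

    # Read the local file and upload it, overwriting any existing copy.
    with open("sample.csv", "rb") as data:
        file_client.upload_data(data, overwrite=True)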

Streamed data

This represents data that can be generated by various sources such as applications, devices, and sensors. This data can be ingested into Data Lake Storage Gen2 by a variety of tools. These tools usually capture and process the data on an event-by-event basis in real time, and then write the events in batches into Data Lake Storage Gen2 so that they can be further processed.

Here's a list of tools that you can use to ingest streamed data.

  • Azure Stream Analytics: see "Quickstart: Create a Stream Analytics job by using the Azure portal" and "Egress to Azure Data Lake Gen2"
  • Azure HDInsight Storm: see "Write to Apache Hadoop HDFS from Apache Storm on HDInsight"
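
The services above handle capture and batching for you. Purely to illustrate what "write the events in batches" looks like at the storage level, here's a minimal Python sketch (not part of any tool listed above) that appends a buffered batch of events to a file and then flushes it; the names and the sample events are placeholders.

    # Minimal sketch of the batch-write pattern: append a block of buffered
    # events, then flush to commit them. All names and events are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("<file-system>")
    file_client = file_system.get_file_client("streaming/events-batch-0001.json")
    file_client.create_file()

    # Pretend these events arrived from a stream and were buffered in memory.
    events = [b'{"deviceId": 1, "reading": 20.5}\n',
              b'{"deviceId": 2, "reading": 21.1}\n']
    batch = b"".join(events)

    # Append the batch at offset 0, then flush to make it visible to readers.
    file_client.append_data(batch, offset=0, length=len(batch))
    file_client.flush_data(len(batch))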

Relational data

You can also source data from relational databases. Over time, relational databases accumulate large amounts of data that can provide key insights if processed through a big data pipeline. You can use the following tools to move such data into Data Lake Storage Gen2.

Here's a list of tools that you can use to ingest relational data.

  • Azure Data Factory: see "Copy Activity in Azure Data Factory"

Web server log data (upload using custom applications)

This type of dataset is specifically called out because analysis of web server log data is a common use case for big data applications and requires large volumes of log files to be uploaded to Data Lake Storage Gen2. You can use any of the following tools to write your own scripts or applications to upload such data.

Here's a list of tools that you can use to ingest web server log data.

  • Azure Data Factory: see "Copy Activity in Azure Data Factory"
  • Azure CLI: see the Azure CLI documentation
  • Azure PowerShell: see the Azure PowerShell documentation

For uploading web server log data, and also for uploading other kinds of data (for example, social sentiment data), it's a good approach to write your own custom scripts or applications because that gives you the flexibility to include your data-uploading component as part of your larger big data application. In some cases, this code may take the form of a script or a simple command-line utility. In other cases, the code may be used to integrate big data processing into a business application or solution.
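
For example, a custom upload component might look like the following Python sketch, which walks a local directory of log files and uploads each one with the Data Lake Storage Gen2 SDK. The directory layout, account, and file system names are placeholder assumptions.

    # Minimal sketch of a custom log-upload script. Paths and names are
    # placeholders; adapt the target layout to your own retention scheme.
    import os
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    LOG_DIR = "/var/log/webserver"  # local directory containing log files

    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("<file-system>")

    for name in os.listdir(LOG_DIR):
        if not name.endswith(".log"):
            continue
        # Mirror each local file under a logs/ directory in the account.
        file_client = file_system.get_file_client(f"logs/{name}")
        with open(os.path.join(LOG_DIR, name), "rb") as data:
            file_client.upload_data(data, overwrite=True)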

Data associated with Azure HDInsight clusters

Most HDInsight cluster types (Hadoop, HBase, Storm) support Data Lake Storage Gen2 as a data storage repository. HDInsight clusters access data from Azure Storage Blobs (WASB). For better performance, you can copy the data from WASB into a Data Lake Storage Gen2 account associated with the cluster. You can use the following tools to copy the data.

Here's a list of tools that you can use to ingest data associated with HDInsight clusters.

  • Apache DistCp: see "Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2"
  • AzCopy tool: see "Transfer data with AzCopy"
  • Azure Data Factory: see "Copy data to or from Azure Data Lake Storage Gen2 by using Azure Data Factory"
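
DistCp, AzCopy, and Azure Data Factory are the appropriate choices for copying data at scale. Purely to illustrate the copy pattern for a single small object, here's a hedged Python sketch that reads a blob from the Blob (WASB) endpoint and re-uploads it through the Data Lake Storage Gen2 (DFS) endpoint; all account, container, and path names are placeholders.

    # Illustration only: copy one small blob from Blob storage (WASB) into a
    # Data Lake Storage Gen2 account. Use DistCp, AzCopy, or Data Factory for
    # real workloads; all names below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient
    from azure.storage.filedatalake import DataLakeServiceClient

    credential = DefaultAzureCredential()

    # Source: the blob (WASB) endpoint of the account used by the cluster.
    blob_service = BlobServiceClient(
        account_url="https://<source-account>.blob.core.windows.net",
        credential=credential,
    )
    blob_client = blob_service.get_blob_client("<container>", "data/part-0000.csv")
    data = blob_client.download_blob().readall()

    # Destination: the Data Lake Storage Gen2 (dfs) endpoint.
    lake_service = DataLakeServiceClient(
        account_url="https://<destination-account>.dfs.core.windows.net",
        credential=credential,
    )
    file_system = lake_service.get_file_system_client("<file-system>")
    file_system.get_file_client("data/part-0000.csv").upload_data(data, overwrite=True)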

Data stored in on-premises or IaaS Hadoop clusters

Large amounts of data may be stored in existing Hadoop clusters, locally on machines using HDFS. The Hadoop clusters may be in an on-premises deployment or may be within an IaaS cluster on Azure. There could be requirements to copy such data to Azure Data Lake Storage Gen2, either as a one-time operation or in a recurring fashion. There are various options that you can use to achieve this. Below is a list of alternatives and the associated trade-offs.

  • Use Azure Data Factory (ADF) to copy data directly from Hadoop clusters to Azure Data Lake Storage Gen2. ADF supports HDFS as a data source and provides out-of-the-box support for HDFS along with first-class, end-to-end management and monitoring. Consideration: this approach requires the Data Management Gateway to be deployed on-premises or in the IaaS cluster.
  • Use DistCp to copy data from Hadoop to Azure Storage, and then copy the data from Azure Storage to Data Lake Storage Gen2 by using one of the mechanisms described earlier (such as Azure Data Factory, the AzCopy tool, or DistCp running on an HDInsight cluster). Advantage: you can use open-source tools. Consideration: this is a multi-step process that involves multiple technologies.

Really large datasets

For uploading datasets that range into several terabytes, the methods described above can sometimes be slow and costly. In such cases, you can use Azure ExpressRoute.

Azure ExpressRoute lets you create private connections between Azure data centers and infrastructure on your premises. This provides a reliable option for transferring large amounts of data. To learn more, see Azure ExpressRoute documentation.

Process the data

Once the data is available in Data Lake Storage Gen2, you can run analysis on that data by using the supported big data applications.

Here's a list of tools that you can use to run data analysis jobs on data that is stored in Data Lake Storage Gen2.

  • Azure HDInsight: see "Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters"
  • Azure Databricks: see "Azure Data Lake Storage Gen2", "Quickstart: Analyze data in Azure Data Lake Storage Gen2 by using Azure Databricks", and "Tutorial: Extract, transform, and load data by using Azure Databricks"
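
For example, once a cluster has access to the account, Spark code running on Azure HDInsight or Azure Databricks can read the data directly through its abfss:// URI. The following is a minimal PySpark sketch under those assumptions; the path, column name, and output location are placeholders.

    # Minimal PySpark sketch: read CSV data from Data Lake Storage Gen2 and run
    # a simple aggregation. Assumes a cluster-provided `spark` session and that
    # the cluster is already authorized to access the account.
    path = "abfss://<file-system>@<storage-account>.dfs.core.windows.net/logs/"

    df = spark.read.csv(path, header=True, inferSchema=True)

    # Aggregate by a placeholder column and show the result.
    summary = df.groupBy("status").count()
    summary.show()

    # Write the aggregated results back to the same account in Parquet format.
    summary.write.mode("overwrite").parquet(path + "summary/")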

Visualize the data

Use the Power BI connector to create visual representations of data stored in Data Lake Storage Gen2. See Analyze data in Azure Data Lake Storage Gen2 by using Power BI.

Download the data

You might also want to download or move data from Azure Data Lake Storage Gen2 for scenarios such as:

  • Move data to other repositories to interface with your existing data processing pipelines. For example, you might want to move data from Data Lake Storage Gen2 to Azure SQL Database or a SQL Server instance.

  • Download data to your local computer for processing in IDE environments while building application prototypes.

Here's a list of tools that you can use to download data from Data Lake Storage Gen2.

  • Azure Data Factory: see "Copy Activity in Azure Data Factory"
  • Apache DistCp: see "Use DistCp to copy data between Azure Storage Blobs and Azure Data Lake Storage Gen2"
  • Azure Storage Explorer: see "Use Azure Storage Explorer to manage directories, files, and ACLs in Azure Data Lake Storage Gen2"
  • AzCopy tool: see "Transfer data with AzCopy and Blob storage"
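
If you'd rather script the download (for example, to pull a result file onto your local computer while prototyping), a minimal Python sketch with the Data Lake Storage Gen2 SDK might look like the following; the account, file system, and file names are placeholders.

    # Minimal sketch: download one file from Data Lake Storage Gen2 to the
    # local computer. Account, file system, and path names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("<file-system>")
    file_client = file_system.get_file_client("output/results.csv")

    # Stream the remote file's contents into a local file.
    with open("results.csv", "wb") as local_file:
        local_file.write(file_client.download_file().readall())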