Upload data for Hadoop jobs in HDInsight
Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Store. Azure Storage and Data Lake Store are designed as HDFS extensions to provide a seamless experience to customers. They enable the full set of components in the Hadoop ecosystem to operate directly on the data they manage. Azure Storage and Data Lake Store are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Storage, see Use Azure Storage with HDInsight and Use Data Lake Store with HDInsight.
Note the following requirements before you begin:
- An Azure HDInsight cluster. For instructions, see Get started with Azure HDInsight or Create HDInsight clusters.
- Knowledge of the following two articles: Use Azure Storage with HDInsight and Use Data Lake Store with HDInsight.
Upload data to Azure Storage
Microsoft provides the following utilities to work with Azure Storage:
|Azure Command-Line Interface||✔||✔||✔|
While the Azure CLI, Azure PowerShell, and AzCopy can all be used from outside Azure, the Hadoop command is only available on the HDInsight cluster, and it can only load data from the cluster's local file system into Azure Storage.
The Azure CLI is a cross-platform tool that allows you to manage Azure services. Use the following steps to upload data to Azure Storage:
Azure CLI support for managing HDInsight resources using Azure Service Manager (ASM) is deprecated, and was removed on January 1, 2017. The steps in this document use the new Azure CLI commands that work with Azure Resource Manager.
Follow the steps in Install and configure Azure CLI to install the latest version of the Azure CLI. If you have scripts that need to be modified to use the new commands that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.
- Install and configure the Azure CLI for Mac, Linux, and Windows.
Open a command prompt, bash, or other shell, and use the following command to authenticate to your Azure subscription:
azure login
When prompted, enter the user name and password for your subscription.
Enter the following command to list the storage accounts for your subscription:
azure storage account list
Select the storage account that contains the blob you want to work with, then use the following command to retrieve the key for this account:
azure storage account keys list <storage-account-name>
This command returns the Primary and the Secondary keys. Copy the Primary key value because it will be used in the next steps.
Use the following command to retrieve a list of blob containers within the storage account:
azure storage container list -a <storage-account-name> -k <primary-key>
Use the following commands to upload files to and download files from blob storage:
To upload a file:
azure storage blob upload -a <storage-account-name> -k <primary-key> <source-file> <container-name> <blob-name>
To download a file:
azure storage blob download -a <storage-account-name> -k <primary-key> <container-name> <blob-name> <destination-file>
If you always work with the same storage account, you can set the following environment variables instead of specifying the account and key for every command:
- AZURE_STORAGE_ACCOUNT: The storage account name
- AZURE_STORAGE_ACCESS_KEY: The storage account key
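If you script uploads, the account and key options can be left off whenever both variables are set. The following Python sketch only assembles the argument list for the upload command shown earlier; it does not run anything. The function name `build_upload_command` and its parameters are hypothetical and exist only for this illustration.

```python
import os


def build_upload_command(source_file, container, blob,
                         account=None, key=None, env=None):
    """Return the argument list for `azure storage blob upload`.

    -a/-k are added only when an account and key are passed explicitly.
    Otherwise the two environment variables must already be set.
    """
    env = os.environ if env is None else env
    command = ["azure", "storage", "blob", "upload"]
    if account and key:
        command += ["-a", account, "-k", key]
    elif not (env.get("AZURE_STORAGE_ACCOUNT")
              and env.get("AZURE_STORAGE_ACCESS_KEY")):
        raise ValueError(
            "Pass account and key, or set AZURE_STORAGE_ACCOUNT "
            "and AZURE_STORAGE_ACCESS_KEY")
    # Positional order follows the upload syntax shown above:
    # <source-file> <container-name> <blob-name>
    command += [source_file, container, blob]
    return command
```

The returned list can be handed to `subprocess.run` on a machine where the Azure CLI is installed. Building a list rather than a single string avoids shell quoting problems with keys that contain special characters.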
Azure PowerShell is a scripting environment that you can use to control and automate the deployment and management of your workloads in Azure. For information about configuring your workstation to run Azure PowerShell, see Install and configure Azure PowerShell.
Azure PowerShell support for managing HDInsight resources using Azure Service Manager is deprecated, and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.
Please follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.
To upload a local file to Azure Storage
- Open the Azure PowerShell console as instructed in Install and configure Azure PowerShell.
Set the values of the first five variables in the following script:
$resourceGroupName = "<AzureResourceGroupName>"
$storageAccountName = "<StorageAccountName>"
$containerName = "<ContainerName>"
$fileName = "<LocalFileName>"
$blobName = "<BlobName>"

# Get the storage account key (the cmdlet returns a list of keys; use the first one)
$storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName)[0].Value

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# Copy the file from local workstation to the Blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext
Paste the script into the Azure PowerShell console and run it to copy the file.
For example PowerShell scripts created to work with HDInsight, see HDInsight tools.
AzCopy is a command-line tool that is designed to simplify the task of transferring data into and out of an Azure Storage account. You can use it as a standalone tool or incorporate this tool in an existing application. Download AzCopy.
The AzCopy syntax is:
AzCopy <Source> <Destination> [filePattern [filePattern...]] [Options]
For more information, see AzCopy - Uploading/Downloading files for Azure Blobs.
A preview of AzCopy on Linux is available. See Announcing AzCopy on Linux Preview.
Hadoop command line
The Hadoop command line is only useful for storing data in Azure Storage blobs when the data is already present on the cluster head node.
To use the Hadoop command, you must first connect to the head node using one of the following methods:
- Windows-based HDInsight: Connect using Remote Desktop
- Linux-based HDInsight: Connect using SSH or PuTTY.
Once connected, you can use the following syntax to upload a file to storage.
hadoop fs -copyFromLocal <localFilePath> <storageFilePath>
For example:
hadoop fs -copyFromLocal data.txt /example/data/data.txt
Because the default file system for HDInsight is in Azure Storage, /example/data/data.txt is actually in Azure Storage. You can also refer to the file as wasb:///example/data/data.txt.
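A file in Azure Storage can also be addressed by a fully qualified URI that names the container and storage account. The following Python sketch builds both forms, assuming the `wasb://` scheme that HDInsight uses for Azure Storage. The function name `wasb_uri` and the sample container and account names are hypothetical.

```python
def wasb_uri(path, container=None, account=None):
    """Build a wasb:// URI for a path in Azure Storage.

    With no container or account, the URI has an empty authority and
    resolves against the cluster's default container.
    """
    path = "/" + path.lstrip("/")
    if container is None and account is None:
        return "wasb://" + path
    if not (container and account):
        raise ValueError("container and account must be given together")
    return "wasb://{}@{}.blob.core.windows.net{}".format(
        container, account, path)


# Default container of the cluster
print(wasb_uri("/example/data/data.txt"))
# Explicit container and account (hypothetical names)
print(wasb_uri("/example/data/data.txt", "mycontainer", "mystorage"))
```

The explicit form is the one to use when the file lives in a container or storage account other than the cluster default.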
For a list of other Hadoop commands that work with files, see http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
On HBase clusters, the default block size used when writing data is 256 KB. While this works fine when using HBase APIs or REST APIs, using the hdfs dfs commands to write data larger than ~12 GB results in an error. For more information, see the storage exception for write on blob section in this article.
There are also several applications that provide a graphical interface for working with Azure Storage. The following table is a list of a few of these applications:
|Microsoft Visual Studio Tools for HDInsight||✔||✔||✔|
|Azure Storage Explorer||✔||✔||✔|
|Cloud Storage Studio 2||✔|
Visual Studio Tools for HDInsight
For more information, see Navigate the linked resources.
Azure Storage Explorer
Azure Storage Explorer is a useful tool for inspecting and altering the data in blobs. It is a free, open source tool that can be downloaded from http://storageexplorer.com/. The source code is available from this link as well.
Before using the tool, you must know your Azure storage account name and account key. For instructions about getting this information, see the "How to: View, copy, and regenerate storage access keys" section of Create, manage, or delete a storage account.
Run Azure Storage Explorer. If it is the first time you have run the Storage Explorer, you are prompted for the Storage account name and Storage account key. If you have run it before, use the Add button to add a new storage account name and key.
Enter the name and key for the storage account used by your HDInsight cluster and then select SAVE & OPEN.
- In the list of containers to the left of the interface, click the name of the container that is associated with your HDInsight cluster. By default, it is the name of the HDInsight cluster, but may be different if you entered a specific name when creating the cluster.
From the tool bar, select the upload icon.
Specify a file to upload, and then click Open. When prompted, select Upload to upload the file to the root of the storage container. If you want to upload the file to a specific path, enter the path in the Destination field and then select Upload.
Once the file has finished uploading, you can use it from jobs on the HDInsight cluster.
Mount Azure Storage as Local Drive
Upload using services
Azure Data Factory
The Azure Data Factory service is a fully managed service for composing data storage, data processing, and data movement services into streamlined, scalable, and reliable data production pipelines.
Azure Data Factory can be used to move data into Azure Storage, or to create data pipelines that directly use HDInsight features such as Hive and Pig.
For more information, see the Azure Data Factory documentation.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.
For more information, see Use Sqoop with HDInsight.
Azure Storage can also be accessed using an Azure SDK from the following programming languages:
For more information on installing the Azure SDKs, see Azure downloads.
Storage exception for write on blob
Symptoms: When using the hdfs dfs commands to write files that are ~12 GB or larger on an HBase cluster, you may encounter the following error:
ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
        ... 7 more
Cause: HBase on HDInsight clusters defaults to a block size of 256 KB when writing to Azure Storage. While this works for HBase APIs or REST APIs, it results in an error when using the hdfs dfs command-line utilities.
Resolution: Use fs.azure.write.request.size to specify a larger block size. You can do this on a per-use basis by using the -D parameter. The following command is an example using this parameter with the hadoop command:
hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
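The ~12 GB threshold is consistent with a per-blob block count limit: 256 KB blocks multiplied by a maximum of 50,000 blocks per blob comes to roughly 12 GB. The following Python sketch works through that arithmetic for the default and the increased block size. The 50,000 figure is an assumption taken from general Azure block blob limits, not from this article, and the function names are hypothetical.

```python
# Assumption: a block blob can hold at most 50,000 blocks.
MAX_BLOCKS_PER_BLOB = 50_000


def max_blob_size_bytes(write_request_size):
    """Largest blob that fits in MAX_BLOCKS_PER_BLOB blocks of the given size."""
    return MAX_BLOCKS_PER_BLOB * write_request_size


def to_gib(size_in_bytes):
    """Convert bytes to binary gigabytes (GiB)."""
    return size_in_bytes / 2**30


for block_size in (262144, 4194304):  # 256 KB default, 4 MB example value
    limit = max_blob_size_bytes(block_size)
    print("{} bytes per block -> about {:.1f} GiB per blob".format(
        block_size, to_gib(limit)))
```

Under that assumption, the default 262144-byte setting caps a file at about 12.2 GiB, while 4194304 bytes raises the cap to about 195.3 GiB.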
You can also increase the value of fs.azure.write.request.size globally by using Ambari. The following steps can be used to change the value in the Ambari Web UI:
In your browser, go to the Ambari Web UI for your cluster. This is https://CLUSTERNAME.azurehdinsight.net, where CLUSTERNAME is the name of your cluster.
When prompted, enter the admin name and password for the cluster.
- From the left side of the screen, select HDFS, and then select the Configs tab.
- In the Filter... field, enter fs.azure.write.request.size. This displays the field and current value in the middle of the page.
- Change the value from 262144 (256 KB) to the new value, for example 4194304 (4 MB).
For more information on using Ambari, see Manage HDInsight clusters using the Ambari Web UI.
Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis: