Upload data for Hadoop jobs in HDInsight
Azure HDInsight provides a full-featured Hadoop distributed file system (HDFS) over Azure Blob storage. It is designed as an HDFS extension to provide a seamless experience to customers. It enables the full set of components in the Hadoop ecosystem to operate directly on the data it manages. Azure Blob storage and HDFS are distinct file systems that are optimized for storage of data and computations on that data. For information about the benefits of using Azure Blob storage, see Use Azure Blob storage with HDInsight.
Note the following requirement before you begin:
- An Azure HDInsight cluster. For instructions, see Get started with Azure HDInsight or Provision HDInsight clusters.
Why blob storage?
Azure HDInsight clusters are typically deployed to run MapReduce jobs, and the clusters are dropped after these jobs complete. Keeping the data in the HDFS clusters after computations are complete would be an expensive way to store this data. Azure Blob storage is a highly available, highly scalable, high capacity, low cost, and shareable storage option for data that is to be processed using HDInsight. Storing data in a blob enables the HDInsight clusters that are used for computation to be safely released without losing data.
Azure Blob storage containers store data as key/value pairs, and there is no directory hierarchy. However, the "/" character can be used within the key name to make it appear as if a file is stored within a directory structure. HDInsight treats these as if they were actual directories.
For example, a blob's key may be input/log1.txt. No actual "input" directory exists, but due to the presence of the "/" character in the key name, it has the appearance of a file path.
Because of this, if you use Azure Explorer tools you may notice some 0 byte files. These files serve two purposes:
- If there are empty folders, they mark the existence of the folder. Azure Blob storage is clever enough to know that if a blob called foo/bar exists, there is a folder called foo. But the only way to signify an empty folder called foo is by having this special 0 byte file in place.
- They hold special metadata that is needed by the Hadoop file system, notably the permissions and owners for the folders.
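The way directory names are implied by "/" characters in flat blob keys can be illustrated with a short sketch. It does not call any Azure API; the key list and the helper name `virtual_directories` are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def virtual_directories(blob_keys):
    """Return the directory names implied by "/" separators in flat blob keys."""
    directories = set()
    for key in blob_keys:
        # Every parent of the key is a directory that exists only by implication.
        for parent in PurePosixPath(key).parents:
            if str(parent) != ".":
                directories.add(str(parent))
    return sorted(directories)


# Hypothetical container listing: flat keys, no real directories.
keys = ["input/log1.txt", "input/2017/log2.txt", "readme.txt"]
print(virtual_directories(keys))  # ['input', 'input/2017']
```

Note that a directory with no blobs under it never shows up in such a listing, which is why an empty folder needs a 0 byte marker file.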
Microsoft provides the following utilities to work with Azure Blob storage:
|Azure Command-Line Interface||✔||✔||✔|
While the Azure CLI, Azure PowerShell, and AzCopy can all be used from outside Azure, the Hadoop command is only available on the HDInsight cluster and only allows loading data from the local file system into Azure Blob storage.
The Azure CLI is a cross-platform tool that allows you to manage Azure services. Use the following steps to upload data to Azure Blob storage:
Azure CLI support for managing HDInsight resources using Azure Service Manager (ASM) is deprecated, and was removed on January 1, 2017. The steps in this document use the new Azure CLI commands that work with Azure Resource Manager.
Please follow the steps in Install and configure Azure CLI to install the latest version of the Azure CLI. If you have scripts that need to be modified to use the new commands that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.
- Install and configure the Azure CLI for Mac, Linux and Windows.
Open a command prompt, bash, or other shell, and use the following to authenticate to your Azure subscription.
When prompted, enter the user name and password for your subscription.
Enter the following command to list the storage accounts for your subscription:
azure storage account list
Select the storage account that contains the blob you want to work with, then use the following command to retrieve the key for this account:
azure storage account keys list <storage-account-name>
This should return Primary and Secondary keys. Copy the Primary key value because it will be used in the next steps.
Use the following command to retrieve a list of blob containers within the storage account:
azure storage container list -a <storage-account-name> -k <primary-key>
Use the following commands to upload and download files to the blob:
To upload a file:
azure storage blob upload -a <storage-account-name> -k <primary-key> <source-file> <container-name> <blob-name>
To download a file:
azure storage blob download -a <storage-account-name> -k <primary-key> <container-name> <blob-name> <destination-file>
If you will always be working with the same storage account, you can set the following environment variables instead of specifying the account and key for every command:
- AZURE_STORAGE_ACCOUNT: The storage account name
- AZURE_STORAGE_ACCESS_KEY: The storage account key
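As a rough sketch of how the environment variables shorten the commands, the following Python helper assembles the `azure storage blob` argument lists shown above, adding `-a` and `-k` only when an account and key are passed explicitly. The helper name `storage_blob_command`, the container name `mycontainer`, and the file names are hypothetical; the script only prints the commands and does not run them.

```python
import os


def storage_blob_command(action, *positional, account=None, key=None):
    """Build an `azure storage blob <action>` argument list.

    When account and key are omitted, the CLI is expected to read
    AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_ACCESS_KEY from the environment.
    """
    command = ["azure", "storage", "blob", action]
    if account is not None and key is not None:
        command += ["-a", account, "-k", key]
    return command + list(positional)


# Placeholder credentials; substitute the values retrieved in the earlier steps.
os.environ["AZURE_STORAGE_ACCOUNT"] = "<storage-account-name>"
os.environ["AZURE_STORAGE_ACCESS_KEY"] = "<primary-key>"

# Hypothetical file, container, and blob names.
upload = storage_blob_command("upload", "data.txt", "mycontainer", "example/data/data.txt")
download = storage_blob_command("download", "mycontainer", "example/data/data.txt", "copy.txt")
print(" ".join(upload))
print(" ".join(download))
```

On a workstation where the Azure CLI is installed, either list could be handed to `subprocess.run(..., check=True)` to execute the transfer.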
Azure PowerShell is a scripting environment that you can use to control and automate the deployment and management of your workloads in Azure. For information about configuring your workstation to run Azure PowerShell, see Install and configure Azure PowerShell.
Azure PowerShell support for managing HDInsight resources using Azure Service Manager is deprecated, and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.
Please follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.
To upload a local file to Azure Blob storage
- Open the Azure PowerShell console as instructed in Install and configure Azure PowerShell.
Set the values of the first five variables in the following script:
$resourceGroupName = "<AzureResourceGroupName>"
$storageAccountName = "<StorageAccountName>"
$containerName = "<ContainerName>"
$fileName = "<LocalFileName>"
$blobName = "<BlobName>"

# Get the storage account key (the cmdlet returns a list of keys; use the first one)
$storageAccountKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName)[0].Value

# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

# Copy the file from the local workstation to the Blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext
- Paste the script into the Azure PowerShell console to run it to copy the file.
For example PowerShell scripts created to work with HDInsight, see HDInsight tools.
AzCopy is a command-line tool that is designed to simplify the task of transferring data into and out of an Azure Storage account. You can use it as a standalone tool or incorporate this tool in an existing application. Download AzCopy.
The AzCopy syntax is:
AzCopy <Source> <Destination> [filePattern [filePattern...]] [Options]
For more information, see AzCopy - Uploading/Downloading files for Azure Blobs.
The Hadoop command line is only useful for storing data into blob storage when the data is already present on the cluster head node.
In order to use the Hadoop command, you must first connect to the headnode using one of the following methods:
- Windows-based HDInsight: Connect using Remote Desktop
- Linux-based HDInsight: Connect using SSH (the SSH command or PuTTY)
Once connected, you can use the following syntax to upload a file to storage.
hadoop fs -copyFromLocal <localFilePath> <storageFilePath>
For example:
hadoop fs -copyFromLocal data.txt /example/data/data.txt
Because the default file system for HDInsight is in Azure Blob storage, /example/data/data.txt is actually in Azure Blob storage. You can also refer to the file as:
For a list of other Hadoop commands that work with files, see http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
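Paths in Azure Blob storage are commonly written on HDInsight as fully qualified URIs. The sketch below assumes the `wasb://` scheme, in which an empty authority refers to the cluster's default container; the helper name `wasb_uri` and the container and account names are hypothetical.

```python
def wasb_uri(path, container=None, account=None):
    """Return a wasb URI for a path in Azure Blob storage.

    Assumption: with no container/account, the short form refers to the
    cluster's default container.
    """
    path = "/" + path.lstrip("/")
    if container is None or account is None:
        return "wasb://" + path
    return f"wasb://{container}@{account}.blob.core.windows.net{path}"


# Short form, relative to the default container.
print(wasb_uri("/example/data/data.txt"))
# Fully qualified form with hypothetical container and account names.
print(wasb_uri("/example/data/data.txt", "mycontainer", "mystorageaccount"))
```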
On HBase clusters, the default block size used when writing data is 256KB. While this works fine when using HBase APIs or REST APIs, using the hdfs dfs commands to write data larger than ~12GB results in an error. See the storage exception for write on blob section below for more information.
There are also several applications that provide a graphical interface for working with Azure Storage. The following is a list of a few of these applications:
|Microsoft Visual Studio Tools for HDInsight||✔||✔||✔|
|Azure Storage Explorer||✔||✔||✔|
|Cloud Storage Studio 2||✔|
Visual Studio Tools for HDInsight
For more information, see Navigate the linked resources.
Azure Storage Explorer is a useful tool for inspecting and altering the data in blobs. It is a free, open source tool that can be downloaded from http://storageexplorer.com/. The source code is available from this link as well.
Before using the tool, you must know your Azure storage account name and account key. For instructions about getting this information, see the "How to: View, copy and regenerate storage access keys" section of Create, manage, or delete a storage account.
Run Azure Storage Explorer. If this is the first time you have run the Storage Explorer, you will be prompted for the Storage account name and Storage account key. If you have run it before, use the Add button to add a new storage account name and key.
Enter the name and key for the storage account used by your HDInsight cluster and then select SAVE & OPEN.
- In the list of containers to the left of the interface, click the name of the container that is associated with your HDInsight cluster. By default, this is the name of the HDInsight cluster, but may be different if you entered a specific name when creating the cluster.
From the tool bar, select the upload icon.
Specify a file to upload, and then click Open. When prompted, select Upload to upload the file to the root of the storage container. If you want to upload the file to a specific path, enter the path in the Destination field and then select Upload.
Once the file has finished uploading, you can use it from jobs on the HDInsight cluster.
Mount Azure Blob Storage as Local Drive
Azure Data Factory
The Azure Data Factory service is a fully managed service for composing data storage, data processing, and data movement services into streamlined, scalable, and reliable data production pipelines.
Azure Data Factory can be used to move data into Azure Blob storage, or to create data pipelines that directly use HDInsight features such as Hive and Pig.
For more information, see the Azure Data Factory documentation.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle into the Hadoop distributed file system (HDFS), transform the data in Hadoop with MapReduce or Hive, and then export the data back into an RDBMS.
For more information, see Use Sqoop with HDInsight.
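The import and export round trip described above can be sketched as two command lines. This is an illustrative sketch only: the helper names, JDBC URL, user, table, and directory values are hypothetical, and the flags shown (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--export-dir`) are the standard Sqoop options rather than anything specific to this article.

```python
def sqoop_import_command(jdbc_url, username, table, target_dir):
    """Build a `sqoop import` argument list that copies a table into Hadoop storage."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",  # prompt for the password instead of placing it on the command line
        "--table", table,
        "--target-dir", target_dir,
    ]


def sqoop_export_command(jdbc_url, username, table, export_dir):
    """Build a `sqoop export` argument list that writes files back to an RDBMS table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
    ]


# Hypothetical connection details and paths.
url = "jdbc:sqlserver://<server>:1433;database=<database>"
print(" ".join(sqoop_import_command(url, "<user>", "logs", "/example/data/logs")))
print(" ".join(sqoop_export_command(url, "<user>", "logs_summary", "/example/data/summary")))
```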
Azure Blob storage can also be accessed using an Azure SDK from the following programming languages:
For more information on installing the Azure SDKs, see Azure downloads.
Symptoms: When using the hdfs dfs commands to write files that are ~12GB or larger on an HBase cluster, you may encounter the following error:
ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
        ... 7 more
Cause: HBase on HDInsight clusters defaults to a block size of 256KB when writing to Azure storage. While this works for HBase APIs or REST APIs, it will result in an error when using the hdfs dfs command-line utilities.
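The ~12GB threshold is consistent with simple arithmetic if one assumes a limit of 50,000 blocks per block blob. That limit is not stated in this article and is an assumption of the sketch below, which only multiplies block size by block count.

```python
# Assumption (not stated in this article): a block blob holds at most 50,000 blocks.
MAX_BLOCKS_PER_BLOB = 50_000


def max_blob_bytes(block_size_bytes):
    """Largest blob that can be written with a fixed block size, under the assumption above."""
    return block_size_bytes * MAX_BLOCKS_PER_BLOB


GIB = 2 ** 30
default_block = 262144   # 256KB, the HBase cluster default
larger_block = 4194304   # 4MB

print(round(max_blob_bytes(default_block) / GIB, 1))  # 12.2
print(round(max_blob_bytes(larger_block) / GIB, 1))   # 195.3
```

Under that assumption, raising the block size from 256KB to 4MB raises the ceiling from roughly 12GB to roughly 195GB.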
Resolution: Use fs.azure.write.request.size to specify a larger block size. You can do this on a per-use basis by using the -D parameter. The following is an example using this parameter with the hadoop command:
hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
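For scripted uploads, the per-use override can be assembled programmatically. The following is a minimal sketch, assuming the `hadoop fs` form of the command with the `-D` generic option placed before `-copyFromLocal`; the helper name `copy_from_local_command` is hypothetical, and the script prints the command rather than running it.

```python
def copy_from_local_command(local_path, storage_path, write_request_size=None):
    """Build a `hadoop fs -copyFromLocal` argument list.

    If write_request_size (in bytes) is given, it is passed as a per-use
    override of fs.azure.write.request.size.
    """
    command = ["hadoop", "fs"]
    if write_request_size is not None:
        # Generic options such as -D go before the file system command.
        command += ["-D", f"fs.azure.write.request.size={write_request_size}"]
    return command + ["-copyFromLocal", local_path, storage_path]


# 4194304 bytes = 4MB, the example value used in this section.
print(" ".join(copy_from_local_command("test_large_file.bin", "/example/data", 4194304)))
```

On the cluster head node, the returned list could be passed to `subprocess.run(..., check=True)`.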
You can also increase the value of fs.azure.write.request.size globally by using Ambari. The following steps can be used to change the value in the Ambari Web UI:
In your browser, go to the Ambari Web UI for your cluster. This is https://CLUSTERNAME.azurehdinsight.net, where CLUSTERNAME is the name of your cluster.
When prompted, enter the admin name and password for the cluster.
- From the left side of the screen, select HDFS, and then select the Configs tab.
- In the Filter... field, enter fs.azure.write.request.size. This will display the field and current value in the middle of the page.
- Change the value from 262144 (256KB) to the new value, for example 4194304 (4MB).
For more information on using Ambari, see Manage HDInsight clusters using the Ambari Web UI.
Now that you understand how to get data into HDInsight, read the following articles to learn how to perform analysis: