Use Azure storage with Azure HDInsight clusters

To analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Storage Gen 1, Azure Data Lake Storage Gen 2, or a combination of them. These storage options let you safely delete HDInsight clusters that are used for computation without losing user data.

Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system. With HDInsight 3.6, you can select either Azure Storage or Azure Data Lake Storage Gen 1/Azure Data Lake Storage Gen 2 as the default file system, with a few exceptions. For the supportability of using Data Lake Storage Gen 1 as both the default and linked storage, see Availability for HDInsight cluster.

In this article, you learn how Azure Storage works with HDInsight clusters. To learn how Data Lake Storage Gen 1 works with HDInsight clusters, see Use Azure Data Lake Storage with Azure HDInsight clusters. For more information about creating an HDInsight cluster, see Create Apache Hadoop clusters in HDInsight.

Important

Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.

Storage account kind | Supported services | Supported performance tiers | Supported access tiers
StorageV2 (general-purpose v2) | Blob | Standard | Hot, Cool, Archive*
Storage (general-purpose v1) | Blob | Standard | N/A
BlobStorage | Blob | Standard | Hot, Cool, Archive*
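
The restriction on account kinds can be sketched as a small shell check. This is a minimal illustration, not part of any HDInsight tooling: the function name `can_be_default_storage` is hypothetical, and it assumes that the two general-purpose kinds listed in the table are usable as default storage while BlobStorage is limited to secondary storage, as stated above.

```shell
# Hypothetical helper: returns 0 if the storage account kind may back the
# default file system, 1 if it is limited to secondary storage,
# and 2 if the kind is not one of the kinds listed in the table.
can_be_default_storage() {
  case $1 in
    StorageV2|Storage) return 0 ;;   # general-purpose v2 / v1
    BlobStorage)       return 1 ;;   # secondary storage only
    *) echo "unknown storage account kind: $1" >&2; return 2 ;;
  esac
}

# Example: report how each listed kind can be used.
for kind in StorageV2 Storage BlobStorage; do
  if can_be_default_storage "$kind"; then
    echo "$kind: default or secondary storage"
  else
    echo "$kind: secondary storage only"
  fi
done
```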

We don't recommend storing business data in the default blob container. The default container holds application and system logs, and deleting it after each use is a good way to reduce storage cost. Make sure to retrieve the logs before you delete the container.

Sharing one blob container as the default file system for multiple clusters isn't supported.

Note

The Archive access tier is an offline tier with a retrieval latency of several hours and isn't recommended for use with HDInsight. For more information, see Archive access tier.

Access files from the cluster

There are several ways you can access the files in Azure Storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and SSL encrypted access (with the wasbs: prefix). We recommend using wasbs wherever possible, even when accessing data that lives inside the same region in Azure.

  • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    wasb://<containername>@<accountname>.blob.core.windows.net/<file.path>/
    wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/
    
  • Using the shortened path format. With this approach, you replace the path up to the cluster root with:

    wasb:///<file.path>/
    wasbs:///<file.path>/
    
  • Using the relative path. With this approach, you only provide the relative path to the file that you want to access.

    /<file.path>/
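
The path formats above can be assembled in a script. The following is a minimal sketch; the function names `wasbs_uri` and `to_wasbs` and the sample container and account names are hypothetical and only illustrate the URI layout described in this section.

```shell
# Hypothetical helper: build a fully qualified wasbs:// URI from a
# container name, a storage account name, and a path inside the container.
wasbs_uri() {
  container=$1
  account=$2
  path=${3#/}   # drop one leading slash so the URI has a single separator
  printf 'wasbs://%s@%s.blob.core.windows.net/%s\n' "$container" "$account" "$path"
}

# Hypothetical helper: switch an unencrypted wasb:// URI to wasbs://.
# Relative paths and URIs that already use wasbs:// are passed through.
to_wasbs() {
  case $1 in
    wasb://*) printf 'wasbs://%s\n' "${1#wasb://}" ;;
    *)        printf '%s\n' "$1" ;;
  esac
}

wasbs_uri mycontainer myaccount /example/data/
to_wasbs 'wasb:///example/data/'
```

Only the fully qualified form names the container and account explicitly. The shortened and relative forms resolve against the default file system of the cluster.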
    

Data access examples

Examples are based on an SSH connection to the head node of the cluster. The examples use all three path formats. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.

A few hdfs commands

  1. Create a simple file on local storage.

    touch testFile.txt
    
  2. Create directories on cluster storage.

    hdfs dfs -mkdir wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
    hdfs dfs -mkdir wasbs:///sampledata2/
    hdfs dfs -mkdir /sampledata3/
    
  3. Copy data from local storage to cluster storage.

    hdfs dfs -copyFromLocal testFile.txt  wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
    hdfs dfs -copyFromLocal testFile.txt  wasbs:///sampledata2/
    hdfs dfs -copyFromLocal testFile.txt  /sampledata3/
    
  4. List directory contents on cluster storage.

    hdfs dfs -ls wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
    hdfs dfs -ls wasbs:///sampledata2/
    hdfs dfs -ls /sampledata3/
    

Note

When working with blobs outside of HDInsight, most utilities do not recognize the WASB format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
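
When a WASB path has to be handed to such a utility, the scheme and authority can be stripped to leave the basic path. The following is a minimal sketch; the function name `wasb_to_blob_path` is hypothetical, and the container and account in the sample URI are placeholders.

```shell
# Hypothetical helper: reduce a wasb:// or wasbs:// URI (fully qualified or
# shortened) or a relative path to the basic blob path inside the container.
wasb_to_blob_path() {
  # First expression removes "wasb://authority/" or "wasbs://authority/";
  # second expression removes a leading slash from a relative path.
  printf '%s\n' "$1" | sed -E 's#^wasbs?://[^/]*/##; s#^/##'
}

wasb_to_blob_path 'wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar'
```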

Creating a Hive table

Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.

DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/example/data/';
-- Alternative locations (use one in place of the LOCATION line above):
-- LOCATION 'wasbs:///example/data/';
-- LOCATION '/example/data/';

Identify storage path from Ambari

  • To identify the complete path to the configured default store, navigate to:

    HDFS > Configs and enter fs.defaultFS in the filter input box.

  • To check if wasb store is configured as secondary storage, navigate to:

    HDFS > Configs and enter blob.core.windows.net in the filter input box.
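
The default store can also be read from an SSH session with the standard Hadoop command hdfs getconf -confKey fs.defaultFS. The sketch below splits such a value into container and account names. The function name `parse_wasb_authority` is hypothetical, and it assumes the value has the wasb or wasbs form shown earlier in this article.

```shell
# Hypothetical helper: print "<container> <account>" for a wasb/wasbs URI.
parse_wasb_authority() {
  uri=$1
  rest=${uri#*://}           # strip the scheme
  authority=${rest%%/*}      # keep container@account.blob.core.windows.net
  container=${authority%%@*}
  host=${authority#*@}
  account=${host%%.*}        # first DNS label is the storage account name
  printf '%s %s\n' "$container" "$account"
}

# On a cluster node, query the configured default file system.
# The guard skips the call on machines without the hdfs client.
if command -v hdfs >/dev/null 2>&1; then
  parse_wasb_authority "$(hdfs getconf -confKey fs.defaultFS)"
fi
```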

Blob containers

To use blobs, you first create an Azure Storage account. As part of this, you specify an Azure region where the storage account is created. The cluster and the storage account must be hosted in the same region. The Hive metastore SQL Server database and Apache Oozie metastore SQL Server database must also be located in the same region.
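
The same-region requirement can be checked before deployment. The following is a minimal sketch with hypothetical function names. It assumes that a region display name such as "East US" corresponds to the programmatic name "eastus" once it is lowercased and spaces are removed, which may not hold for every region.

```shell
# Hypothetical helper: lowercase a region name and remove spaces,
# so "East US" and "eastus" compare as equal.
normalize_region() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' '
}

# Hypothetical helper: succeeds only if both resources are in the same region.
same_region() {
  [ "$(normalize_region "$1")" = "$(normalize_region "$2")" ]
}

if same_region "East US" "eastus"; then
  echo "cluster and storage account are in the same region"
else
  echo "region mismatch: this configuration is not supported"
fi
```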

Wherever it lives, each blob you create belongs to a container in your Azure Storage account. This container may be an existing container that was created outside of HDInsight, or it may be a container that is created for an HDInsight cluster.

The default blob container stores cluster-specific information such as job history and logs. Don't share a default blob container among multiple HDInsight clusters, because doing so might corrupt job history. We recommend using a different container for each cluster and putting shared data on a linked storage account that is specified in the deployment of all relevant clusters, rather than on the default storage account. For more information on configuring linked storage accounts, see Create HDInsight clusters. However, you can reuse a default storage container after the original HDInsight cluster has been deleted. For HBase clusters, you can keep the HBase table schema and data by creating a new HBase cluster that uses the default blob container of an HBase cluster that has been deleted.
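
A deployment inventory can be scanned for clusters that would share a default container. The following is a minimal sketch. The input format (one line per cluster with the cluster name, storage account, and default container separated by spaces) and the function name `find_shared_default_containers` are hypothetical.

```shell
# Hypothetical helper: read "cluster account container" lines on stdin and
# print every account/container pair that is used by more than one cluster.
find_shared_default_containers() {
  awk '{
         key = $2 "/" $3            # account/container identifies the store
         count[key]++
         clusters[key] = clusters[key] " " $1
       }
       END {
         for (k in count)
           if (count[k] > 1) print k ":" clusters[k]
       }'
}

# Example inventory: cluster1 and cluster2 share the same default container.
printf '%s\n' \
  'cluster1 myaccount containera' \
  'cluster2 myaccount containera' \
  'cluster3 myaccount containerb' |
  find_shared_default_containers
```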

Note

The secure transfer requirement forces all requests to your account to use a secure connection. Only HDInsight cluster version 3.6 or newer supports this feature. For more information, see Create Apache Hadoop cluster with secure transfer storage accounts in Azure HDInsight.

Interacting with Azure storage

Microsoft provides the following tools to work with Azure Storage:

Tool Linux OS X Windows
Azure portal
Azure CLI
Azure PowerShell
AzCopy

Use additional storage accounts

While creating an HDInsight cluster, you specify the Azure Storage account you want to associate with it. You can also add more storage accounts from the same Azure subscription or from different Azure subscriptions, either during the creation process or after a cluster has been created. For instructions about adding storage accounts, see Create HDInsight clusters.

Warning

Using an additional storage account in a different location than the HDInsight cluster is not supported.

Next steps

In this article, you learned how to use HDFS-compatible Azure storage with HDInsight. This allows you to build scalable, long-term, archiving data acquisition solutions and use HDInsight to unlock the information inside the stored structured and unstructured data.

For more information, see: