Use Azure storage with Azure HDInsight clusters
To analyze data in an HDInsight cluster, you can store the data in Azure Storage, Azure Data Lake Storage Gen 1/Azure Data Lake Storage Gen 2, or a combination. These storage options enable you to safely delete HDInsight clusters that are used for computation without losing user data.
Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system. With HDInsight 3.6, you can select either Azure Storage or Azure Data Lake Storage Gen 1/Azure Data Lake Storage Gen 2 as the default file system, with a few exceptions. For the supportability of using Data Lake Storage Gen 1 as both the default and linked storage, see Availability for HDInsight cluster.
In this article, you learn how Azure Storage works with HDInsight clusters. To learn how Data Lake Storage Gen 1 works with HDInsight clusters, see Use Azure Data Lake Storage with Azure HDInsight clusters. For more information about creating an HDInsight cluster, see Create Apache Hadoop clusters in HDInsight.
Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
| Storage account kind | Supported services | Supported performance tiers | Not supported performance tiers | Supported access tiers |
| StorageV2 (general-purpose v2) | Blob | Standard | Premium | Hot, Cool, Archive* |
| Storage (general-purpose v1) | Blob | Standard | Premium | N/A |
| BlobStorage | Blob | Standard | Premium | Hot, Cool, Archive* |
We don't recommend that you use the default blob container for storing business data. Deleting the default blob container after each use to reduce storage cost is a good practice. The default container contains application and system logs. Make sure to retrieve the logs before deleting the container.
Sharing one blob container as the default file system for multiple clusters isn't supported.
The Archive access tier is an offline tier with a retrieval latency of several hours, and it isn't recommended for use with HDInsight. For more information, see Archive access tier.
Access files from within cluster
There are several ways you can access the files in Azure Storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and TLS-encrypted access (with the wasbs: prefix). We recommend using wasbs wherever possible, even when accessing data that lives inside the same region in Azure.
- Using the fully qualified name. With this approach, you provide the full path to the file that you want to access, as in wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/.
- Using the shortened path format. With this approach, you replace the path up to the cluster root with wasb:/// or wasbs:///, as in wasbs:///sampledata2/.
- Using the relative path. With this approach, you only provide the relative path to the file that you want to access, as in /sampledata3/.
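The three addressing forms can be sketched as shell variables. This is a minimal illustration, not an official utility: CONTAINERNAME and STORAGEACCOUNT are placeholders, and the file path example/data/sample.log is a hypothetical name chosen only for the example.

```shell
# Placeholder values; replace with your own container and storage account.
CONTAINERNAME="CONTAINERNAME"
STORAGEACCOUNT="STORAGEACCOUNT"
FILE_PATH="example/data/sample.log"   # hypothetical file under the container root

# 1. Fully qualified name: scheme, container, account endpoint, and path.
FULL_URI="wasbs://${CONTAINERNAME}@${STORAGEACCOUNT}.blob.core.windows.net/${FILE_PATH}"

# 2. Shortened path: container and account are implied by the default file system.
SHORT_URI="wasbs:///${FILE_PATH}"

# 3. Relative path: no scheme at all; resolved against the default file system.
REL_PATH="/${FILE_PATH}"

printf '%s\n' "$FULL_URI" "$SHORT_URI" "$REL_PATH"
```

On a cluster whose default file system is that container, all three strings should refer to the same file.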
Data access examples
Examples are based on an SSH connection to the head node of the cluster. The examples use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.
A few hdfs commands
Create a simple file on local storage.
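One way to create such a file is sketched below. The name testFile.txt matches the copy commands later in this section; the file contents are an arbitrary assumption.

```shell
# Create a small local file to upload; the text written to it is arbitrary.
printf 'hello from the head node\n' > testFile.txt

# Confirm that the file exists on local storage.
ls -l testFile.txt
```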
Create directories on cluster storage.
hdfs dfs -mkdir wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -mkdir wasbs:///sampledata2/
hdfs dfs -mkdir /sampledata3/
Copy data from local storage to cluster storage.
hdfs dfs -copyFromLocal testFile.txt wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -copyFromLocal testFile.txt wasbs:///sampledata2/
hdfs dfs -copyFromLocal testFile.txt /sampledata3/
List directory contents on cluster storage.
hdfs dfs -ls wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -ls wasbs:///sampledata2/
hdfs dfs -ls /sampledata3/
When working with blobs outside of HDInsight, most utilities don't recognize the WASB format and instead expect a basic path format.
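When a tool outside the cluster needs the account, container, and blob path separately, a fully qualified WASB URI can be split with plain shell parameter expansion. The sketch below assumes that the basic path format means the blob path relative to the container; parse_wasb_uri is a hypothetical helper, not part of any Azure tool, and it doesn't handle the shortened wasbs:/// form.

```shell
# Split a fully qualified WASB URI into storage account, container, and blob path.
# parse_wasb_uri is a hypothetical helper written for this example.
parse_wasb_uri() {
    uri="$1"
    rest="${uri#*://}"          # drop the wasb:// or wasbs:// scheme
    authority="${rest%%/*}"     # CONTAINER@ACCOUNT.blob.core.windows.net
    blob_path="${rest#*/}"      # everything after the first slash
    container="${authority%%@*}"
    host="${authority#*@}"
    account="${host%%.*}"       # first DNS label of the endpoint
    printf 'account=%s container=%s path=%s\n' "$account" "$container" "$blob_path"
}

parse_wasb_uri "wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/testFile.txt"
```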
Creating a Hive table
Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.
DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/example/data/';
-- Alternative locations; use one of these in place of the LOCATION line above:
-- LOCATION 'wasbs:///example/data/';
-- LOCATION '/example/data/';
Access files from outside cluster
Microsoft provides the following tools to work with Azure Storage:
Identify storage path from Ambari
To identify the complete path to the configured default store, navigate to HDFS > Configs and enter fs.defaultFS in the filter input box.
To check whether a wasb store is configured as secondary storage, navigate to HDFS > Configs and enter blob.core.windows.net in the filter input box.
To obtain the path using the Ambari REST API, see Get the default storage.
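As a rough sketch of what such a REST call can look like, the helpers below build the Ambari configuration URL for the HDFS service and filter the response for fs.defaultFS. The function names and CLUSTERNAME are hypothetical, the cluster login is assumed to be admin, and the query assumes that configuration version 1 of the HDFS service holds the value you want. The second function requires curl and jq and is defined but not called here.

```shell
# Build the Ambari REST URL for the HDFS service configuration.
# CLUSTERNAME is a placeholder; both function names are hypothetical.
ambari_config_url() {
    printf 'https://%s.azurehdinsight.net/api/v1/clusters/%s/configurations/service_config_versions?service_name=HDFS&service_config_version=1\n' "$1" "$1"
}

# Query Ambari and print the fs.defaultFS value.
# curl prompts for the password of the assumed cluster login "admin".
get_default_storage() {
    curl -u admin -sS -G "$(ambari_config_url "$1")" \
        | jq -r '.items[].configurations[].properties["fs.defaultFS"] | select(. != null)'
}

ambari_config_url "CLUSTERNAME"
```

Running get_default_storage with your cluster name from a machine that can reach the cluster should print a wasb or wasbs URI if Azure Storage is the default store.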
To use blobs, you first create an Azure Storage account. As part of this, you specify an Azure region where the storage account is created. The cluster and the storage account must be hosted in the same region. The Hive metastore SQL Server database and Apache Oozie metastore SQL Server database must also be located in the same region.
Wherever it lives, each blob you create belongs to a container in your Azure Storage account. This container may be an existing one that was created outside of HDInsight, or it may be a container that is created for an HDInsight cluster.
The default Blob container stores cluster-specific information such as job history and logs. Don't share a default Blob container with multiple HDInsight clusters, because doing so might corrupt job history. We recommend using a different container for each cluster and putting shared data on a linked storage account that is specified in the deployment of all relevant clusters, rather than on the default storage account. For more information on configuring linked storage accounts, see Create HDInsight clusters. However, you can reuse a default storage container after the original HDInsight cluster has been deleted. For HBase clusters, you can keep the HBase table schema and data by creating a new HBase cluster that uses the default blob container of a deleted HBase cluster.
The secure transfer requirement forces all requests to your storage account to go through a secure connection. Only HDInsight cluster version 3.6 or newer supports this feature. For more information, see Create Apache Hadoop cluster with secure transfer storage accounts in Azure HDInsight.
Use additional storage accounts
While creating an HDInsight cluster, you specify the Azure Storage account you want to associate with it. You can also add more storage accounts, from the same Azure subscription or from different Azure subscriptions, during the creation process or after a cluster has been created. For instructions on adding storage accounts, see Create HDInsight clusters.
Using an additional storage account in a different location than the HDInsight cluster isn't supported.
In this article, you learned how to use HDFS-compatible Azure storage with HDInsight. This storage lets you build scalable, long-term data acquisition and archiving solutions, and use HDInsight to unlock the information inside the stored structured and unstructured data.
For more information, see:
- Get started with Azure HDInsight
- Get started with Azure Data Lake Storage
- Upload data to HDInsight
- Use Apache Hive with HDInsight
- Use Azure Storage Shared Access Signatures to restrict access to data with HDInsight
- Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters
- Tutorial: Extract, transform, and load data using Interactive Query in Azure HDInsight