Create an HDInsight cluster with Azure Data Lake Storage Gen1 using Azure Resource Manager template

Learn how to use an Azure Resource Manager template to configure an HDInsight cluster with Azure Data Lake Storage Gen1 as additional storage.

For supported cluster types, Data Lake Storage Gen1 can be used as default storage or as an additional storage account. When Data Lake Storage Gen1 is used as additional storage, the default storage account for the cluster is still Azure Storage Blobs (WASB), and cluster-related files (such as logs) are still written to the default storage, while the data that you want to process can be stored in a Data Lake Storage Gen1 account. Using Data Lake Storage Gen1 as an additional storage account does not impact performance or the ability to read from and write to the storage from the cluster.

Using Data Lake Storage Gen1 for HDInsight cluster storage

Here are some important considerations for using HDInsight with Data Lake Storage Gen1:

  • The option to create HDInsight clusters with access to Data Lake Storage Gen1 as default storage is available for HDInsight versions 3.5 and 3.6.

  • The option to create HDInsight clusters with access to Data Lake Storage Gen1 as additional storage is available for HDInsight versions 3.2, 3.4, 3.5, and 3.6.

In this article, we provision a Hadoop cluster with Data Lake Storage Gen1 as additional storage. For instructions on how to create a Hadoop cluster with Data Lake Storage Gen1 as default storage, see Create an HDInsight cluster with Data Lake Storage Gen1 using Azure Portal.

Prerequisites

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Before you begin this tutorial, you must have the following:

  • An Azure subscription. See Get Azure free trial.

  • Azure PowerShell 1.0 or greater. See How to install and configure Azure PowerShell.

  • Azure Active Directory Service Principal. Steps in this tutorial provide instructions on how to create a service principal in Azure AD. However, you must be an Azure AD administrator to be able to create a service principal. If you are an Azure AD administrator, you can skip this prerequisite and proceed with the tutorial.

    If you are not an Azure AD administrator, you will not be able to perform the steps required to create a service principal. In such a case, your Azure AD administrator must first create a service principal before you can create an HDInsight cluster with Data Lake Storage Gen1. Also, the service principal must be created using a certificate, as described at Create a service principal with certificate.

Create an HDInsight cluster with Data Lake Storage Gen1

The Resource Manager template, and the prerequisites for using the template, are available on GitHub at Deploy a HDInsight Linux cluster with new Data Lake Storage Gen1. Follow the instructions provided at this link to create an HDInsight cluster with Data Lake Storage Gen1 as the additional storage.

The instructions at the link mentioned above require PowerShell. Before you start with those instructions, make sure you log in to your Azure account. From your desktop, open a new Azure PowerShell window and run the following commands. When prompted to log in, make sure you log in as a subscription administrator or owner:

# Log in to your Azure account
Connect-AzAccount

# List all the subscriptions associated with your account
Get-AzSubscription

# Select a subscription
Set-AzContext -SubscriptionId <subscription ID>

The template deploys these resource types:

Upload sample data to Data Lake Storage Gen1

The Resource Manager template creates a new Data Lake Storage Gen1 account and associates it with the HDInsight cluster. You must now upload some sample data to Data Lake Storage Gen1. You'll need this data later in the tutorial to run jobs from an HDInsight cluster that access data in the Data Lake Storage Gen1 account. For instructions on how to upload data, see Upload a file to your Data Lake Storage Gen1 account. If you are looking for some sample data to upload, you can get the Ambulance Data folder from the Azure Data Lake Git Repository.

Set relevant ACLs on the sample data

To make sure the sample data you upload is accessible from the HDInsight cluster, the Azure AD application that is used to establish identity between the HDInsight cluster and Data Lake Storage Gen1 must have access to the file or folder you are trying to access. To grant that access, perform the following steps.

  1. Find the name of the Azure AD application that is associated with the HDInsight cluster and the Data Lake Storage Gen1 account. One way to look for the name is to open the blade for the HDInsight cluster that you created using the Resource Manager template, click the Cluster AAD Identity tab, and look for the value of Service Principal Display Name.
  2. Grant this Azure AD application access to the file or folder that you want to access from the HDInsight cluster. To set the right ACLs on the file or folder in Data Lake Storage Gen1, see Securing data in Data Lake Storage Gen1.
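
If you prefer to script step 2, the following is a minimal sketch that uses the Azure CLI command az dls fs access set-entry instead of the portal. It rests on several assumptions: this article does not otherwise use the Azure CLI, so the CLI must be installed and logged in separately; the helper names acl_spec and grant_access are hypothetical; and the account name, path, object ID, and permissions are placeholders that you supply. The object ID is that of the service principal identified in step 1.

```shell
# Hypothetical helper: builds an ACL entry for a single user or service principal.
# $1 = object ID of the service principal, $2 = permissions (for example r-x)
acl_spec() {
    printf 'user:%s:%s\n' "$1" "$2"
}

# Hypothetical helper: applies that ACL entry to a file or folder.
# $1 = Data Lake Storage Gen1 account name, $2 = path in the account,
# $3 = object ID of the service principal, $4 = permissions
grant_access() {
    az dls fs access set-entry \
        --account "$1" \
        --path "$2" \
        --acl-spec "$(acl_spec "$3" "$4")"
}
```

For example, grant_access mydatalakestoragegen1 /mynewfolder <object ID> r-x would add one entry to the ACL of /mynewfolder. The r-x value is only an illustration; see Securing data in Data Lake Storage Gen1 for the permissions your scenario actually needs.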

Run test jobs on the HDInsight cluster to use Data Lake Storage Gen1

After you have configured an HDInsight cluster, you can run test jobs on the cluster to verify that it can access Data Lake Storage Gen1. To do so, we will run a sample Hive job that creates a table using the sample data that you uploaded earlier to your Data Lake Storage Gen1 account.

In this section you will SSH into an HDInsight Linux cluster and run a sample Hive query. If you are using a Windows client, we recommend using PuTTY, which can be downloaded from https://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

For more information on using PuTTY, see Use SSH with Linux-based Hadoop on HDInsight from Windows.

  1. Once connected, start the Hive CLI by using the following command:

    hive
    
  2. Using the CLI, enter the following statements to create a new table named vehicles by using the sample data in Data Lake Storage Gen1:

    DROP TABLE vehicles;
    CREATE EXTERNAL TABLE vehicles (str string) LOCATION 'adl://<mydatalakestoragegen1>.azuredatalakestore.net:443/';
    SELECT * FROM vehicles LIMIT 10;
    

    You should see an output similar to the following:

    1,1,2014-09-14 00:00:03,46.81006,-92.08174,51,S,1
    1,2,2014-09-14 00:00:06,46.81006,-92.08174,13,NE,1
    1,3,2014-09-14 00:00:09,46.81006,-92.08174,48,NE,1
    1,4,2014-09-14 00:00:12,46.81006,-92.08174,30,W,1
    1,5,2014-09-14 00:00:15,46.81006,-92.08174,47,S,1
    1,6,2014-09-14 00:00:18,46.81006,-92.08174,9,S,1
    1,7,2014-09-14 00:00:21,46.81006,-92.08174,53,N,1
    1,8,2014-09-14 00:00:24,46.81006,-92.08174,63,SW,1
    1,9,2014-09-14 00:00:27,46.81006,-92.08174,4,NE,1
    1,10,2014-09-14 00:00:30,46.81006,-92.08174,31,N,1
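
The same statements can also be run without an interactive Hive session. The sketch below is one possible way to do that from the SSH prompt, using the -e option of the hive command to pass the statements as a string. The helper names vehicles_hql and run_vehicles_query are hypothetical, and the account name is a placeholder that you pass in.

```shell
# Hypothetical helper: prints the HiveQL statements for a given account.
# $1 = Data Lake Storage Gen1 account name (placeholder)
# IF EXISTS is added here so the DROP is explicit about a missing table.
vehicles_hql() {
    cat <<EOF
DROP TABLE IF EXISTS vehicles;
CREATE EXTERNAL TABLE vehicles (str string) LOCATION 'adl://$1.azuredatalakestore.net:443/';
SELECT * FROM vehicles LIMIT 10;
EOF
}

# Hypothetical helper: runs those statements in one non-interactive hive call.
run_vehicles_query() {
    hive -e "$(vehicles_hql "$1")"
}
```

Calling run_vehicles_query mydatalakestoragegen1 on the cluster head node would run the three statements in order. Keeping the statement text in its own function makes it easy to print and review before running it.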
    

Access Data Lake Storage Gen1 using HDFS commands

Once you have configured the HDInsight cluster to use Data Lake Storage Gen1, you can use the HDFS shell commands to access the store.

In this section you will SSH into an HDInsight Linux cluster and run the HDFS commands. If you are using a Windows client, we recommend using PuTTY, which can be downloaded from https://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

For more information on using PuTTY, see Use SSH with Linux-based Hadoop on HDInsight from Windows.

Once connected, use the following HDFS filesystem command to list the files in the Data Lake Storage Gen1 account.

hdfs dfs -ls adl://<Data Lake Storage Gen1 account name>.azuredatalakestore.net:443/

The output should list the file that you uploaded earlier to Data Lake Storage Gen1, similar to the following:

15/09/17 21:41:15 INFO web.CaboWebHdfsFileSystem: Replacing original urlConnectionFactory with org.apache.hadoop.hdfs.web.URLConnectionFactory@21a728d6
Found 1 items
-rwxrwxrwx   0 NotSupportYet NotSupportYet     671388 2015-09-16 22:16 adl://mydatalakestoragegen1.azuredatalakestore.net:443/mynewfolder

You can also use the hdfs dfs -put command to upload some files to Data Lake Storage Gen1, and then use hdfs dfs -ls to verify whether the files were successfully uploaded.
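
As an illustration of that upload-and-verify sequence, here is a small sketch to run from the SSH session on the cluster. The helper names adl_uri and adl_put_and_list are hypothetical, and the account name, local file, and destination folder are placeholders; only the hdfs dfs subcommands -mkdir, -put, and -ls come from the HDFS shell itself.

```shell
# Hypothetical helper: builds an adl:// URI from an account name and a path.
# $1 = Data Lake Storage Gen1 account name, $2 = path inside the account
adl_uri() {
    # Strip one leading slash from the path so the URI has a single separator.
    printf 'adl://%s.azuredatalakestore.net:443/%s\n' "$1" "${2#/}"
}

# Hypothetical helper: uploads a local file, then lists the destination folder.
# $1 = account name, $2 = local file, $3 = destination folder in the account
adl_put_and_list() {
    dest=$(adl_uri "$1" "$3")
    hdfs dfs -mkdir -p "$dest" &&   # create the folder if it does not exist
    hdfs dfs -put "$2" "$dest" &&   # copy the local file into that folder
    hdfs dfs -ls "$dest"            # list the folder to confirm the upload
}
```

For example, adl_put_and_list mydatalakestoragegen1 ./sample.csv /mynewfolder would copy sample.csv into /mynewfolder and then list that folder. Each step runs only if the previous one succeeded, so a failed upload is not followed by a misleading listing.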

Next steps