Use Azure PowerShell to create an HDInsight cluster with Azure Data Lake Storage Gen1 (as additional storage)

Learn how to use Azure PowerShell to configure an HDInsight cluster with Azure Data Lake Storage Gen1 as additional storage. For instructions on how to create an HDInsight cluster with Data Lake Storage Gen1 as the default storage, see Create an HDInsight cluster with Data Lake Storage Gen1 as default storage.

Note

If you are going to use Data Lake Storage Gen1 as additional storage for an HDInsight cluster, we strongly recommend that you do this while you create the cluster, as described in this article. Adding Data Lake Storage Gen1 as additional storage to an existing HDInsight cluster is a complicated, error-prone process.

For supported cluster types, Data Lake Storage Gen1 can be used as the default storage account or as an additional storage account. When Data Lake Storage Gen1 is used as additional storage, the default storage account for the cluster is still Azure Blob storage (WASB), and cluster-related files (such as logs) are still written to the default storage, while the data that you want to process can be stored in a Data Lake Storage Gen1 account. Using Data Lake Storage Gen1 as an additional storage account does not impact performance or the ability to read from or write to the storage from the cluster.
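
From the cluster's perspective, the two stores are addressed through different URI schemes. The snippet below is only an illustration; the container, storage account, and Data Lake Storage Gen1 account names are placeholders.

```
# Default storage (Azure Blob storage, WASB)
wasb://<container>@<storage-account>.blob.core.windows.net/path/to/file.csv

# Additional storage (Data Lake Storage Gen1)
adl://<data-lake-storage-gen1-name>.azuredatalakestore.net:443/path/to/file.csv
```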

Using Data Lake Storage Gen1 for HDInsight cluster storage

Here are some important considerations for using HDInsight with Data Lake Storage Gen1:

  • The option to create HDInsight clusters with access to Data Lake Storage Gen1 as additional storage is available for HDInsight versions 3.2, 3.4, 3.5, and 3.6.

Configuring HDInsight to work with Data Lake Storage Gen1 using PowerShell involves the following steps:

  • Create a Data Lake Storage Gen1 account
  • Set up authentication for role-based access to Data Lake Storage Gen1
  • Create an HDInsight cluster with authentication to Data Lake Storage Gen1
  • Run a test job on the cluster

Prerequisites

Note

This article has been updated to use the new Azure PowerShell Az module. You can still use the AzureRM module, which will continue to receive bug fixes until at least December 2020. To learn more about the new Az module and AzureRM compatibility, see Introducing the new Azure PowerShell Az module. For Az module installation instructions, see Install Azure PowerShell.

Before you begin this tutorial, you must have the following:

  • An Azure subscription. See Get Azure free trial.

  • Azure PowerShell 1.0 or greater. See How to install and configure Azure PowerShell.

  • Windows SDK. You use the MakeCert and Pvk2Pfx utilities that ship with the SDK to create a security certificate.

  • Azure Active Directory Service Principal. Steps in this tutorial provide instructions on how to create a service principal in Azure AD. However, you must be an Azure AD administrator to be able to create a service principal. If you are an Azure AD administrator, you can skip this prerequisite and proceed with the tutorial.

    If you are not an Azure AD administrator, you will not be able to perform the steps required to create a service principal. In such a case, your Azure AD administrator must first create a service principal before you can create an HDInsight cluster with Data Lake Storage Gen1. Also, the service principal must be created using a certificate, as described at Create a service principal with certificate.

Create a Data Lake Storage Gen1 account

Follow these steps to create a Data Lake Storage Gen1 account.

  1. From your desktop, open a new Azure PowerShell window and enter the following snippet. When prompted to log in, make sure you log in as one of the subscription administrators or owners:

     # Log in to your Azure account
     Connect-AzAccount
    
     # List all the subscriptions associated to your account
     Get-AzSubscription
    
     # Select a subscription
     Set-AzContext -SubscriptionId <subscription ID>
    
     # Register for Data Lake Storage Gen1
     Register-AzResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"
    

    Note

    If you receive an error similar to Register-AzResourceProvider : InvalidResourceNamespace: The resource namespace 'Microsoft.DataLakeStore' is invalid when registering the Data Lake Storage Gen1 resource provider, it is possible that your subscription is not enabled for Data Lake Storage Gen1. Make sure you enable your Azure subscription for Data Lake Storage Gen1 by following these instructions.

  2. A Data Lake Storage Gen1 account is associated with an Azure Resource Group. Start by creating an Azure Resource Group.

     $resourceGroupName = "<your new resource group name>"
     New-AzResourceGroup -Name $resourceGroupName -Location "East US 2"
    

    You should see an output like this:

     ResourceGroupName : hdiadlgrp
     Location          : eastus2
     ProvisioningState : Succeeded
     Tags              :
     ResourceId        : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp
    
  3. Create a Data Lake Storage Gen1 account. The account name you specify must only contain lowercase letters and numbers.

     $dataLakeStorageGen1Name = "<your new Data Lake Storage Gen1 account name>"
     New-AzDataLakeStoreAccount -ResourceGroupName $resourceGroupName -Name $dataLakeStorageGen1Name -Location "East US 2"
    

    You should see an output like the following:

     ...
     ProvisioningState           : Succeeded
     State                       : Active
     CreationTime                : 5/5/2017 10:53:56 PM
     EncryptionState             : Enabled
     ...
     LastModifiedTime            : 5/5/2017 10:53:56 PM
     Endpoint                    : hdiadlstore.azuredatalakestore.net
     DefaultGroup                :
     Id                          : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp/providers/Microsoft.DataLakeStore/accounts/hdiadlstore
     Name                        : hdiadlstore
     Type                        : Microsoft.DataLakeStore/accounts
     Location                    : East US 2
     Tags                        : {}
    
  4. Upload some sample data to Data Lake Storage Gen1. We'll use this later in this article to verify that the data is accessible from an HDInsight cluster. If you are looking for some sample data to upload, you can get the Ambulance Data folder from the Azure Data Lake Git Repository.

     $myrootdir = "/"
     Import-AzDataLakeStoreItem -AccountName $dataLakeStorageGen1Name -Path "C:\<path to data>\vehicle1_09142014.csv" -Destination $myrootdir\vehicle1_09142014.csv
    
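
To confirm the upload before moving on, you can list the contents of the account root. This is an optional sketch; Get-AzDataLakeStoreChildItem is part of the Az.DataLakeStore module and uses the same variable as the step above.

```powershell
# List the items at the root of the Data Lake Storage Gen1 account;
# the file you just uploaded should appear in the output
Get-AzDataLakeStoreChildItem -AccountName $dataLakeStorageGen1Name -Path "/"
```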

Set up authentication for role-based access to Data Lake Storage Gen1

Every Azure subscription is associated with an Azure Active Directory (Azure AD) tenant. Users and services that access resources of the subscription by using the Azure portal or the Azure Resource Manager API must first authenticate with that Azure AD tenant. Access is granted to Azure subscriptions and services by assigning them the appropriate role on an Azure resource. For services, a service principal identifies the service in Azure AD. This section illustrates how to grant an application service, like HDInsight, access to an Azure resource (the Data Lake Storage Gen1 account you created earlier) by creating a service principal for the application and assigning roles to it via Azure PowerShell.

To set up Active Directory authentication for Data Lake Storage Gen1, you must perform the following tasks.

  • Create a self-signed certificate
  • Create an application in Azure Active Directory and a Service Principal

Create a self-signed certificate

Make sure you have the Windows SDK installed before proceeding with the steps in this section. You must also have created a directory, such as C:\mycertdir, where the certificate will be created.

  1. From the PowerShell window, navigate to the location where you installed the Windows SDK (typically, C:\Program Files (x86)\Windows Kits\10\bin\x86), and use the MakeCert utility to create a self-signed certificate and a private key. Use the following commands.

     $certificateFileDir = "<my certificate directory>"
     cd $certificateFileDir
    
     makecert -sv mykey.pvk -n "cn=HDI-ADL-SP" CertFile.cer -r -len 2048
    

    You will be prompted to enter the private key password. After the command successfully executes, you should see a CertFile.cer and mykey.pvk in the certificate directory you specified.

  2. Use the Pvk2Pfx utility to convert the .pvk and .cer files that MakeCert created to a .pfx file. Run the following command.

     pvk2pfx -pvk mykey.pvk -spc CertFile.cer -pfx CertFile.pfx -po <password>
    

    When prompted, enter the private key password you specified earlier. The value you specify for the -po parameter is the password that is associated with the .pfx file. After the command successfully completes, you should also see a CertFile.pfx in the certificate directory you specified.

Create an Azure Active Directory application and a service principal

In this section, you perform the steps to create a service principal for an Azure Active Directory application, assign a role to the service principal, and authenticate as the service principal by providing a certificate. Run the following commands to create an application in Azure Active Directory.

  1. Paste the following cmdlets in the PowerShell console window. Make sure the value you specify for the -DisplayName property is unique. Also, the values for -HomePage and -IdentifierUris are placeholder values and are not verified.

     $certificateFilePath = "$certificateFileDir\CertFile.pfx"
    
     $password = Read-Host -Prompt "Enter the password" # This is the password you specified for the .pfx file
    
     $certificatePFX = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $password)
    
     $rawCertificateData = $certificatePFX.GetRawCertData()
    
     $credential = [System.Convert]::ToBase64String($rawCertificateData)
    
     $application = New-AzADApplication `
         -DisplayName "HDIADL" `
         -HomePage "https://contoso.com" `
         -IdentifierUris "https://mycontoso.com" `
         -CertValue $credential  `
         -StartDate $certificatePFX.NotBefore  `
         -EndDate $certificatePFX.NotAfter
    
     $applicationId = $application.ApplicationId
    
  2. Create a service principal using the application ID.

     $servicePrincipal = New-AzADServicePrincipal -ApplicationId $applicationId
    
     $objectId = $servicePrincipal.Id
    
  3. Grant the service principal access to the Data Lake Storage Gen1 folder and the file that you will access from the HDInsight cluster. The snippet below provides access to the root of the Data Lake Storage Gen1 account (where you copied the sample data file), and the file itself.

     Set-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path / -AceType User -Id $objectId -Permissions All
     Set-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path /vehicle1_09142014.csv -AceType User -Id $objectId -Permissions All
    
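
If you later store your sample data under a folder rather than at the account root, the ACL must cover the folder contents as well. The following is a hedged sketch: the /sampledata path is a hypothetical example, and it assumes your Az.DataLakeStore module version supports the -Recurse switch of Set-AzDataLakeStoreItemAclEntry, which applies the entry to everything under the path.

```powershell
# Grant the service principal access to a folder and everything beneath it
# (/sampledata is a placeholder path used only for illustration)
Set-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path /sampledata -AceType User -Id $objectId -Permissions All -Recurse
```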

Create an HDInsight Linux cluster with Data Lake Storage Gen1 as additional storage

In this section, we create an HDInsight Hadoop Linux cluster with Data Lake Storage Gen1 as additional storage. For this release, the HDInsight cluster and Data Lake Storage Gen1 account must be in the same location.

  1. Start by retrieving the subscription tenant ID. You will need it later.

     $tenantID = (Get-AzContext).Tenant.TenantId
    
  2. For this release, for a Hadoop cluster, Data Lake Storage Gen1 can be used only as additional storage for the cluster. The default storage is still Azure Blob storage (WASB). So, we'll first create the storage account and storage container required for the cluster.

     # Create an Azure storage account
     $location = "East US 2"
     $storageAccountName = "<StorageAccountName>"   # Provide a Storage account name
    
     New-AzStorageAccount -ResourceGroupName $resourceGroupName -StorageAccountName $storageAccountName -Location $location -SkuName Standard_GRS
    
     # Create an Azure Blob Storage container
     $containerName = "<ContainerName>"              # Provide a container name
     $storageAccountKey = (Get-AzStorageAccountKey -Name $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
     $destContext = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
     New-AzStorageContainer -Name $containerName -Context $destContext
    
  3. Create the HDInsight cluster. Use the following cmdlets.

     # Set these variables
     $clusterName = $containerName                   # As a best practice, have the same name for the cluster and container
     $clusterNodes = <ClusterSizeInNodes>            # The number of nodes in the HDInsight cluster
     $httpCredentials = Get-Credential
     $sshCredentials = Get-Credential
    
     New-AzHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName -HttpCredential $httpCredentials -Location $location -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName  -ClusterSizeInNodes $clusterNodes -ClusterType Hadoop -Version "3.4" -OSType Linux -SshCredential $sshCredentials -ObjectID $objectId -AadTenantId $tenantID -CertificateFilePath $certificateFilePath -CertificatePassword $password
    

    After the cmdlet successfully completes, you should see an output listing the cluster details.
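
If you want to confirm the deployment separately, you can retrieve the cluster details afterward. This is an optional sketch that reuses the variable names from the steps above.

```powershell
# Retrieve and display details of the newly created cluster
Get-AzHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName
```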

Run test jobs on the HDInsight cluster to use the Data Lake Storage Gen1 account

After you have configured an HDInsight cluster, you can run test jobs on it to verify that the cluster can access Data Lake Storage Gen1. To do so, we will run a sample Hive job that creates a table using the sample data that you uploaded earlier to your Data Lake Storage Gen1 account.

In this section, you will SSH into the HDInsight Linux cluster you created and run a sample Hive query.

  1. Once connected, start the Hive CLI by using the following command:

     hive
    
  2. Using the CLI, enter the following statements to create a new table named vehicles by using the sample data in Data Lake Storage Gen1:

     DROP TABLE vehicles;
     CREATE EXTERNAL TABLE vehicles (str string) LOCATION 'adl://<mydatalakestoragegen1>.azuredatalakestore.net:443/';
     SELECT * FROM vehicles LIMIT 10;
    

    You should see an output similar to the following:

     1,1,2014-09-14 00:00:03,46.81006,-92.08174,51,S,1
     1,2,2014-09-14 00:00:06,46.81006,-92.08174,13,NE,1
     1,3,2014-09-14 00:00:09,46.81006,-92.08174,48,NE,1
     1,4,2014-09-14 00:00:12,46.81006,-92.08174,30,W,1
     1,5,2014-09-14 00:00:15,46.81006,-92.08174,47,S,1
     1,6,2014-09-14 00:00:18,46.81006,-92.08174,9,S,1
     1,7,2014-09-14 00:00:21,46.81006,-92.08174,53,N,1
     1,8,2014-09-14 00:00:24,46.81006,-92.08174,63,SW,1
     1,9,2014-09-14 00:00:27,46.81006,-92.08174,4,NE,1
     1,10,2014-09-14 00:00:30,46.81006,-92.08174,31,N,1
    

Access Data Lake Storage Gen1 using HDFS commands

Once you have configured the HDInsight cluster to use Data Lake Storage Gen1, you can use HDFS shell commands to access the store.

In this section, you will SSH into the HDInsight Linux cluster you created and run HDFS commands.

Once connected, use the following HDFS filesystem command to list the files in the Data Lake Storage Gen1 account.

hdfs dfs -ls adl://<Data Lake Storage Gen1 account name>.azuredatalakestore.net:443/

This should list the file that you uploaded earlier to Data Lake Storage Gen1.

15/09/17 21:41:15 INFO web.CaboWebHdfsFileSystem: Replacing original urlConnectionFactory with org.apache.hadoop.hdfs.web.URLConnectionFactory@21a728d6
Found 1 items
-rwxrwxrwx   0 NotSupportYet NotSupportYet     671388 2015-09-16 22:16 adl://mydatalakestoragegen1.azuredatalakestore.net:443/mynewfolder

You can also use the hdfs dfs -put command to upload some files to Data Lake Storage Gen1, and then use hdfs dfs -ls to verify whether the files were successfully uploaded.
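
For example, from an SSH session on the cluster (the account name, folder, and file name below are placeholders):

```shell
# Upload a local file to Data Lake Storage Gen1
hdfs dfs -put testfile.txt adl://mydatalakestoragegen1.azuredatalakestore.net:443/mynewfolder/

# Verify that the upload succeeded
hdfs dfs -ls adl://mydatalakestoragegen1.azuredatalakestore.net:443/mynewfolder/
```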

See Also