Use Azure PowerShell to create an HDInsight cluster with Data Lake Store (as additional storage)

Learn how to use Azure PowerShell to configure an HDInsight cluster with Azure Data Lake Store as additional storage. For instructions on how to create an HDInsight cluster with Azure Data Lake Store as the default storage, see Create an HDInsight cluster with Data Lake Store as default storage.

Note

If you are going to use Azure Data Lake Store as additional storage for an HDInsight cluster, we strongly recommend that you do so while you create the cluster, as described in this article. Adding Azure Data Lake Store as additional storage to an existing HDInsight cluster is a complicated process and prone to errors.

For supported cluster types, Data Lake Store can be used as the default storage or as an additional storage account. When Data Lake Store is used as additional storage, the default storage account for the cluster is still Azure Storage Blobs (WASB), and cluster-related files (such as logs) are still written to the default storage, while the data that you want to process can be stored in a Data Lake Store account. Using Data Lake Store as an additional storage account does not impact performance or the ability to read from and write to the storage from the cluster.

Using Data Lake Store for HDInsight cluster storage

Here are some important considerations for using HDInsight with Data Lake Store:

  • The option to create HDInsight clusters with access to Data Lake Store as additional storage is available for HDInsight versions 3.2, 3.4, 3.5, and 3.6.

Configuring HDInsight to work with Data Lake Store using PowerShell involves the following steps:

  • Create an Azure Data Lake Store
  • Set up authentication for role-based access to Data Lake Store
  • Create HDInsight cluster with authentication to Data Lake Store
  • Run a test job on the cluster

Prerequisites

Before you begin this tutorial, you must have the following:

  • An Azure subscription. See Get Azure free trial.
  • Azure PowerShell 1.0 or greater. See How to install and configure Azure PowerShell.
  • Windows SDK. You can install it from here. You use this to create a security certificate.
  • Azure Active Directory Service Principal. Steps in this tutorial provide instructions on how to create a service principal in Azure AD. However, you must be an Azure AD administrator to be able to create a service principal. If you are an Azure AD administrator, you can skip this prerequisite and proceed with the tutorial.

    If you are not an Azure AD administrator, you will not be able to perform the steps required to create a service principal. In such a case, your Azure AD administrator must first create a service principal before you can create an HDInsight cluster with Data Lake Store. Also, the service principal must be created using a certificate, as described at Create a service principal with certificate.

Create an Azure Data Lake Store

Follow these steps to create a Data Lake Store.

  1. From your desktop, open a new Azure PowerShell window, and enter the following snippet. When prompted to log in, make sure you log in as one of the subscription administrators or owners:

     # Log in to your Azure account
     Login-AzureRmAccount
    
     # List all the subscriptions associated to your account
     Get-AzureRmSubscription
    
     # Select a subscription
     Set-AzureRmContext -SubscriptionId <subscription ID>
    
     # Register for Data Lake Store
     Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"
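
     # Optional sanity check: confirm that the provider shows as "Registered"
     # (Get-AzureRmResourceProvider is part of the AzureRM.Resources module)
     Get-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"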
    
    Note

    If you receive an error similar to Register-AzureRmResourceProvider : InvalidResourceNamespace: The resource namespace 'Microsoft.DataLakeStore' is invalid when registering the Data Lake Store resource provider, it is possible that your subscription is not whitelisted for Azure Data Lake Store. Make sure you enable your Azure subscription for Data Lake Store public preview by following these instructions.

  2. An Azure Data Lake Store account is associated with an Azure Resource Group. Start by creating an Azure Resource Group.

     $resourceGroupName = "<your new resource group name>"
     New-AzureRmResourceGroup -Name $resourceGroupName -Location "East US 2"
    

    You should see an output like this:

     ResourceGroupName : hdiadlgrp
     Location          : eastus2
     ProvisioningState : Succeeded
     Tags              :
     ResourceId        : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp
    
  3. Create an Azure Data Lake Store account. The account name you specify must only contain lowercase letters and numbers.

     $dataLakeStoreName = "<your new Data Lake Store name>"
     New-AzureRmDataLakeStoreAccount -ResourceGroupName $resourceGroupName -Name $dataLakeStoreName -Location "East US 2"
    

    You should see an output like the following:

     ...
     ProvisioningState           : Succeeded
     State                       : Active
     CreationTime                : 5/5/2017 10:53:56 PM
     EncryptionState             : Enabled
     ...
     LastModifiedTime            : 5/5/2017 10:53:56 PM
     Endpoint                    : hdiadlstore.azuredatalakestore.net
     DefaultGroup                :
     Id                          : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp/providers/Microsoft.DataLakeStore/accounts/hdiadlstore
     Name                        : hdiadlstore
     Type                        : Microsoft.DataLakeStore/accounts
     Location                    : East US 2
     Tags                        : {}
    
  4. Upload some sample data to Azure Data Lake. We'll use this later in this article to verify that the data is accessible from an HDInsight cluster. If you are looking for some sample data to upload, you can get the Ambulance Data folder from the Azure Data Lake Git Repository.

     $myrootdir = "/"
     Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path "C:\<path to data>\vehicle1_09142014.csv" -Destination $myrootdir\vehicle1_09142014.csv
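
     # Optional: verify the upload by listing the root of the Data Lake Store account;
     # the vehicle1_09142014.csv file should appear in the output
     # (Get-AzureRmDataLakeStoreChildItem is part of the AzureRM.DataLakeStore module)
     Get-AzureRmDataLakeStoreChildItem -AccountName $dataLakeStoreName -Path $myrootdir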
    

Set up authentication for role-based access to Data Lake Store

Every Azure subscription is associated with an Azure Active Directory. Users and services that access resources of the subscription through the Azure Classic Portal or the Azure Resource Manager API must first authenticate with that Azure Active Directory. Access is granted to Azure subscriptions and services by assigning them the appropriate role on an Azure resource. For services, a service principal identifies the service in Azure Active Directory (Azure AD). This section illustrates how to grant an application service, such as HDInsight, access to an Azure resource (the Data Lake Store account that you created earlier) by creating a service principal for the application and assigning roles to it through Azure PowerShell.

To set up Active Directory authentication for Azure Data Lake, you must perform the following tasks.

  • Create a self-signed certificate
  • Create an application in Azure Active Directory and a Service Principal

Create a self-signed certificate

Make sure you have the Windows SDK installed before proceeding with the steps in this section. You must also have created a directory, such as C:\mycertdir, where the certificate will be created.
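
If you still need to create that directory, the following optional snippet creates it from the same PowerShell window (C:\mycertdir is just an example path).

     # Create the working directory for the certificate files (example path)
     New-Item -ItemType Directory -Path "C:\mycertdir" -Force | Out-Null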

  1. From the PowerShell window, navigate to the location where you installed the Windows SDK (typically, C:\Program Files (x86)\Windows Kits\10\bin\x86) and use the MakeCert utility to create a self-signed certificate and a private key. Use the following commands.

     $certificateFileDir = "<my certificate directory>"
     cd $certificateFileDir
    
     makecert -sv mykey.pvk -n "cn=HDI-ADL-SP" CertFile.cer -r -len 2048
    

    You will be prompted to enter the private key password. After the command successfully executes, you should see a CertFile.cer file and a mykey.pvk file in the certificate directory you specified.

  2. Use the Pvk2Pfx utility to convert the .pvk and .cer files that MakeCert created to a .pfx file. Run the following command.

     pvk2pfx -pvk mykey.pvk -spc CertFile.cer -pfx CertFile.pfx -po <password>
    

    When prompted, enter the private key password you specified earlier. The value you specify for the -po parameter is the password that is associated with the .pfx file. After the command successfully completes, you should also see a CertFile.pfx in the certificate directory you specified.
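
If you don't have the Windows SDK available, you can instead generate the certificate entirely in PowerShell. The following is a minimal sketch, assuming Windows 10 or Windows Server 2016 (where New-SelfSignedCertificate supports the -Subject and -KeyLength parameters); it creates the certificate in the current user's store and exports it to a password-protected .pfx file that you can use in place of CertFile.pfx in the remaining steps.

     # Create a self-signed certificate in the current user's certificate store
     $cert = New-SelfSignedCertificate -Subject "CN=HDI-ADL-SP" -KeyLength 2048 `
         -CertStoreLocation "Cert:\CurrentUser\My"

     # Export the certificate and its private key to a password-protected .pfx file
     $pfxPassword = Read-Host -Prompt "Enter a password for the .pfx file" -AsSecureString
     Export-PfxCertificate -Cert $cert -FilePath "$certificateFileDir\CertFile.pfx" -Password $pfxPassword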

Create an Azure Active Directory application and a service principal

In this section, you perform the steps to create a service principal for an Azure Active Directory application, assign a role to the service principal, and authenticate as the service principal by providing a certificate. Run the following commands to create an application in Azure Active Directory.

  1. Paste the following cmdlets in the PowerShell console window. Make sure the value you specify for the -DisplayName property is unique. Also, the values for -HomePage and -IdentifierUris are placeholder values and are not verified.

     $certificateFilePath = "$certificateFileDir\CertFile.pfx"
    
     $password = Read-Host -Prompt "Enter the password" # This is the password you specified for the .pfx file
    
     $certificatePFX = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $password)
    
     $rawCertificateData = $certificatePFX.GetRawCertData()
    
     $credential = [System.Convert]::ToBase64String($rawCertificateData)
    
     $application = New-AzureRmADApplication `
         -DisplayName "HDIADL" `
         -HomePage "https://contoso.com" `
         -IdentifierUris "https://mycontoso.com" `
         -CertValue $credential  `
         -StartDate $certificatePFX.NotBefore  `
         -EndDate $certificatePFX.NotAfter
    
     $applicationId = $application.ApplicationId
    
  2. Create a service principal using the application ID.

     $servicePrincipal = New-AzureRmADServicePrincipal -ApplicationId $applicationId
    
     $objectId = $servicePrincipal.Id
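
     # Optional: read back the application and service principal to confirm they exist
     # (Get-AzureRmADApplication and Get-AzureRmADServicePrincipal are in AzureRM.Resources)
     Get-AzureRmADApplication -ApplicationId $applicationId
     Get-AzureRmADServicePrincipal -ObjectId $objectId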
    
  3. Grant the service principal access to the Data Lake Store folder and the file that you will access from the HDInsight cluster. The snippet below provides access to the root of the Data Lake Store account (where you copied the sample data file), and the file itself.

     Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $objectId -Permissions All
     Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path /vehicle1_09142014.csv -AceType User -Id $objectId -Permissions All
    

Create an HDInsight Linux cluster with Data Lake Store as additional storage

In this section, we create an HDInsight Hadoop Linux cluster with Data Lake Store as additional storage. For this release, the HDInsight cluster and the Data Lake Store must be in the same location.

  1. Start by retrieving the subscription tenant ID. You will need it later.

     $tenantID = (Get-AzureRmContext).Tenant.TenantId
    
  2. For this release, for a Hadoop cluster, Data Lake Store can only be used as additional storage for the cluster. The default storage is still Azure Storage Blobs (WASB), so we'll first create the storage account and storage container required for the cluster.

     # Create an Azure storage account
     $location = "East US 2"
     $storageAccountName = "<StorageAccountName>"    # Provide a Storage account name
    
     New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -StorageAccountName $storageAccountName -Location $location -Type Standard_GRS
    
     # Create an Azure Blob Storage container
     $containerName = "<ContainerName>"              # Provide a container name
     $storageAccountKey = (Get-AzureRmStorageAccountKey -Name $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
     $destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
     New-AzureStorageContainer -Name $containerName -Context $destContext
    
  3. Create the HDInsight cluster. Use the following cmdlets.

     # Set these variables
     $clusterName = $containerName                   # As a best practice, have the same name for the cluster and container
     $clusterNodes = <ClusterSizeInNodes>            # The number of nodes in the HDInsight cluster
     $httpCredentials = Get-Credential               # Cluster login (HTTP) user name and password
     $sshCredentials = Get-Credential                # SSH user name and password
    
     New-AzureRmHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName `
         -HttpCredential $httpCredentials -SshCredential $sshCredentials -Location $location `
         -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
         -DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainer $containerName `
         -ClusterSizeInNodes $clusterNodes -ClusterType Hadoop -Version "3.4" -OSType Linux `
         -ObjectID $objectId -AadTenantId $tenantID `
         -CertificateFilePath $certificateFilePath -CertificatePassword $password
    

    After the cmdlet successfully completes, you should see an output listing the cluster details.
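
If you want to look up the cluster again later, the following optional snippet retrieves the same details; Get-AzureRmHDInsightCluster is part of the AzureRM.HDInsight module.

     # Retrieve the cluster details to confirm that provisioning succeeded
     Get-AzureRmHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName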

Run test jobs on the HDInsight cluster to use the Data Lake Store

After you have configured an HDInsight cluster, you can run test jobs on the cluster to test that the HDInsight cluster can access Data Lake Store. To do so, we will run a sample Hive job that creates a table using the sample data that you uploaded earlier to your Data Lake Store.

In this section, you will SSH into the HDInsight Linux cluster you created and run a sample Hive query. To connect, use SSH with the cluster's SSH endpoint, <clustername>-ssh.azurehdinsight.net.

  1. Once connected, start the Hive CLI by using the following command:

     hive
    
  2. Using the CLI, enter the following statements to create a new table named vehicles by using the sample data in the Data Lake Store:

     DROP TABLE vehicles;
     CREATE EXTERNAL TABLE vehicles (str string) LOCATION 'adl://<mydatalakestore>.azuredatalakestore.net:443/';
     SELECT * FROM vehicles LIMIT 10;
    

    You should see an output similar to the following:

     1,1,2014-09-14 00:00:03,46.81006,-92.08174,51,S,1
     1,2,2014-09-14 00:00:06,46.81006,-92.08174,13,NE,1
     1,3,2014-09-14 00:00:09,46.81006,-92.08174,48,NE,1
     1,4,2014-09-14 00:00:12,46.81006,-92.08174,30,W,1
     1,5,2014-09-14 00:00:15,46.81006,-92.08174,47,S,1
     1,6,2014-09-14 00:00:18,46.81006,-92.08174,9,S,1
     1,7,2014-09-14 00:00:21,46.81006,-92.08174,53,N,1
     1,8,2014-09-14 00:00:24,46.81006,-92.08174,63,SW,1
     1,9,2014-09-14 00:00:27,46.81006,-92.08174,4,NE,1
     1,10,2014-09-14 00:00:30,46.81006,-92.08174,31,N,1
    

Access Data Lake Store using HDFS commands

Once you have configured the HDInsight cluster to use Data Lake Store, you can use the HDFS shell commands to access the store.

In this section, you will SSH into the HDInsight Linux cluster you created and run HDFS commands.

Once connected, use the following HDFS filesystem command to list the files in the Data Lake Store.

hdfs dfs -ls adl://<Data Lake Store account name>.azuredatalakestore.net:443/

This should list the file that you uploaded earlier to the Data Lake Store.

15/09/17 21:41:15 INFO web.CaboWebHdfsFileSystem: Replacing original urlConnectionFactory with org.apache.hadoop.hdfs.web.URLConnectionFactory@21a728d6
Found 1 items
-rwxrwxrwx   0 NotSupportYet NotSupportYet     671388 2015-09-16 22:16 adl://mydatalakestore.azuredatalakestore.net:443/mynewfolder

You can also use the hdfs dfs -put command to upload some files to the Data Lake Store, and then use hdfs dfs -ls to verify whether the files were successfully uploaded.

See Also