Use Azure PowerShell to create an HDInsight cluster with Azure Data Lake Storage Gen1 (as additional storage)

Learn how to use Azure PowerShell to configure an HDInsight cluster with Azure Data Lake Storage Gen1, as additional storage. For instructions on how to create an HDInsight cluster with Data Lake Storage Gen1 as default storage, see Create an HDInsight cluster with Data Lake Storage Gen1 as default storage.

Note

If you are going to use Data Lake Storage Gen1 as additional storage for an HDInsight cluster, we strongly recommend that you do this while you create the cluster, as described in this article. Adding Data Lake Storage Gen1 as additional storage to an existing HDInsight cluster is a complicated and error-prone process.

For supported cluster types, Data Lake Storage Gen1 can be used as the default storage account or as an additional storage account. When Data Lake Storage Gen1 is used as additional storage, the default storage account for the cluster is still Azure Blob storage (WASB), and cluster-related files (such as logs) are still written to the default storage, while the data that you want to process is stored in Data Lake Storage Gen1. Using Data Lake Storage Gen1 as an additional storage account does not impact performance or the ability to read from and write to the storage from the cluster.
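For example, the two storage accounts are addressed with different URI schemes from the cluster: wasb paths point at the default Blob storage account, while adl paths point at the Data Lake Storage Gen1 account. The account, container, and path names below are placeholders:

wasb://<container>@<storage-account>.blob.core.windows.net/example/data/file.csv
adl://<data-lake-storage-gen1-account>.azuredatalakestore.net/example/data/file.csv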

Using Data Lake Storage Gen1 for HDInsight cluster storage

Here are some important considerations for using HDInsight with Data Lake Storage Gen1:

  • The option to create HDInsight clusters with access to Data Lake Storage Gen1 as additional storage is available for HDInsight versions 3.2, 3.4, 3.5, and 3.6.

Configuring HDInsight to work with Data Lake Storage Gen1 using PowerShell involves the following steps:

  • Create a Data Lake Storage Gen1 account
  • Set up authentication for role-based access to Data Lake Storage Gen1
  • Create HDInsight cluster with authentication to Data Lake Storage Gen1
  • Run a test job on the cluster

Prerequisites

Note

We recommend that you use the Azure Az PowerShell module to interact with Azure. See Install Azure PowerShell to get started. To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
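If you don't have the Az module yet, a typical installation from the PowerShell Gallery looks like the following (standard PowerShellGet usage; run from an elevated session):

Install-Module -Name Az -Repository PSGallery -Force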

Before you begin this tutorial, you must have the following:

  • An Azure subscription. See Get Azure free trial.

  • Azure PowerShell 1.0 or greater. See How to install and configure Azure PowerShell.

  • Windows SDK. You can install it from here. You use this to create a security certificate.

  • Microsoft Entra service principal. Steps in this tutorial provide instructions on how to create a service principal in Microsoft Entra ID. However, you must be a Microsoft Entra administrator to be able to create a service principal. If you are a Microsoft Entra administrator, you can skip this prerequisite and proceed with the tutorial.

    If you are not a Microsoft Entra administrator, you will not be able to perform the steps required to create a service principal. In such a case, your Microsoft Entra administrator must first create a service principal before you can create an HDInsight cluster with Data Lake Storage Gen1. Also, the service principal must be created using a certificate, as described at Create a service principal with certificate.

Create a Data Lake Storage Gen1 account

Follow these steps to create a Data Lake Storage Gen1 account.

  1. From your desktop, open a new Azure PowerShell window, and enter the following snippet. When prompted to log in, make sure you log in as one of the subscription administrators or owners:

    # Log in to your Azure account
    Connect-AzAccount
    
    # List all the subscriptions associated to your account
    Get-AzSubscription
    
    # Select a subscription
    Set-AzContext -SubscriptionId <subscription ID>
    
    # Register for Data Lake Storage Gen1
    Register-AzResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"
    

    Note

    If you receive an error similar to Register-AzResourceProvider : InvalidResourceNamespace: The resource namespace 'Microsoft.DataLakeStore' is invalid when registering the Data Lake Storage Gen1 resource provider, it is possible that your subscription is not approved for Data Lake Storage Gen1. Make sure you enable your Azure subscription for Data Lake Storage Gen1 by following these instructions.
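    To confirm that the provider was registered, you can query its state; the RegistrationState property shows Registered once the operation completes:

    Get-AzResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"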

  2. A storage account with Data Lake Storage Gen1 is associated with an Azure Resource Group. Start by creating an Azure Resource Group.

    $resourceGroupName = "<your new resource group name>"
    New-AzResourceGroup -Name $resourceGroupName -Location "East US 2"
    

    You should see an output like this:

    ResourceGroupName : hdiadlgrp
    Location          : eastus2
    ProvisioningState : Succeeded
    Tags              :
    ResourceId        : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp
    
  3. Create a storage account with Data Lake Storage Gen1. The account name you specify must only contain lowercase letters and numbers.

    $dataLakeStorageGen1Name = "<your new storage account with Data Lake Storage Gen1 name>"
    New-AzDataLakeStoreAccount -ResourceGroupName $resourceGroupName -Name $dataLakeStorageGen1Name -Location "East US 2"
    

    You should see an output like the following:

    ...
    ProvisioningState           : Succeeded
    State                       : Active
    CreationTime                : 5/5/2017 10:53:56 PM
    EncryptionState             : Enabled
    ...
    LastModifiedTime            : 5/5/2017 10:53:56 PM
    Endpoint                    : hdiadlstore.azuredatalakestore.net
    DefaultGroup                :
    Id                          : /subscriptions/<subscription-id>/resourceGroups/hdiadlgrp/providers/Microsoft.DataLakeStore/accounts/hdiadlstore
    Name                        : hdiadlstore
    Type                        : Microsoft.DataLakeStore/accounts
    Location                    : East US 2
    Tags                        : {}
    
  4. Upload some sample data to Data Lake Storage Gen1. We'll use this later in this article to verify that the data is accessible from an HDInsight cluster. If you are looking for some sample data to upload, you can get the Ambulance Data folder from the Azure Data Lake Git Repository.

    $myrootdir = "/"
    Import-AzDataLakeStoreItem -AccountName $dataLakeStorageGen1Name -Path "C:\<path to data>\vehicle1_09142014.csv" -Destination $myrootdir\vehicle1_09142014.csv
    
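    To verify the upload, you can list the items at the root of the account. This check is optional and just confirms that the file is in place:

    Get-AzDataLakeStoreChildItem -AccountName $dataLakeStorageGen1Name -Path $myrootdir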

Set up authentication for role-based access to Data Lake Storage Gen1

Every Azure subscription is associated with a Microsoft Entra ID. Users and services that access resources of the subscription using the Azure portal or the Azure Resource Manager API must first authenticate with that Microsoft Entra ID. Access is granted to Azure subscriptions and services by assigning them the appropriate role on an Azure resource. For services, a service principal identifies the service in Microsoft Entra ID. This section illustrates how to grant an application service, like HDInsight, access to an Azure resource (the storage account with Data Lake Storage Gen1 you created earlier) by creating a service principal for the application and assigning roles to it via Azure PowerShell.

To set up Microsoft Entra authentication for Data Lake Storage Gen1, you must perform the following tasks:

  • Create a self-signed certificate
  • Create an application in Microsoft Entra ID and a service principal

Create a self-signed certificate

Make sure you have the Windows SDK installed before proceeding with the steps in this section. You must also have created a directory, such as C:\mycertdir, where the certificate will be created.

  1. From the PowerShell window, navigate to the location where you installed the Windows SDK (typically, C:\Program Files (x86)\Windows Kits\10\bin\x86) and use the MakeCert utility to create a self-signed certificate and a private key. Use the following commands.

    $certificateFileDir = "<my certificate directory>"
    cd $certificateFileDir
    
    makecert -sv mykey.pvk -n "cn=HDI-ADL-SP" CertFile.cer -r -len 2048
    

    You will be prompted to enter the private key password. After the command successfully executes, you should see a CertFile.cer and mykey.pvk in the certificate directory you specified.

  2. Use the Pvk2Pfx utility to convert the .pvk and .cer files that MakeCert created to a .pfx file. Run the following command.

    pvk2pfx -pvk mykey.pvk -spc CertFile.cer -pfx CertFile.pfx -po <password>
    

    When prompted, enter the private key password you specified earlier. The value you specify for the -po parameter is the password that is associated with the .pfx file. After the command successfully completes, you should also see a CertFile.pfx in the certificate directory you specified.
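The MakeCert and Pvk2Pfx utilities ship with older Windows SDKs. As an alternative, here is a minimal sketch that stays entirely in PowerShell, assuming Windows 8/Windows Server 2012 or later with the built-in PKI cmdlets; it produces an equivalent CertFile.pfx:

# Create a self-signed certificate in the current user's certificate store
$cert = New-SelfSignedCertificate -Subject "CN=HDI-ADL-SP" -KeyLength 2048 -CertStoreLocation "Cert:\CurrentUser\My"

# Export the certificate and its private key to a password-protected .pfx file
$pfxPassword = Read-Host -Prompt "Enter the .pfx password" -AsSecureString
Export-PfxCertificate -Cert $cert -FilePath "$certificateFileDir\CertFile.pfx" -Password $pfxPassword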

Create a Microsoft Entra application and a service principal

In this section, you perform the steps to create a service principal for a Microsoft Entra application, assign a role to the service principal, and authenticate as the service principal by providing a certificate. Run the following commands to create an application in Microsoft Entra ID.

  1. Paste the following cmdlets in the PowerShell console window. Make sure the value you specify for the -DisplayName property is unique. Also, the values for -HomePage and -IdentifierUris are placeholder values and are not verified.

    $certificateFilePath = "$certificateFileDir\CertFile.pfx"
    
    $password = Read-Host -Prompt "Enter the password" # This is the password you specified for the .pfx file
    
    $certificatePFX = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $password)
    
    $rawCertificateData = $certificatePFX.GetRawCertData()
    
    $credential = [System.Convert]::ToBase64String($rawCertificateData)
    
    $application = New-AzADApplication `
        -DisplayName "HDIADL" `
        -HomePage "https://contoso.com" `
        -IdentifierUris "https://contoso.com" `
        -CertValue $credential  `
        -StartDate $certificatePFX.NotBefore  `
        -EndDate $certificatePFX.NotAfter
    
    $applicationId = $application.ApplicationId
    
  2. Create a service principal using the application ID.

    $servicePrincipal = New-AzADServicePrincipal -ApplicationId $applicationId -Role Contributor
    
    $objectId = $servicePrincipal.Id
    
  3. Grant the service principal access to the Data Lake Storage Gen1 folder and the file that you will access from the HDInsight cluster. The snippet below provides access to the root of the storage account with Data Lake Storage Gen1 (where you copied the sample data file), and the file itself.

    Set-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path / -AceType User -Id $objectId -Permissions All
    Set-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path /vehicle1_09142014.csv -AceType User -Id $objectId -Permissions All
    
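To double-check the assignment, you can list the ACL entries on the root folder; the entry for the service principal's object ID should show rwx permissions. This verification is optional:

Get-AzDataLakeStoreItemAclEntry -AccountName $dataLakeStorageGen1Name -Path /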

Create an HDInsight Linux cluster with Data Lake Storage Gen1 as additional storage

In this section, we create an HDInsight Hadoop Linux cluster with Data Lake Storage Gen1 as additional storage. For this release, the HDInsight cluster and storage account with Data Lake Storage Gen1 must be in the same location.

  1. Start by retrieving the subscription tenant ID. You will need it later.

    $tenantID = (Get-AzContext).Tenant.TenantId
    
  2. For this release, for a Hadoop cluster, Data Lake Storage Gen1 can be used only as additional storage for the cluster. The default storage is still Azure Blob storage (WASB). So, we'll first create the storage account and storage container required for the cluster.

    # Create an Azure storage account
    $location = "East US 2"
    $storageAccountName = "<StorageAccountName>"   # Provide a Storage account name
    
    New-AzStorageAccount -ResourceGroupName $resourceGroupName -Name $storageAccountName -Location $location -SkuName Standard_GRS
    
    # Create an Azure Blob Storage container
    $containerName = "<ContainerName>"              # Provide a container name
    $storageAccountKey = (Get-AzStorageAccountKey -Name $storageAccountName -ResourceGroupName $resourceGroupName)[0].Value
    $destContext = New-AzStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
    New-AzStorageContainer -Name $containerName -Context $destContext
    
  3. Create the HDInsight cluster. Use the following cmdlets.

    # Set these variables
    $clusterName = $containerName                   # As a best practice, have the same name for the cluster and container
    $clusterNodes = <ClusterSizeInNodes>            # The number of nodes in the HDInsight cluster
    $httpCredentials = Get-Credential
    $sshCredentials = Get-Credential
    
    New-AzHDInsightCluster `
        -ClusterName $clusterName `
        -ResourceGroupName $resourceGroupName `
        -HttpCredential $httpCredentials `
        -Location $location `
        -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageAccountKey `
        -DefaultStorageContainer $containerName `
        -ClusterSizeInNodes $clusterNodes `
        -ClusterType Hadoop `
        -Version "3.4" `
        -OSType Linux `
        -SshCredential $sshCredentials `
        -ObjectId $objectId `
        -AadTenantId $tenantID `
        -CertificateFilePath $certificateFilePath `
        -CertificatePassword $password
    

    After the cmdlet successfully completes, you should see an output listing the cluster details.
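You can also retrieve the cluster afterwards to confirm its provisioning state and configured storage; this is just a sanity check using the same variables as above:

Get-AzHDInsightCluster -ClusterName $clusterName -ResourceGroupName $resourceGroupName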

Run test jobs on the HDInsight cluster to use Data Lake Storage Gen1

After you have configured an HDInsight cluster, you can run test jobs on it to verify that the cluster can access Data Lake Storage Gen1. To do so, we will run a sample Hive job that creates a table using the sample data that you uploaded earlier to your storage account with Data Lake Storage Gen1.

In this section, you will SSH into the HDInsight Linux cluster you created and run a sample Hive query.
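If you haven't connected yet, a typical SSH connection uses the cluster's SSH endpoint; substitute the SSH user name you supplied when creating the cluster:

ssh <sshUserName>@<clusterName>-ssh.azurehdinsight.net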

  1. Once connected, start the Hive CLI by using the following command:

    hive
    
  2. Using the CLI, enter the following statements to create a new table named vehicles by using the sample data in Data Lake Storage Gen1:

    DROP TABLE vehicles;
    CREATE EXTERNAL TABLE vehicles (str string) LOCATION 'adl://<mydatalakestoragegen1>.azuredatalakestore.net:443/';
    SELECT * FROM vehicles LIMIT 10;
    

    You should see an output similar to the following:

    1,1,2014-09-14 00:00:03,46.81006,-92.08174,51,S,1
    1,2,2014-09-14 00:00:06,46.81006,-92.08174,13,NE,1
    1,3,2014-09-14 00:00:09,46.81006,-92.08174,48,NE,1
    1,4,2014-09-14 00:00:12,46.81006,-92.08174,30,W,1
    1,5,2014-09-14 00:00:15,46.81006,-92.08174,47,S,1
    1,6,2014-09-14 00:00:18,46.81006,-92.08174,9,S,1
    1,7,2014-09-14 00:00:21,46.81006,-92.08174,53,N,1
    1,8,2014-09-14 00:00:24,46.81006,-92.08174,63,SW,1
    1,9,2014-09-14 00:00:27,46.81006,-92.08174,4,NE,1
    1,10,2014-09-14 00:00:30,46.81006,-92.08174,31,N,1
    

Access Data Lake Storage Gen1 using HDFS commands

Once you have configured the HDInsight cluster to use Data Lake Storage Gen1, you can use HDFS shell commands to access the store.

In this section, you will SSH into the HDInsight Linux cluster you created and run HDFS commands.

Once connected, use the following HDFS filesystem command to list the files in the storage account with Data Lake Storage Gen1.

hdfs dfs -ls adl://<storage account with Data Lake Storage Gen1 name>.azuredatalakestore.net:443/

This should list the content that you uploaded earlier to Data Lake Storage Gen1, with output similar to the following:

15/09/17 21:41:15 INFO web.CaboWebHdfsFileSystem: Replacing original urlConnectionFactory with org.apache.hadoop.hdfs.web.URLConnectionFactory@21a728d6
Found 1 items
-rwxrwxrwx   0 NotSupportYet NotSupportYet     671388 2015-09-16 22:16 adl://mydatalakestoragegen1.azuredatalakestore.net:443/mynewfolder

You can also use the hdfs dfs -put command to upload some files to Data Lake Storage Gen1, and then use hdfs dfs -ls to verify whether the files were successfully uploaded.
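For example, assuming a local file named testFile.txt exists in your SSH session (the file name and account name are placeholders):

hdfs dfs -put testFile.txt adl://<storage account with Data Lake Storage Gen1 name>.azuredatalakestore.net:443/
hdfs dfs -ls adl://<storage account with Data Lake Storage Gen1 name>.azuredatalakestore.net:443/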

See Also