Add additional storage accounts to HDInsight

Learn how to use script actions to add Azure storage accounts to HDInsight. The steps in this document add a storage account to an existing Linux-based HDInsight cluster.

Important

This document covers adding storage accounts to a cluster after it has been created. For information on adding storage accounts during cluster creation, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.

How it works

This script takes the following parameters:

  • Azure storage account name: The name of the storage account to add to the HDInsight cluster. After running the script, HDInsight can read and write data stored in this storage account.

  • Azure storage account key: A key that grants access to the storage account.

  • -p (optional): If specified, the key is not encrypted and is stored in the core-site.xml file as plain text.

During processing, the script performs the following actions:

  • Checks whether the storage account already exists in the core-site.xml configuration for the cluster. If it does, the script exits and performs no further actions.

  • Verifies that the storage account exists and can be accessed using the key.

  • Encrypts the key using the cluster credential, unless the -p parameter is specified.

  • Adds the storage account to the core-site.xml file.

  • Stops and restarts the Oozie, YARN, MapReduce2, and HDFS services. Stopping and starting these services allows them to use the new storage account.
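The first check in the list above is what makes the script safe to rerun but unable to update an existing entry. The following is a minimal sketch of that kind of check, not the contents of add-storage-account-v01.sh. It assumes the configuration is available as a local core-site.xml file in the standard Hadoop property format; the function names are hypothetical.

```shell
# Hypothetical helper, not taken from add-storage-account-v01.sh.
# Returns 0 if the core-site.xml file ($1) already has a key entry
# for the storage account ($2).
account_already_configured() {
    grep -qF "<name>fs.azure.account.key.$2.blob.core.windows.net</name>" "$1"
}

# Illustrative control flow: report "skip" when the account is already
# present, otherwise "add".
plan_for_account() {
    if account_already_configured "$1" "$2"; then
        echo "skip"
    else
        echo "add"
    fi
}
```

Because the decision depends only on whether an entry exists, a changed key for an account that is already listed still results in "skip".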

Warning

Using a storage account in a different region than the HDInsight cluster is not supported.

The script

Script location: https://hdiconfigactions.blob.core.windows.net/linuxaddstorageaccountv01/add-storage-account-v01.sh

Requirements:

  • The script must be applied on the Head nodes.

To use the script

This script can be used from the Azure portal, Azure PowerShell, or the Azure CLI 1.0. For more information, see the Customize Linux-based HDInsight clusters using script action document.

Important

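When using the steps provided in the customization document, apply this script with the script location listed in the previous section, run it on the head nodes, and pass the storage account name and key as parameters.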
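As one possible way to apply the script from a command line, the sketch below prints an az hdinsight script-action execute command for review rather than running it. It assumes the current Azure CLI (az) rather than the Azure CLI 1.0 mentioned above, and it assumes the script parameters are passed positionally as account name followed by key. The resource group, cluster name, script action name (add-storage-account), and function name are placeholders.

```shell
# The script URI comes from this document; every other value is a placeholder.
SCRIPT_URI="https://hdiconfigactions.blob.core.windows.net/linuxaddstorageaccountv01/add-storage-account-v01.sh"

# Print the az command instead of running it, so it can be reviewed first.
# Arguments: resource group, cluster name, storage account name, storage key.
# Note: the printed command contains the storage key in plain text.
print_add_storage_command() {
    printf 'az hdinsight script-action execute'
    printf ' --resource-group %s --cluster-name %s' "$1" "$2"
    printf ' --name add-storage-account --roles headnode'
    printf ' --script-uri %s' "$SCRIPT_URI"
    printf " --script-parameters '%s %s'\n" "$3" "$4"
}

print_add_storage_command MY_RESOURCE_GROUP CLUSTERNAME STORAGEACCOUNTNAME STORAGEKEY
```

The --roles headnode value reflects the requirement that the script be applied on the head nodes.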

Known issues

Storage accounts not displayed in Azure portal or tools

When viewing the HDInsight cluster in the Azure portal, selecting the Storage Accounts entry under Properties does not display storage accounts added through this script action. Azure PowerShell and Azure CLI do not display the additional storage account either.

The storage information isn't displayed because the script only modifies the core-site.xml configuration for the cluster. This information is not used when retrieving the cluster information using Azure management APIs.

To view storage account information added to the cluster using this script, use the Ambari REST API. The following examples retrieve this information for your cluster, first with PowerShell and then with curl:

# Prompt for the cluster login (admin) credentials.
$creds = Get-Credential -UserName "admin" -Message "Enter the cluster login credentials"
# Query the HDFS service configuration through the Ambari REST API.
$resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/configurations/service_config_versions?service_name=HDFS&service_config_version=1" `
    -Credential $creds
# Parse the JSON response and read the key entry for the storage account.
$respObj = ConvertFrom-Json $resp.Content
$respObj.items.configurations.properties."fs.azure.account.key.$storageAccountName.blob.core.windows.net"

Note

Set $clusterName to the name of the HDInsight cluster. Set $storageAccountName to the name of the storage account. When prompted, enter the cluster login (admin) and password.

curl -u admin:PASSWORD -G "https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME/configurations/service_config_versions?service_name=HDFS&service_config_version=1" | jq '.items[].configurations[].properties["fs.azure.account.key.STORAGEACCOUNTNAME.blob.core.windows.net"] | select(. != null)'

Note

Replace PASSWORD with the cluster login (admin) account password. Replace CLUSTERNAME with the name of the HDInsight cluster. Replace STORAGEACCOUNTNAME with the name of the storage account added using the script action.

This example uses curl (http://curl.haxx.se/) and jq (https://stedolan.github.io/jq/) to retrieve and parse JSON data.

Information returned from this command appears similar to the following text:

"MIIB+gYJKoZIhvcNAQcDoIIB6zCCAecCAQAxggFaMIIBVgIBADA+MCoxKDAmBgNVBAMTH2RiZW5jcnlwdGlvbi5henVyZWhkaW5zaWdodC5uZXQCEA6GDZMW1oiESKFHFOOEgjcwDQYJKoZIhvcNAQEBBQAEggEATIuO8MJ45KEQAYBQld7WaRkJOWqaCLwFub9zNpscrquA2f3o0emy9Vr6vu5cD3GTt7PmaAF0pvssbKVMf/Z8yRpHmeezSco2y7e9Qd7xJKRLYtRHm80fsjiBHSW9CYkQwxHaOqdR7DBhZyhnj+DHhODsIO2FGM8MxWk4fgBRVO6CZ5eTmZ6KVR8wYbFLi8YZXb7GkUEeSn2PsjrKGiQjtpXw1RAyanCagr5vlg8CicZg1HuhCHWf/RYFWM3EBbVz+uFZPR3BqTgbvBhWYXRJaISwssvxotppe0ikevnEgaBYrflB2P+PVrwPTZ7f36HQcn4ifY1WRJQ4qRaUxdYEfzCBgwYJKoZIhvcNAQcBMBQGCCqGSIb3DQMHBAhRdscgRV3wmYBg3j/T1aEnO3wLWCRpgZa16MWqmfQPuansKHjLwbZjTpeirqUAQpZVyXdK/w4gKlK+t1heNsNo1Wwqu+Y47bSAX1k9Ud7+Ed2oETDI7724IJ213YeGxvu4Ngcf2eHW+FRK"

This text is an example of an encrypted key, which is used to access the storage account.

Unable to access storage after changing key

If you change the key for a storage account, HDInsight can no longer access the storage account. HDInsight uses a cached copy of the key in the core-site.xml file for the cluster. This cached copy must be updated to match the new key.

Running the script action again does not update the key, as the script checks to see if an entry for the storage account already exists. If an entry already exists, it does not make any changes.

To work around this problem, you must remove the existing entry for the storage account. Use the following steps to remove the existing entry:

  1. In a web browser, open the Ambari Web UI for your HDInsight cluster. The URI is https://CLUSTERNAME.azurehdinsight.net. Replace CLUSTERNAME with the name of your cluster.

    When prompted, enter the HTTP login user and password for your cluster.

  2. From the list of services on the left of the page, select HDFS. Then select the Configs tab in the center of the page.

  3. In the Filter... field, enter a value of fs.azure.account. This returns entries for any additional storage accounts that have been added to the cluster. There are two types of entries: keyprovider and key. Both contain the name of the storage account as part of the key name.

    The following are example entries for a storage account named mystorage:

     fs.azure.account.keyprovider.mystorage.blob.core.windows.net
     fs.azure.account.key.mystorage.blob.core.windows.net
    
  4. After you have identified the keys for the storage account you need to remove, use the red '-' icon to the right of the entry to delete it. Then use the Save button to save your changes.

  5. After changes have been saved, use the script action to add the storage account and new key value to the cluster.
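The entry names from step 3 follow a fixed pattern, so they can be generated for any storage account before searching in Ambari. The following is a small sketch of that pattern; the function name is hypothetical.

```shell
# Print the two core-site.xml entry names for a storage account ($1):
# the keyprovider entry first, then the key entry.
storage_property_names() {
    for kind in keyprovider key; do
        echo "fs.azure.account.$kind.$1.blob.core.windows.net"
    done
}

storage_property_names mystorage
```

For mystorage, this prints the same two entries shown in step 3.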

Poor performance

If the storage account is in a different region than the HDInsight cluster, you may experience poor performance. Accessing data in a different region sends network traffic outside the regional Azure data center and across the public internet, which can introduce latency.

Warning

Using a storage account in a different region than the HDInsight cluster is not supported.
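Before adding a storage account, it can help to confirm that it is in the same region as the cluster. The sketch below compares two location strings after normalizing them, on the assumption that a region may be reported either as a display name ("East US") or as a short name ("eastus"). The function names are hypothetical, and the commented az commands assume a logged-in current Azure CLI with placeholder resource names.

```shell
# Normalize an Azure location so "East US" and "eastus" compare equal:
# lowercase the text and remove spaces.
normalize_location() {
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' '
}

# Return 0 when both locations name the same region.
same_region() {
    [ "$(normalize_location "$1")" = "$(normalize_location "$2")" ]
}

# Example wiring (requires a logged-in az CLI; names are placeholders):
#   cluster_loc=$(az hdinsight show --name CLUSTERNAME --resource-group MY_RESOURCE_GROUP --query location --output tsv)
#   storage_loc=$(az storage account show --name STORAGEACCOUNTNAME --query location --output tsv)
#   same_region "$cluster_loc" "$storage_loc" || echo "Warning: regions differ" >&2
```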

Additional charges

If the storage account is in a different region than the HDInsight cluster, you may notice additional egress charges on your Azure billing. An egress charge is applied when data leaves a regional data center. This charge is applied even if the traffic is destined for another Azure data center in a different region.

Warning

Using a storage account in a different region than the HDInsight cluster is not supported.

Next steps

You have learned how to add additional storage accounts to an existing HDInsight cluster. For more information on script actions, see Customize Linux-based HDInsight clusters using script action.