Get started with Azure Data Lake Storage Gen2

You can easily authenticate and access Azure Data Lake Storage Gen2 (ADLS Gen2) storage accounts using an Azure storage account access key.

Using an access key is less secure than using a service principal or credential passthrough but can be convenient for non-production scenarios such as developing or testing notebooks.

Although you can use an access key directly from your Azure Databricks workspace, storing the key in a secret scope provides an additional security layer. Secret scopes provide secure storage and management of secrets and allow you to use the access key for authentication without including it directly in your Azure Databricks workspace.

This article explains how to obtain an Azure storage account access key, save that access key in an Azure Key Vault-backed secret scope, and use it to access an ADLS Gen2 storage account from an Azure Databricks notebook. The following is an overview of the tasks to configure an access key and use it to access an ADLS Gen2 storage account:

  1. Obtain an access key from the Azure storage account.
  2. Create an Azure key vault.
  3. Add the Azure access key to the Azure key vault.
  4. Create a secret scope in your Azure Databricks workspace backed by the Azure key vault.
  5. Use the access key from the secret scope to authenticate to the storage account.

This article describes using an Azure Key Vault-backed secret scope, but you can also use a Databricks-backed secret scope to store the access key.

Requirements

Get an Azure ADLS access key

You obtain an access key for the ADLS Gen2 storage account using the Azure portal:

  1. Go to your ADLS Gen2 storage account in the Azure portal.

  2. Under Settings, select Access keys.

  3. Copy the value for one of the available access keys.

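You can also retrieve an access key with the Azure CLI instead of the portal. The following is a minimal sketch, assuming the Azure CLI is installed and you are signed in; <resource-group-name> and <storage-account-name> are placeholders for your own resource group and storage account:

az storage account keys list \
--resource-group <resource-group-name> \
--account-name <storage-account-name> \
--query "[0].value" \
--output tsv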

Create an Azure key vault and secret scope

To create the Azure key vault and an Azure Databricks secret scope backed by that key vault:

  1. Create an Azure Key Vault instance in the Azure portal.
  2. Create the Azure Databricks secret scope backed by the Azure Key Vault instance.

Step 1: Create an Azure Key Vault instance

  1. In the Azure portal, select Key vaults > + Add and give your key vault a name.

  2. Click Review + create.

  3. After validation completes, click Create.

  4. After creating the key vault, go to the Properties page for the new key vault.

  5. Copy and save the Vault URI and Resource ID.

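You can also create the key vault and read these two values with the Azure CLI. The following is a sketch, where <key-vault-name>, <resource-group-name>, and <location> are placeholders for your own values:

az keyvault create \
--name <key-vault-name> \
--resource-group <resource-group-name> \
--location <location>

az keyvault show \
--name <key-vault-name> \
--query "{vaultUri: properties.vaultUri, resourceId: id}"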

Step 2: Create an Azure Key Vault-backed secret scope

Azure Databricks resources can reference secrets stored in an Azure key vault by creating a Key Vault-backed secret scope. You can use the Azure Databricks UI, the Databricks Secrets CLI, or the Databricks Secrets API to create the Azure Key Vault-backed secret scope. This article describes using the UI and CLI.

Create the Azure Databricks secret scope in the Azure Databricks UI
  1. Go to the Azure Databricks Create Secret Scope page at https://<per-workspace-url>/#secrets/createScope. Replace <per-workspace-url> with the unique per-workspace URL for your Azure Databricks workspace.

  2. Enter a Scope Name.

  3. Enter the Vault URI and Resource ID values for the Azure key vault you created in Step 1: Create an Azure Key Vault instance.

  4. Click Create.

    Create secret scope

Create the Azure Databricks secret scope in the CLI

To create a secret scope backed by the Azure key vault using the Databricks CLI, open a terminal and run the following command:

databricks secrets create-scope \
--scope <scope-name> \
--scope-backend-type AZURE_KEYVAULT \
--resource-id <azure-keyvault-resource-id> \
--dns-name <azure-keyvault-dns-name>

Replace

  • <scope-name> with a name for the new scope.
  • <azure-keyvault-resource-id> with the key vault Resource ID.
  • <azure-keyvault-dns-name> with the Vault URI.

An example using the values from Step 1: Create an Azure Key Vault instance:

databricks secrets create-scope \
--scope example-akv-scope \
--scope-backend-type AZURE_KEYVAULT \
--resource-id /subscriptions/… \
--dns-name https://example-akv.vault.azure.net/
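To confirm that the new scope exists and is backed by the key vault, you can list the secret scopes in your workspace:

databricks secrets list-scopes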

Save the storage account access key in the Azure key vault

  1. In the Azure portal, go to the Key vaults service.

  2. Select the key vault created in Step 1: Create an Azure Key Vault instance.

  3. Under Settings > Secrets, click Generate/Import.

  4. Select the Manual upload option, enter a name for the secret (for example, example-adls2-secret), and enter the storage account access key in the Value field.

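You can also add the secret with the Azure CLI instead of the portal. The following sketch stores the access key under the name example-adls2-secret, the secret name used in the examples below:

az keyvault secret set \
--vault-name <key-vault-name> \
--name example-adls2-secret \
--value "<storage-account-access-key>"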

Use the Secrets CLI to verify the secret was created successfully:

databricks secrets list --scope example-akv-scope
Key name               Last updated
--------------------   --------------
example-adls2-secret   1605122724000
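You can also confirm from a notebook that the secret resolves through the scope. Azure Databricks redacts secret values in notebook output, so do not expect to see the key itself:

access_key = dbutils.secrets.get(scope="example-akv-scope", key="example-adls2-secret")
print(len(access_key))  # The key length is visible; the value itself is redacted if displayed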

Authenticate with the access key

The way you set credentials for authentication depends on whether you plan to use the DataFrame or Dataset API, or the RDD API.

DataFrame or Dataset API

If you are using Spark DataFrame or Dataset APIs, Databricks recommends that you set your account credentials in your notebook’s session configs:

spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>",key="<storage-account-access-key-name>"))

Replace

  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <scope-name> with the Azure Databricks secret scope name.
  • <storage-account-access-key-name> with the name of the key containing the Azure storage account access key.
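For example, with the example-akv-scope scope and example-adls2-secret secret created earlier, and a hypothetical storage account named examplestorageacct, the session config looks like this:

spark.conf.set(
    "fs.azure.account.key.examplestorageacct.dfs.core.windows.net",  # examplestorageacct is a hypothetical account name
    dbutils.secrets.get(scope="example-akv-scope", key="example-adls2-secret"))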

RDD API

If you’re using the RDD API to access ADLS Gen2, you cannot access Hadoop configuration options set using spark.conf.set(). You must set the credentials using one of the following methods:

  • Specify the Hadoop configuration options as Spark configs when you create the cluster. You must add the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations for your RDD jobs:

    spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net <storage-account-access-key>
    

    Replace

    • <storage-account-name> with the ADLS Gen2 storage account name.
    • <storage-account-access-key> with the access key you retrieved in Get an Azure ADLS access key.

    Warning

    These credentials are available to all users who access the cluster.

  • Scala users can set the credentials in spark.sparkContext.hadoopConfiguration:

    spark.sparkContext.hadoopConfiguration.set(
      "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
      dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>")
    )
    

Replace

  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <scope-name> with the Azure Databricks secret scope name.
  • <storage-account-access-key-name> with the name of the key containing the Azure storage account access key.
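With the credentials in place (for example, through the cluster-level Spark config shown above), the RDD API can read from the account. The following is a brief Python sketch, assuming a text file already exists at a hypothetical path in the container:

rdd = spark.sparkContext.textFile("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>/data.txt")
print(rdd.count())  # Count the lines in the file; data.txt is a hypothetical file name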

Create a container

Like directories in a filesystem, containers provide a way to organize objects in an Azure storage account. You must create one or more containers before you can access an ADLS Gen2 storage account.

You can create a container directly from an Azure Databricks notebook by running the following commands. Remove the first statement if you’ve already followed the instructions in Authenticate with the access key.

spark.conf.set(
   "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
   dbutils.secrets.get(scope="<scope-name>",
   key="<storage-account-access-key-name>"))
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Replace

  • <storage-account-name> with the ADLS Gen2 storage account name.
  • <scope-name> with the Azure Databricks secret scope name.
  • <storage-account-access-key-name> with the name of the key containing the Azure storage account access key.
  • <container-name> with the name for the new container.

You can also create a container through the Azure command-line interface, the Azure API, or the Azure portal. To create a container in the portal:

  1. In the Azure portal, go to Storage accounts.

  2. Select your ADLS Gen2 account and click Containers.

  3. Click + Container.

  4. Enter a name for your container and click Create.

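You can also create the container with the Azure CLI, authenticating with the access key you obtained earlier:

az storage container create \
--name <container-name> \
--account-name <storage-account-name> \
--account-key <storage-account-access-key>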

Access ADLS Gen2 storage

After authenticating to the ADLS Gen2 storage account, you can use standard Spark and Databricks APIs to read from the account:


val df = spark.read.parquet("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Example notebook

This notebook demonstrates using a storage account access key to:

  1. Authenticate to an ADLS Gen2 storage account.
  2. Create a new container in the storage account.
  3. Write a JSON file containing internet of things (IoT) data to the new container.
  4. List files in the container.
  5. Read and display the IoT file from the container.
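If you prefer not to import the notebook, the following Python sketch outlines the same sequence of steps. The storage account, scope, secret, and container names are placeholders, and the IoT record is sample data:

from pyspark.sql import Row

# 1. Authenticate with the access key stored in the secret scope.
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"))

# 2. Create a new container by listing it with remote file system creation enabled.
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
base = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net"
dbutils.fs.ls(base + "/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

# 3. Write a JSON file containing sample IoT data to the new container.
iot = [Row(device_id="device-1", temperature=21.5)]
spark.createDataFrame(iot).write.mode("overwrite").json(base + "/iot-data")

# 4. List files in the container.
display(dbutils.fs.ls(base))

# 5. Read and display the IoT data from the container.
display(spark.read.json(base + "/iot-data"))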

Getting started with ADLS Gen2 notebook
