Access Azure Data Lake Storage using Azure Active Directory credential passthrough

Note

This article contains references to the term whitelisted, a term that Azure Databricks no longer uses. When the term is removed from the software, we’ll remove it from this article.

You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log in to Azure Databricks. When you enable Azure Data Lake Storage credential passthrough on a cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.

Azure Data Lake Storage credential passthrough is supported with Azure Data Lake Storage Gen1 and Gen2 only. Azure Blob storage does not support credential passthrough.

This article covers:

  • Enabling credential passthrough for standard and high-concurrency clusters.
  • Configuring credential passthrough and initializing storage resources in ADLS accounts.
  • Accessing ADLS resources directly when credential passthrough is enabled.
  • Accessing ADLS resources through a mount point when credential passthrough is enabled.
  • Supported features and limitations when using credential passthrough.

Notebooks are included to provide examples of using credential passthrough with ADLS Gen1 and ADLS Gen2 storage accounts.

Requirements

Important

You cannot authenticate to Azure Data Lake Storage with your Azure Active Directory credentials if you are behind a firewall that has not been configured to allow traffic to Azure Active Directory. Azure Firewall blocks Active Directory access by default. To allow access, configure the AzureActiveDirectory service tag. You can find equivalent information for network virtual appliances under the AzureActiveDirectory tag in the Azure IP Ranges and Service Tags JSON file. For more information, see Azure Firewall service tags and Azure IP Addresses for Public Cloud.

Logging recommendations

You can log identities passed through to ADLS storage in the Azure storage diagnostic logs. Logging identities allows ADLS requests to be tied to individual users from Azure Databricks clusters. Turn on diagnostic logging on your storage account to start receiving these logs:

  • Azure Data Lake Storage Gen1: Follow the instructions in Enable diagnostic logging for your Data Lake Storage Gen1 account.
  • Azure Data Lake Storage Gen2: Configure using PowerShell with the Set-AzStorageServiceLoggingProperty command. Specify 2.0 as the version, because log entry format 2.0 includes the user principal name in the request.

Enable Azure Data Lake Storage credential passthrough for a high-concurrency cluster

High concurrency clusters can be shared by multiple users. With credential passthrough enabled, they support only Python and SQL.

Important

Enabling Azure Data Lake Storage credential passthrough for a high-concurrency cluster blocks all ports on the cluster except for ports 44, 53, and 80.

  1. When you create a cluster, set the Cluster Mode to High Concurrency.
  2. Under Advanced Options, select Enable credential passthrough and only allow Python and SQL commands.

Enable credential passthrough for High Concurrency clusters

Enable Azure Data Lake Storage credential passthrough for a standard cluster

Standard clusters with credential passthrough are limited to a single user. Standard clusters support Python, SQL, and Scala. On Databricks Runtime 6.0 and above, they also support SparkR.

You must assign a user at cluster creation, but the cluster can be edited by a user with Can Manage permissions at any time to replace the original user.

Important

The user assigned to the cluster must have at least Can Attach To permissions for the cluster in order to run commands on the cluster. Admins and the cluster creator have Can Manage permissions, but cannot run commands on the cluster unless they are the designated cluster user.

  1. When you create a cluster, set the Cluster Mode to Standard.
  2. Under Advanced Options, select Enable credential passthrough for user-level access and select the user name from the Single User Access drop-down.

Enable credential passthrough for Standard clusters

Create a container

Containers provide a way to organize objects in an Azure storage account. See Create a container for details on creating containers in an Azure Databricks notebook or directly in the Azure portal.

Access Azure Data Lake Storage directly using credential passthrough

After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path:

Azure Data Lake Storage Gen1

spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
  • Replace <storage-account-name> with the ADLS Gen1 storage account name.

Azure Data Lake Storage Gen2

spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()
  • Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
  • Replace <storage-account-name> with the ADLS Gen2 storage account name.
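
The two path formats above can be assembled with small helpers. The following is a minimal sketch, not part of any Databricks API: the function names `adl_uri` and `abfss_uri` are hypothetical, and the account, container, and file names are placeholders.

```python
def adl_uri(storage_account: str, path: str = "") -> str:
    """Build an ADLS Gen1 URI (adl:// scheme) for a storage account and path."""
    return f"adl://{storage_account}.azuredatalakestore.net/{path.lstrip('/')}"


def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URI (abfss:// scheme) for a container, account, and path."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# In a notebook attached to a passthrough-enabled cluster, the result can be
# handed to the same reader used above, for example:
#   spark.read.csv(abfss_uri("<container-name>", "<storage-account-name>", "MyData.csv")).collect()
```

Building the URI in one place keeps the Gen1 and Gen2 host suffixes from being mistyped across notebooks.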

Mount Azure Data Lake Storage to DBFS using credential passthrough

You can mount an Azure Data Lake Storage account or a folder inside it to Databricks File System (DBFS). The mount is a pointer to a data lake store, so the data is never synced locally.

When you mount data using a cluster enabled with Azure Data Lake Storage credential passthrough, any read or write to the mount point uses your Azure AD credentials. This mount point will be visible to other users, but the only users that will have read and write access are those who:

  • Have access to the underlying Azure Data Lake Storage account
  • Are using a cluster enabled for Azure Data Lake Storage credential passthrough

Important

Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.

Azure Data Lake Storage Gen1

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following commands:

Note

As of Databricks Runtime 6.0, the dfs.adls. prefix for Azure Data Lake Storage configuration keys has been deprecated in favor of the new fs.adl. prefix. Backward compatibility is maintained, which means you can still use the old prefix. However, there are two caveats when using the old prefix. The first is that even though keys using the old prefix will be correctly propagated, calling spark.conf.get with a key using the new prefix will fail unless set explicitly. The second is that any error message referencing an Azure Data Lake Storage configuration key will always use the new prefix. For Databricks Runtime versions below 6.0, you must always use the old prefix.

Python

configs = {
  "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
  "fs.adl.oauth2.access.token.custom.provider": spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "CustomAccessTokenProvider",
  "fs.adl.oauth2.access.token.custom.provider" -> spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName")
)

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-account-name>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
  • Replace <storage-account-name> with the ADLS Gen1 storage account name.
  • Replace <mount-name> with the name of the intended mount point in DBFS.
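
The prefix note earlier in this section says that keys written with the old dfs.adls. prefix still work on Databricks Runtime 6.0 and above, but that lookups and error messages use the new fs.adl. prefix. One way to avoid mixing the two is to rename old-style keys before using them. This is a minimal sketch for Databricks Runtime 6.0 and above; the helper name `to_new_prefix` is hypothetical and not part of any Databricks API.

```python
OLD_PREFIX = "dfs.adls."
NEW_PREFIX = "fs.adl."


def to_new_prefix(configs: dict) -> dict:
    """Return a copy of configs with dfs.adls.* keys renamed to fs.adl.*."""
    renamed = {}
    for key, value in configs.items():
        if key.startswith(OLD_PREFIX):
            # Swap only the prefix; the rest of the key is unchanged.
            key = NEW_PREFIX + key[len(OLD_PREFIX):]
        renamed[key] = value
    return renamed


legacy = {"dfs.adls.oauth2.access.token.provider.type": "CustomAccessTokenProvider"}
print(to_new_prefix(legacy))
# {'fs.adl.oauth2.access.token.provider.type': 'CustomAccessTokenProvider'}
```

On Databricks Runtime versions below 6.0 the old prefix is required, so this renaming should not be applied there.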

Azure Data Lake Storage Gen2

To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, use the following commands:

Python

configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.azure.account.auth.type" -> "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class" -> spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
)

dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
  • Replace <container-name> with the name of a container in the ADLS Gen2 storage account.
  • Replace <storage-account-name> with the ADLS Gen2 storage account name.
  • Replace <mount-name> with the name of the intended mount point in DBFS.
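
Because mount points are visible to other users and unmounting during running jobs can cause errors, it can help to create a mount only when it does not already exist, rather than unmounting and remounting. The following is a minimal sketch under that assumption. The helper name `mount_if_absent` is hypothetical; it relies only on `dbutils.fs.mounts()` and `dbutils.fs.mount()`, and takes `dbutils` as a parameter so it can be exercised outside a notebook.

```python
def mount_if_absent(dbutils, source: str, mount_point: str, extra_configs: dict) -> bool:
    """Mount source at mount_point unless that mount point already exists.

    Returns True if a new mount was created, False if it was already present.
    """
    # dbutils.fs.mounts() lists current mounts; each entry exposes mountPoint.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True


# In a notebook, with configs defined as in the Gen2 Python example above:
#   mount_if_absent(
#       dbutils,
#       "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
#       "/mnt/<mount-name>",
#       configs)
```

The check compares mount point names only; it does not verify that an existing mount points at the same source.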

Warning

Do not provide your storage account access keys or service principal credentials to authenticate to the mount point. That would give other users access to the filesystem using those credentials. The purpose of Azure Data Lake Storage credential passthrough is to prevent you from having to use those credentials and to ensure that access to the filesystem is restricted to users who have access to the underlying Azure Data Lake Storage account.

Security

It is safe to share Azure Data Lake Storage credential passthrough clusters with other users. You will be isolated from each other and will not be able to read or use each other’s credentials.

Supported features

Feature | Minimum Databricks Runtime version | Notes
Python and SQL | 5.5 |
Azure Data Lake Storage Gen1 | 5.5 |
%run | 5.5 |
DBFS | 5.5 | Credentials are passed through only if the DBFS path resolves to a location in Azure Data Lake Storage Gen1 or Gen2. For DBFS paths that resolve to other storage systems, use a different method to specify your credentials.
Azure Data Lake Storage Gen2 | 5.5 |
Delta caching | 5.5 |
PySpark ML API | 5.5 | The ML classes listed after this table are not supported.
Broadcast variables | 5.5 | Within PySpark, there is a limit on the size of the Python UDFs you can construct, since large UDFs are sent as broadcast variables.
Notebook-scoped libraries | 5.5 |
Scala | 5.5 |
SparkR | 6.0 |
Notebook workflows | 6.1 |
PySpark ML API | 6.1 | All PySpark ML classes supported.
Ganglia UI | 6.1 |
Databricks Connect | 7.3 | Passthrough is supported on Standard clusters.

ML classes not supported by the PySpark ML API before Databricks Runtime 6.1:

  • org/apache/spark/ml/classification/RandomForestClassifier
  • org/apache/spark/ml/clustering/BisectingKMeans
  • org/apache/spark/ml/clustering/GaussianMixture
  • org/apache/spark/ml/clustering/KMeans
  • org/apache/spark/ml/clustering/LDA
  • org/apache/spark/ml/evaluation/ClusteringEvaluator
  • org/apache/spark/ml/feature/HashingTF
  • org/apache/spark/ml/feature/OneHotEncoder
  • org/apache/spark/ml/feature/StopWordsRemover
  • org/apache/spark/ml/feature/VectorIndexer
  • org/apache/spark/ml/feature/VectorSizeHint
  • org/apache/spark/ml/regression/IsotonicRegression
  • org/apache/spark/ml/regression/RandomForestRegressor
  • org/apache/spark/ml/util/DatasetUtils
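
The minimum versions above can be encoded as a lookup to check whether a feature is available on a given runtime. This is a minimal sketch: the dictionary is copied from the table, the two PySpark ML API rows are given distinct hypothetical labels, and the helper names are illustrative rather than part of any Databricks API.

```python
# Minimum Databricks Runtime (major, minor) per feature, from the table above.
MIN_RUNTIME = {
    "Python and SQL": (5, 5),
    "Azure Data Lake Storage Gen1": (5, 5),
    "%run": (5, 5),
    "DBFS": (5, 5),
    "Azure Data Lake Storage Gen2": (5, 5),
    "Delta caching": (5, 5),
    "PySpark ML API (partial)": (5, 5),
    "Broadcast variables": (5, 5),
    "Notebook-scoped libraries": (5, 5),
    "Scala": (5, 5),
    "SparkR": (6, 0),
    "Notebook workflows": (6, 1),
    "PySpark ML API (all classes)": (6, 1),
    "Ganglia UI": (6, 1),
    "Databricks Connect": (7, 3),
}


def parse_version(version: str) -> tuple:
    """Turn a version string such as '6.4' into a comparable tuple (6, 4)."""
    return tuple(int(part) for part in version.split("."))


def is_supported(feature: str, runtime_version: str) -> bool:
    """Return True if runtime_version meets the feature's minimum version."""
    return parse_version(runtime_version) >= MIN_RUNTIME[feature]


print(is_supported("SparkR", "6.4"))              # True
print(is_supported("Notebook workflows", "6.0"))  # False
```

Comparing tuples rather than strings avoids ordering mistakes such as treating "10.4" as lower than "6.1".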

Limitations

The following features are not supported with Azure Data Lake Storage credential passthrough:

  • %fs (use the equivalent dbutils.fs command instead).
  • Databricks Jobs.
  • The Databricks REST API.
  • Table access control. The permissions granted by Azure Data Lake Storage credential passthrough could be used to bypass the fine-grained permissions of table ACLs, while the extra restrictions of table ACLs will constrain some of the benefits you get from credential passthrough. In particular:
    • If you have Azure AD permission to access the data files that underlie a particular table, you will have full permissions on that table via the RDD API, regardless of the restrictions placed on it via table ACLs.
    • You will be constrained by table ACL permissions only when using the DataFrame API. You will see warnings about not having permission SELECT on any file if you try to read files directly with the DataFrame API, even though you could read those files directly via the RDD API.
    • You will be unable to read from tables backed by filesystems other than Azure Data Lake Storage, even if you have table ACL permission to read the tables.
  • The following methods on SparkContext (sc) and SparkSession (spark) objects:
    • Deprecated methods.
    • Methods such as addFile() and addJar() that would allow non-admin users to call Scala code.
    • Any method that accesses a filesystem other than Azure Data Lake Storage Gen1 or Gen2 (to access other filesystems on a cluster with Azure Data Lake Storage credential passthrough enabled, use a different method to specify your credentials and see the section on trusted filesystems under Troubleshooting).
    • The old Hadoop APIs (hadoopFile() and hadoopRDD()).
    • Streaming APIs, since the passed-through credentials would expire while the stream was still running.
  • The FUSE mount (/dbfs) is available only in Databricks Runtime 7.3 LTS and above. Mount points with credential passthrough configured are not supported through the FUSE mount.
  • Azure Data Factory.
  • MLflow on high concurrency clusters.
  • azureml-sdk[databricks] Python package on high concurrency clusters.
  • You cannot extend the lifetime of Azure Active Directory passthrough tokens using Azure Active Directory token lifetime policies. As a consequence, if you send a command to the cluster that takes longer than an hour, it will fail if an Azure Data Lake Storage resource is accessed after the one-hour mark.
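
Because of the one-hour token limit in the last item, a long-running command can benefit from checking its own elapsed time before each storage access, so that it stops cleanly instead of failing mid-operation. The following is a minimal sketch, assuming the hour is measured from the start of the command. The class name `CommandTimeBudget` and the five-minute safety margin are illustrative choices, not Databricks settings.

```python
import time

TOKEN_LIFETIME_SECONDS = 60 * 60  # one hour, per the limitation above


class CommandTimeBudget:
    """Tracks time elapsed since creation against the token lifetime."""

    def __init__(self, safety_margin_seconds: float = 300, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        # Stop early by the safety margin rather than at the exact limit.
        self._limit = TOKEN_LIFETIME_SECONDS - safety_margin_seconds

    def remaining(self) -> float:
        """Seconds left before the budget (lifetime minus margin) runs out."""
        return self._limit - (self._clock() - self._start)

    def check(self) -> None:
        """Raise TimeoutError if the budget is exhausted."""
        if self.remaining() <= 0:
            raise TimeoutError(
                "Passthrough token is likely expired; continue in a new command."
            )


# Usage in a notebook cell: create the budget first, then call
# budget.check() before each read or write against Azure Data Lake Storage.
budget = CommandTimeBudget()
budget.check()
```

Splitting the work across several shorter commands is the simpler option when the processing can be divided that way.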

Example notebooks

The following notebooks demonstrate Azure Data Lake Storage credential passthrough for Azure Data Lake Storage Gen1 and Gen2.

Azure Data Lake Storage Gen1 passthrough notebook

Azure Data Lake Storage Gen2 passthrough notebook

Troubleshooting

py4j.security.Py4JSecurityException: … is not whitelisted

This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials.

org.apache.spark.api.python.PythonSecurityException: Path … uses an untrusted filesystem

This exception is thrown when you have tried to access a filesystem that is not known by the Azure Data Lake Storage credential passthrough cluster to be safe. Using an untrusted filesystem might allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials, so we disallow all filesystems that we are not confident are being used safely.

To configure the set of trusted filesystems on an Azure Data Lake Storage credential passthrough cluster, set the Spark conf key spark.databricks.pyspark.trustedFilesystems on that cluster to a comma-separated list of the class names that are trusted implementations of org.apache.hadoop.fs.FileSystem.
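
The value for that key can be assembled from a list of class names. The following is a minimal sketch: the helper name `trusted_filesystems_conf` is hypothetical, `com.company.CustomFileSystem` is a placeholder class name, and the "key value" line format assumes the configuration is entered one setting per line in the cluster's Spark config.

```python
CONF_KEY = "spark.databricks.pyspark.trustedFilesystems"


def trusted_filesystems_conf(class_names) -> str:
    """Format a Spark config line listing trusted FileSystem class names."""
    # Drop empty entries and stray whitespace so the list stays well-formed.
    cleaned = [name.strip() for name in class_names if name.strip()]
    return f"{CONF_KEY} {','.join(cleaned)}"


print(trusted_filesystems_conf([
    "org.apache.hadoop.fs.LocalFileSystem",
    "com.company.CustomFileSystem",  # placeholder for your own implementation
]))
# spark.databricks.pyspark.trustedFilesystems org.apache.hadoop.fs.LocalFileSystem,com.company.CustomFileSystem
```

Only add filesystem classes you trust, since an untrusted filesystem might expose another user's credentials.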