Access Azure Data Lake Storage Gen1 from Azure Databricks

Article
03/01/2024

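Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also known as ADLS) and recommends that all users migrate to Azure Data Lake Storage Gen2. Databricks recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.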

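There are two ways to access Azure Data Lake Storage Gen1:

- Pass your Microsoft Entra ID (formerly Azure Active Directory) credentials, also known as credential passthrough.
- Use a service principal directly.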

Access automatically with your Microsoft Entra ID (formerly Azure Active Directory) credentials

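You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Microsoft Entra ID identity that you use to log in to Azure Databricks. When you enable Microsoft Entra ID credential passthrough for a cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.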

For complete setup and usage instructions, see Access Azure Data Lake Storage using Microsoft Entra ID (formerly Azure Active Directory) credential passthrough (legacy).

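Create and grant permissions to a service principal

If your selected access method requires a service principal with adequate permissions and you do not have one, follow these steps:

Create a Microsoft Entra ID (formerly Azure Active Directory) application and service principal that can access resources. Note the following properties:
- application-id: An ID that uniquely identifies the client application.
- directory-id: An ID that uniquely identifies the Microsoft Entra ID instance.
- service-credential: A string that the application uses to prove its identity.
Register the service principal and grant it the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.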

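Access directly with Spark APIs using a service principal and OAuth 2.0

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook: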

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

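where

dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential (the service principal's client secret) that has been stored as a secret in a secret scope.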
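The four settings above differ only in the application ID, directory ID, and credential, so they can be assembled in one place. The following is a minimal sketch, not an official API: the helper names build_adl_oauth_configs and apply_to_session are hypothetical, and the configuration keys and refresh URL format are taken from the snippet above.

```python
def build_adl_oauth_configs(application_id: str, directory_id: str, credential: str) -> dict:
    """Return the four fs.adl.oauth2.* settings as a dict (hypothetical helper)."""
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": application_id,
        "fs.adl.oauth2.credential": credential,
        # The refresh URL embeds the directory (tenant) ID.
        "fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_to_session(spark, configs: dict) -> None:
    """Apply each setting to the given Spark session via spark.conf.set."""
    for key, value in configs.items():
        spark.conf.set(key, value)
```

In a notebook, this might be called as apply_to_session(spark, build_adl_oauth_configs("<application-id>", "<directory-id>", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))), so the credential never appears in the notebook source.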

After you have set up your credentials, you can use standard Spark and Databricks APIs to access the resources. For example:

val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

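Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access to the directories that you want to read from as well as to the Azure Data Lake Storage Gen1 resource.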
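Because access is granted per directory, it can help to build the adl:// URI for each directory in a consistent way. This is a small sketch under the assumption that the URI follows the adl://<storage-resource>.azuredatalakestore.net/<directory-name> form shown above; the function name adl_uri is hypothetical.

```python
def adl_uri(storage_resource: str, directory_name: str = "") -> str:
    """Build an adl:// URI in the form used by the examples (hypothetical helper)."""
    base = f"adl://{storage_resource}.azuredatalakestore.net"
    # Strip leading and trailing slashes so the path joins cleanly to the host.
    path = directory_name.strip("/")
    return f"{base}/{path}" if path else f"{base}/"
```

The result can be passed to spark.read.format(...).load(...) or dbutils.fs.ls(...) in place of a hand-written path.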

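Access through metastore

To access adl:// locations specified in the metastore, you must specify the Hadoop credential configuration options as Spark options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys so that they propagate to the Hadoop configurations used by the metastore: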

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token

Warning

These credentials are available to all users who access the cluster.
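The cluster-level settings are the notebook-level Hadoop keys with a spark.hadoop. prefix, written one per line as "<key> <value>". The sketch below illustrates that mapping; the function name to_cluster_spark_config is hypothetical, and the placeholder values are the ones used in this article.

```python
def to_cluster_spark_config(hadoop_configs: dict) -> str:
    """Render Hadoop settings as cluster Spark config lines (hypothetical helper)."""
    lines = []
    for key, value in hadoop_configs.items():
        # The spark.hadoop. prefix propagates the setting to the Hadoop configuration.
        lines.append(f"spark.hadoop.{key} {value}")
    return "\n".join(lines)


hadoop_configs = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<application-id>",
    "fs.adl.oauth2.credential": "<service-credential>",
    "fs.adl.oauth2.refresh.url":
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}
print(to_cluster_spark_config(hadoop_configs))
```

Databricks also documents a {{secrets/<scope-name>/<key-name>}} reference syntax for Spark configuration values. Whether it is suitable for the credential setting here is an assumption to verify against the secrets documentation before relying on it.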

Mount Azure Data Lake Storage Gen1 resource or folder

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following commands:

Python

configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
          "fs.adl.oauth2.client.id": "<application-id>",
          "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
          "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<application-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

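where

/mnt/<mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account, or a folder inside it (specified in source), will be mounted in DBFS.
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential (the service principal's client secret) that has been stored as a secret in a secret scope.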
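Mounting a path that is already mounted raises an error, so a notebook that may be rerun can check the existing mounts first. The following is a minimal sketch, assuming dbutils.fs.mounts() returns entries with a mountPoint attribute; the wrapper name mount_if_absent is hypothetical, and dbutils is passed in as a parameter so the function can also be exercised outside a notebook.

```python
def mount_if_absent(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount source at mount_point unless that mount point is already in use.

    Returns True if a new mount was created, False otherwise (hypothetical helper).
    """
    # Collect the DBFS paths that already have something mounted on them.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True
```

In a notebook, this might be called as mount_if_absent(dbutils, "adl://<storage-resource>.azuredatalakestore.net/<directory-name>", "/mnt/<mount-name>", configs). Note that it does not verify that an existing mount points at the same source.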

Access files under the mount point as if they were local files, for example:

Python

df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Scala

val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

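Set up service credentials for multiple accounts

You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within a single Spark session by adding account.<account-name> to the configuration keys. For example, to set up credentials for accessing both adl://example1.azuredatalakestore.net and adl://example2.azuredatalakestore.net, you can do so as follows: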

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")

This also works for the cluster Spark configuration:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
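The per-account keys follow a regular pattern, fs.adl.account.<account-name>.oauth2.*, so they can be generated from a mapping of account names to credentials. This is a sketch under that assumption; the function name build_account_configs and the dictionary field names application_id, directory_id, and credential are hypothetical.

```python
def build_account_configs(accounts: dict) -> dict:
    """Build per-account fs.adl.account.<account-name>.oauth2.* settings.

    accounts maps an account name (for example "example1") to a dict with the
    keys "application_id", "directory_id", and "credential" (hypothetical names).
    """
    # The token provider type is set once, without an account qualifier.
    configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential"}
    for name, info in accounts.items():
        prefix = f"fs.adl.account.{name}.oauth2"
        configs[f"{prefix}.client.id"] = info["application_id"]
        configs[f"{prefix}.credential"] = info["credential"]
        configs[f"{prefix}.refresh.url"] = (
            f"https://login.microsoftonline.com/{info['directory_id']}/oauth2/token"
        )
    return configs
```

Each resulting key and value can be passed to spark.conf.set in a notebook, or prefixed with spark.hadoop. for the cluster Spark configuration.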

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.

ADLS Gen1 service principal notebook

