Access Azure Data Lake Storage Gen1 from Azure Databricks

Article
03/01/2024

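Microsoft has announced the planned retirement of Azure Data Lake Storage Gen1 (formerly Azure Data Lake Store, also known as ADLS) and recommends that all users migrate to Azure Data Lake Storage Gen2. Databricks recommends upgrading to Azure Data Lake Storage Gen2 for best performance and new features.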

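There are two ways to access Azure Data Lake Storage Gen1:

- Pass your Microsoft Entra ID (formerly Azure Active Directory) credentials, also known as credential passthrough.
- Use a service principal directly.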

Access automatically with your Microsoft Entra ID (formerly Azure Active Directory) credentials

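You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Microsoft Entra ID identity that you use to log in to Azure Databricks. When you enable a cluster for Microsoft Entra ID credential passthrough, commands that you run on that cluster can read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.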

For complete setup and usage instructions, see Access Azure Data Lake Storage using Microsoft Entra ID (formerly Azure Active Directory) credential passthrough (legacy).

Create and grant permissions to a service principal

If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:

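1. Create a Microsoft Entra ID (formerly Azure Active Directory) application and service principal that can access resources. Note the following properties:
   - application-id: An ID that uniquely identifies the client application.
   - directory-id: An ID that uniquely identifies the Microsoft Entra ID instance.
   - service-credential: A string that the application uses to prove its identity.
2. Register the service principal on the Azure Data Lake Storage Gen1 account, granting the correct role assignment, such as Contributor.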

Access directly with Spark APIs using a service principal and OAuth 2.0

To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

where

dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.
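The four settings above can also be assembled in one place and applied in a loop. The following is a minimal sketch, not part of the original article: the helper names `adl_oauth_configs` and `apply_configs` are hypothetical, and the only Spark call assumed is `conf.set(key, value)`, as used in the snippet above.

```python
from typing import Dict


def adl_oauth_configs(application_id: str, directory_id: str, credential: str) -> Dict[str, str]:
    """Build the four fs.adl.oauth2.* settings for one service principal.

    Hypothetical helper: it only assembles the keys shown in the snippet above.
    """
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": application_id,
        "fs.adl.oauth2.credential": credential,
        "fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_configs(conf, configs: Dict[str, str]) -> None:
    """Apply each key/value pair to any object with a set(key, value) method,
    such as spark.conf in a notebook."""
    for key, value in configs.items():
        conf.set(key, value)


# In a Databricks notebook (where spark and dbutils are predefined), usage
# might look like this:
#
# apply_configs(spark.conf, adl_oauth_configs(
#     "<application-id>",
#     "<directory-id>",
#     dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>")))
```

Keeping the dictionary separate from the call that applies it means the same settings can be reused, for example as the configuration passed when mounting.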

After you've set up your credentials, you can use standard Spark and Databricks APIs to access the resources. For example:

val df = spark.read.format("parquet").load("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")

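Azure Data Lake Storage Gen1 provides directory-level access control, so the service principal must have access to the directories that you want to read from as well as to the Azure Data Lake Storage Gen1 resource.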
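The adl:// paths in the examples above all follow the same pattern: the storage resource name, the azuredatalakestore.net host suffix, and an optional directory. As a small illustration, the hypothetical helper below (`adl_uri` is not a Databricks or Spark function) builds such a path from its parts.

```python
def adl_uri(storage_resource: str, directory: str = "") -> str:
    """Build an adl:// URI for a Gen1 storage resource and optional directory.

    Hypothetical helper for illustration; it performs string formatting only
    and does not check that the resource or directory exists.
    """
    base = f"adl://{storage_resource}.azuredatalakestore.net"
    directory = directory.strip("/")  # tolerate leading/trailing slashes
    return f"{base}/{directory}" if directory else f"{base}/"
```

The resulting string can be passed wherever the examples use a literal adl:// path, such as `spark.read.format("parquet").load(...)` or `dbutils.fs.ls(...)`. The service principal still needs access to the directory that the path names.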

Access through the metastore

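To access adl:// locations specified in the metastore, you must specify Hadoop credential configuration options as Spark options when you create the cluster. Add the spark.hadoop. prefix to the corresponding Hadoop configuration keys so that they propagate to the Hadoop configurations used by the metastore: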

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential
spark.hadoop.fs.adl.oauth2.client.id <application-id>
spark.hadoop.fs.adl.oauth2.credential <service-credential>
spark.hadoop.fs.adl.oauth2.refresh.url https://login.microsoftonline.com/<directory-id>/oauth2/token

Warning

These credentials are available to all users who access the cluster.
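The prefixing rule described above is mechanical: each Hadoop key gains a spark.hadoop. prefix and is written as a "key value" line. The sketch below shows that transformation. The function name `to_cluster_spark_conf` is hypothetical, and the output format assumes the space-separated layout shown in the example above.

```python
from typing import Dict


def to_cluster_spark_conf(hadoop_configs: Dict[str, str]) -> str:
    """Render Hadoop settings as cluster Spark config lines.

    Each key is prefixed with "spark.hadoop." and followed by a space and
    its value, one setting per line. Hypothetical helper for illustration.
    """
    lines = []
    for key, value in hadoop_configs.items():
        lines.append(f"spark.hadoop.{key} {value}")
    return "\n".join(lines)
```

Because the rendered text contains the service credential itself, the warning above applies to anything produced this way.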

Mount an Azure Data Lake Storage Gen1 resource or folder

To mount an Azure Data Lake Storage Gen1 resource or a folder inside it, use the following command:

Python

configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential",
          "fs.adl.oauth2.client.id": "<application-id>",
          "fs.adl.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
          "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Scala

val configs = Map(
  "fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  "fs.adl.oauth2.client.id" -> "<application-id>",
  "fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.adl.oauth2.refresh.url" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

where

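<mount-name> is a DBFS path that represents where the Azure Data Lake Storage Gen1 account, or a folder inside it (specified in source), will be mounted in DBFS.
dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>") retrieves the service credential that has been stored as a secret in a secret scope.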
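A mount cell is often re-run, and mounting onto a mount point that is already in use typically fails. One possible guard, sketched below, is to check the existing mounts first. The helper `mount_if_absent` is hypothetical. It assumes that `dbutils.fs.mounts()` returns entries exposing a `mountPoint` attribute and that `dbutils.fs.mount` accepts the Python keyword arguments used in the command above. It takes `dbutils` as a parameter so that it can be defined outside a notebook.

```python
def mount_if_absent(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount source at mount_point unless that mount point already exists.

    Returns True if a new mount was created and False if mount_point was
    already present. Hypothetical helper; it does not verify that an existing
    mount points at the same source.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=configs)
    return True
```

In a notebook this might be called as `mount_if_absent(dbutils, "adl://<storage-resource>.azuredatalakestore.net/<directory-name>", "/mnt/<mount-name>", configs)`, with `configs` defined as in the Python command above.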

Access files in your container as if they were local files, for example:

Python

df = spark.read.format("text").load("/mnt/<mount-name>/....")
df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Scala

val df = spark.read.format("text").load("/mnt/<mount-name>/....")
val df = spark.read.format("text").load("dbfs:/mnt/<mount-name>/....")

Set up service credentials for multiple accounts

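You can set up service credentials for multiple Azure Data Lake Storage Gen1 accounts for use within a single Spark session by adding account.<account-name> to the configuration keys. For example, if you want to set up credentials for accounts to access both adl://example1.azuredatalakestore.net and adl://example2.azuredatalakestore.net, you can do so as follows: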

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")

spark.conf.set("fs.adl.account.example1.oauth2.client.id", "<application-id-example1>")
spark.conf.set("fs.adl.account.example1.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example1>"))
spark.conf.set("fs.adl.account.example1.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example1>/oauth2/token")

spark.conf.set("fs.adl.account.example2.oauth2.client.id", "<application-id-example2>")
spark.conf.set("fs.adl.account.example2.oauth2.credential", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential-example2>"))
spark.conf.set("fs.adl.account.example2.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id-example2>/oauth2/token")

This also works for the cluster Spark configuration:

spark.hadoop.fs.adl.oauth2.access.token.provider.type ClientCredential

spark.hadoop.fs.adl.account.example1.oauth2.client.id <application-id-example1>
spark.hadoop.fs.adl.account.example1.oauth2.credential <service-credential-example1>
spark.hadoop.fs.adl.account.example1.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example1>/oauth2/token

spark.hadoop.fs.adl.account.example2.oauth2.client.id <application-id-example2>
spark.hadoop.fs.adl.account.example2.oauth2.credential <service-credential-example2>
spark.hadoop.fs.adl.account.example2.oauth2.refresh.url https://login.microsoftonline.com/<directory-id-example2>/oauth2/token
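The per-account keys follow a regular pattern: the segment account.<account-name> is inserted after fs.adl, while the token provider type remains a single shared setting. A minimal sketch that generates these keys for any number of accounts follows. The function names and the tuple layout of the input are assumptions made for illustration.

```python
from typing import Dict, Tuple


def adl_account_configs(account_name: str, application_id: str,
                        directory_id: str, credential: str) -> Dict[str, str]:
    """Build the account-scoped OAuth settings for one Gen1 account."""
    prefix = f"fs.adl.account.{account_name}.oauth2"
    return {
        f"{prefix}.client.id": application_id,
        f"{prefix}.credential": credential,
        f"{prefix}.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def multi_account_configs(accounts: Dict[str, Tuple[str, str, str]]) -> Dict[str, str]:
    """Combine the shared provider type with per-account settings.

    accounts maps an account name to a tuple of
    (application_id, directory_id, credential).
    """
    configs = {"fs.adl.oauth2.access.token.provider.type": "ClientCredential"}
    for account_name, (application_id, directory_id, credential) in accounts.items():
        configs.update(
            adl_account_configs(account_name, application_id, directory_id, credential))
    return configs
```

In a notebook, each resulting pair could then be passed to `spark.conf.set(key, value)`, with the credential values read through `dbutils.secrets.get` as in the example above.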

The following notebook demonstrates how to access Azure Data Lake Storage Gen1 directly and with a mount.

ADLS Gen1 service principal notebook
