Databricks SDK for R

Article
04/30/2024

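Note

This article covers the Databricks SDK for R by Databricks Labs, which is in an Experimental state. To provide feedback, ask questions, and report issues, use the Issues tab in the Databricks SDK for R repository in GitHub.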

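In this article, you learn how to automate operations in Azure Databricks workspaces and related resources with the Databricks SDK for R. This article supplements the Databricks SDK for R documentation.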

Note

The Databricks SDK for R does not support the automation of operations in Azure Databricks accounts. To call account-level operations, use a different Databricks SDK, for example:

- Databricks SDK for Python
- Databricks SDK for Java
- Databricks SDK for Go

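Before you begin

Before you begin to use the Databricks SDK for R, your development machine must have:

- An Azure Databricks personal access token for the target Azure Databricks workspace that you want to automate.

Note

The Databricks SDK for R supports only Azure Databricks personal access token authentication.

- R, and optionally an R-compatible integrated development environment (IDE). Databricks recommends RStudio Desktop and uses it in this article's instructions.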

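Get started with the Databricks SDK for R

Make your Azure Databricks workspace URL and personal access token available to your R project's scripts. For example, you can add the following to an R project's .Renviron file. Replace <your-workspace-url> with your per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net. Replace <your-personal-access-token> with your Azure Databricks personal access token, for example dapi12345678901234567890123456789012.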
```
DATABRICKS_HOST=<your-workspace-url>
DATABRICKS_TOKEN=<your-personal-access-token>
```
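To create an Azure Databricks personal access token, do the following:
1. In your Azure Databricks workspace, click your Azure Databricks username in the top bar, and then select Settings from the drop-down list.
2. Click Developer.
3. Next to Access tokens, click Manage.
4. Click Generate new token.
5. (Optional) Enter a comment that helps you identify this token in the future, and change the token's default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty.
6. Click Generate.
7. Copy the displayed token to a secure location, and then click Done.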
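Note

Be sure to save the copied token in a secure location. Do not share the copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (Revoke) icon next to the token on the Access tokens page.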

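If you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. See your workspace administrator or the following topics:
- Enable or disable personal access token authentication for the workspace
- Personal access token permissions
For additional ways to provide your Azure Databricks workspace URL and personal access token, see Authentication in the Databricks SDK for R repository in GitHub.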

Important

Do not add .Renviron files to version control systems, because doing so risks exposing sensitive information such as Azure Databricks personal access tokens.
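
Before calling the SDK, it can help to confirm that R actually sees both variables, because .Renviron is read only when an R session starts. The following is a minimal sketch that uses base R alone. The helper name `check_databricks_env` is hypothetical and not part of the SDK. It reports which variables are missing without ever printing the token value.

```r
# Hypothetical helper (not part of the SDK): returns the names of required
# environment variables that are unset or empty in the current R session.
check_databricks_env <- function(required = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")) {
  values <- Sys.getenv(required, unset = "")
  missing_vars <- required[!nzchar(values)]
  if (length(missing_vars) > 0) {
    warning("Missing environment variables: ", paste(missing_vars, collapse = ", "))
  }
  invisible(missing_vars)
}

# If you edited .Renviron after starting R, reload it first:
# readRenviron(".Renviron")
missing_vars <- check_databricks_env()
```

An empty result means both variables are visible to the session, so a client created afterward should be able to pick them up.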
Install the Databricks SDK for R package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands, one at a time:
```
install.packages("devtools")
library(devtools)
install_github("databrickslabs/databricks-sdk-r")
```
Note

The Databricks SDK for R package is not available on CRAN.
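
Because the package is installed from GitHub rather than CRAN, a setup script may want to install it only when it is absent. The sketch below assumes the installed package is named `databricks`, as the `require(databricks)` calls in this article imply. The helper names `is_missing` and `install_sdk_if_missing` are hypothetical.

```r
# Hypothetical helper: TRUE when the named package cannot be loaded.
is_missing <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Hypothetical helper: installs devtools and the SDK only when needed.
# Returns (invisibly) whether the SDK is available afterward.
install_sdk_if_missing <- function() {
  if (is_missing("devtools")) {
    install.packages("devtools")
  }
  if (is_missing("databricks")) {
    devtools::install_github("databrickslabs/databricks-sdk-r")
  }
  invisible(!is_missing("databricks"))
}

# Call install_sdk_if_missing() once from the console; it needs network access.
```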
Add code to reference the Databricks SDK for R and to list all of the clusters in your Azure Databricks workspace. For example, in a project's main.r file, the code might be as follows:
```
require(databricks)

client <- DatabricksClient()

list_clusters(client)[, "cluster_name"]
```
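Run your script. For example, in RStudio Desktop, in the script editor with the project's main.r file active, click Source > Source or Source with Echo.
The list of clusters appears. For example, in RStudio Desktop, this is in the Console view.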

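Code examples

The following code examples demonstrate how to use the Databricks SDK for R to create and delete clusters, and to create jobs.

- Create a cluster
- Permanently delete a cluster
- Create a job

Create a cluster

This code example creates a cluster with the specified Databricks Runtime version and cluster node type. This cluster has one worker, and the cluster automatically terminates after 15 minutes of idle time.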

```
require(databricks)

client <- DatabricksClient()

# Create a single-worker cluster that terminates after 15 idle minutes.
response <- create_cluster(
  client = client,
  cluster_name = "my-cluster",
  spark_version = "12.2.x-scala2.12",
  node_type_id = "Standard_DS3_v2",
  autotermination_minutes = 15,
  num_workers = 1
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}

print(paste0(
  "View the cluster at ",
  host,
  "#setting/clusters/",
  response$cluster_id,
  "/configuration"
))
```
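
The trailing-slash check and URL assembly in this example can be factored into a small reusable function. This is a sketch that uses base R only. `workspace_url` is a hypothetical helper, and the host and cluster ID below are placeholders. Taking the host from `Sys.getenv("DATABRICKS_HOST")` instead of parsing `client$debug_string()` is an assumption that holds only if you configured the host through .Renviron.

```r
# Hypothetical helper: joins a workspace host and URL fragments,
# adding a trailing slash to the host only when it is missing.
workspace_url <- function(host, ...) {
  stopifnot(is.character(host), length(host) == 1, nzchar(host))
  if (!endsWith(host, "/")) {
    host <- paste0(host, "/")
  }
  paste0(host, ...)
}

# Example with a placeholder host and cluster ID.
url <- workspace_url(
  "https://adb-1234567890123456.7.azuredatabricks.net",
  "#setting/clusters/", "1234-567890-ab123cd4", "/configuration"
)
print(url)
```

In a script that has a real response object, the same call would take `response$cluster_id` in place of the placeholder ID.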

Permanently delete a cluster

This code example permanently deletes the cluster with the specified cluster ID from the workspace.

```
require(databricks)

client <- DatabricksClient()

cluster_id <- readline("ID of the cluster to delete (for example, 1234-567890-ab123cd4):")

delete_cluster(client, cluster_id)
```
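
Because deletion is permanent, it may be worth validating the input before calling `delete_cluster`. The sketch below rests on an assumption: the pattern is inferred only from the example ID shown in the prompt (1234-567890-ab123cd4) and might not match every real cluster ID. `looks_like_cluster_id` is a hypothetical helper. Note also that `readline` returns an empty string immediately in non-interactive sessions such as `Rscript`, so an empty value should be rejected.

```r
# Hypothetical helper: checks that the input resembles the example ID format
# (4 digits, 6 digits, 8 lowercase alphanumerics). The pattern is an assumption.
looks_like_cluster_id <- function(x) {
  x <- trimws(x)
  nzchar(x) && grepl("^[0-9]{4}-[0-9]{6}-[a-z0-9]{8}$", x)
}

cluster_id <- "1234-567890-ab123cd4"  # stand-in for the readline() result
if (looks_like_cluster_id(cluster_id)) {
  message("ID format looks plausible; the script could proceed to delete_cluster().")
} else {
  message("Refusing to delete: input does not look like a cluster ID.")
}
```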

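Create a job

This code example creates an Azure Databricks job that can be used to run the specified notebook on the specified cluster. As this code runs, it gets the existing notebook's path, the existing cluster ID, and related job settings from the user at the console.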

```
require(databricks)

client <- DatabricksClient()

job_name <- readline("Some short name for the job (for example, my-job):")
description <- readline("Some short description for the job (for example, My job):")
existing_cluster_id <- readline("ID of the existing cluster in the workspace to run the job on (for example, 1234-567890-ab123cd4):")
notebook_path <- readline("Workspace path of the notebook to run (for example, /Users/someone@example.com/my-notebook):")
task_key <- readline("Some key to apply to the job's tasks (for example, my-key):")

print("Attempting to create the job. Please wait...")

notebook_task <- list(
  notebook_path = notebook_path,
  source = "WORKSPACE"
)

job_task <- list(
  task_key = task_key,
  description = description,
  existing_cluster_id = existing_cluster_id,
  notebook_task = notebook_task
)

response <- create_job(
  client,
  name = job_name,
  tasks = list(job_task)
)

# Get the workspace URL to be used in the following results message.
get_client_debug <- strsplit(client$debug_string(), split = "host=")
get_host <- strsplit(get_client_debug[[1]][2], split = ",")
host <- get_host[[1]][1]

# Make sure the workspace URL ends with a forward slash.
if (!endsWith(host, "/")) {
  host <- paste0(host, "/")
}

print(paste0(
  "View the job at ",
  host,
  "#job/",
  response$job_id
))
```
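
The nested lists in this example mirror the shape of the job settings: a job holds a list of tasks, and each task holds a `notebook_task`. The following sketch wraps that construction in a function so that the structure can be inspected before it is passed to `create_job`. `make_notebook_task` is a hypothetical helper; the field names are the ones this example already uses, and the values are the placeholders from its prompts.

```r
# Hypothetical helper: builds one task entry in the same shape as the
# job_task list in the job example.
make_notebook_task <- function(task_key, description, existing_cluster_id, notebook_path) {
  list(
    task_key = task_key,
    description = description,
    existing_cluster_id = existing_cluster_id,
    notebook_task = list(
      notebook_path = notebook_path,
      source = "WORKSPACE"
    )
  )
}

# Placeholder values taken from the prompts' examples.
tasks <- list(make_notebook_task(
  task_key = "my-key",
  description = "My job",
  existing_cluster_id = "1234-567890-ab123cd4",
  notebook_path = "/Users/someone@example.com/my-notebook"
))

# Inspect the structure before passing `tasks` to create_job().
str(tasks)
```

Keeping `tasks` as a list of task entries, even when there is only one, matches the `tasks = list(job_task)` argument in the example.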

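Logging

You can use the popular logging package to log messages. This package supports multiple logging levels and custom log formats. You can use this package to log messages to the console or to a file. To log messages, do the following:

Install the logging package. For example, in RStudio Desktop, in the Console view (View > Move Focus to Console), run the following commands: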
```
install.packages("logging")
library(logging)
```
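Bootstrap the logging package, set where to log the messages, and set the logging level. For example, the following code logs all messages at the ERROR level and more severe to the results.log file.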
```
basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")
```

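Log messages as needed. For example, the following code logs any errors if the code cannot authenticate or list the names of the available clusters.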

```
require(databricks)
require(logging)

basicConfig()
addHandler(writeToFile, file="results.log")
setLevel("ERROR")

tryCatch({
  client <- DatabricksClient()
}, error = function(e) {
  logerror(paste("Error initializing DatabricksClient(): ", e$message))
  return(NA)
})

tryCatch({
  list_clusters(client)[, "cluster_name"]
}, error = function(e) {
  logerror(paste("Error in list_clusters(client): ", e$message))
  return(NA)
})
```
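
The two `tryCatch` blocks repeat the same pattern, and the `NA` that each handler returns is not assigned to a variable, so later code cannot tell that a step failed. One way to tighten this, sketched here under stated assumptions, is a small wrapper that logs the error and returns a fallback value the caller can check. `try_logged` is a hypothetical helper. It defaults to base R's `message` so the sketch runs without extra packages; you could pass `logging::logerror` as `log_fn` instead.

```r
# Hypothetical helper: evaluates `expr`; on error, logs a labeled message
# through `log_fn` and returns `fallback` instead of stopping the script.
try_logged <- function(expr, label, log_fn = message, fallback = NULL) {
  tryCatch(
    expr,
    error = function(e) {
      log_fn(paste0("Error in ", label, ": ", conditionMessage(e)))
      fallback
    }
  )
}

# Stand-in for an SDK call that fails, so the sketch runs without a workspace.
failing_call <- function() stop("authentication failed")

result <- try_logged(failing_call(), label = "failing_call()")
if (is.null(result)) {
  message("Skipping the remaining steps because the call failed.")
}
```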

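Testing

To test your code, you can use R test frameworks such as testthat. To test your code under simulated conditions without calling Azure Databricks REST API endpoints or changing the state of your Azure Databricks accounts or workspaces, you can use R mocking libraries such as mockery.

For example, given the following file named helpers.R containing a createCluster function that returns information about the new cluster: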

```
library(databricks)

createCluster <- function(
  databricks_client,
  cluster_name,
  spark_version,
  node_type_id,
  autotermination_minutes,
  num_workers
) {
  response <- create_cluster(
    client = databricks_client,
    cluster_name = cluster_name,
    spark_version = spark_version,
    node_type_id = node_type_id,
    autotermination_minutes = autotermination_minutes,
    num_workers = num_workers
  )
  return(response)
}
```

And given the following file named main.R that calls the createCluster function:

```
library(databricks)
source("helpers.R")

client <- DatabricksClient()

# Replace <spark-version> with the target Spark version string.
# Replace <node-type-id> with the target node type string.
response <- createCluster(
  databricks_client = client,
  cluster_name = "my-cluster",
  spark_version = "<spark-version>",
  node_type_id = "<node-type-id>",
  autotermination_minutes = 15,
  num_workers = 1
)

print(response$cluster_id)
```

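The following file named test-helpers.R tests whether the createCluster function returns the expected response. Rather than creating a cluster in the target workspace, this test mocks a DatabricksClient object, defines the mock object's settings, and then passes the mock object to the createCluster function. The test then checks whether the function returns the new mock cluster's expected ID.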

```
# install.packages("testthat")
# install.packages("mockery")
# testthat::test_file("test-helpers.R")
lapply(c("databricks", "testthat", "mockery"), library, character.only = TRUE)
source("helpers.R")

test_that("createCluster mock returns expected results", {
  # Create a mock response.
  mock_response <- list(cluster_id = "abc123")

  # Create a mock function for create_cluster().
  mock_create_cluster <- mock(return_value = mock_response)

  # Run the test with the mock function.
  with_mock(
    create_cluster = mock_create_cluster,
    {
      # Create a mock Databricks client.
      mock_client <- mock()

      # Call the function with the mock client.
      # Replace <spark-version> with the target Spark version string.
      # Replace <node-type-id> with the target node type string.
      response <- createCluster(
        databricks_client = mock_client,
        cluster_name = "my-cluster",
        spark_version = "<spark-version>",
        node_type_id = "<node-type-id>",
        autotermination_minutes = 15,
        num_workers = 1
      )

      # Check that the function returned the correct mock response.
      expect_equal(response$cluster_id, "abc123")
    }
  )
})
```
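
If `with_mock()` is not available in your version of testthat, one alternative that needs no mocking library is to pass the cluster-creation function in as an argument. The following is a sketch of that approach, not the approach this article's test uses. `createClusterWith` and `fake_create_cluster` are hypothetical names, and plain `stopifnot` stands in for testthat expectations so that the example runs with base R alone.

```r
# Hypothetical variant of createCluster: the function that performs the
# creation is injected, so a test can supply a fake instead of the SDK call.
createClusterWith <- function(create_fn, databricks_client, ...) {
  create_fn(client = databricks_client, ...)
}

# Fake that records its arguments and returns a canned response.
calls <- list()
fake_create_cluster <- function(client, ...) {
  calls[[length(calls) + 1]] <<- list(...)
  list(cluster_id = "abc123")
}

response <- createClusterWith(
  fake_create_cluster,
  databricks_client = NULL,
  cluster_name = "my-cluster",
  autotermination_minutes = 15,
  num_workers = 1
)

# The fake was called once with the expected arguments.
stopifnot(identical(response$cluster_id, "abc123"))
stopifnot(length(calls) == 1, identical(calls[[1]]$cluster_name, "my-cluster"))
```

In production code, the same function would receive the SDK's `create_cluster` as `create_fn` and a real client as `databricks_client`.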

Additional resources

For more information, see
