Manage clusters

This article describes how to manage Azure Databricks clusters, including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs.

Display clusters

To display the clusters in your workspace, click the clusters icon in the sidebar.

The Clusters page displays clusters in two tabs: All-Purpose Clusters and Job Clusters.

  • All-purpose clusters

  • Job clusters

Each tab includes:

  • Cluster name
  • State
  • Number of nodes
  • Type of driver and worker nodes
  • Databricks Runtime version
  • Cluster creator or job owner

In addition to the common cluster information, the All-Purpose Clusters tab shows the number of notebooks attached to each cluster. Above the list is the number of pinned clusters.

An icon to the left of an all-purpose cluster name indicates whether the cluster is pinned, whether it is a high concurrency cluster, and whether table access control is enabled:

  • Pinned
  • Starting, Terminating
  • Standard cluster
    • Running
    • Terminated
  • High concurrency cluster
    • Running
    • Terminated
  • Access denied
    • Running (locked)
    • Terminated (locked)
  • Table ACLs enabled
    • Running
    • Terminated

Links and buttons at the far right of an all-purpose cluster row provide access to the Spark UI and logs and to the terminate, restart, clone, permissions, and delete actions.

Cluster actions

Links and buttons at the far right of a job cluster row provide access to the Job Run page, the Spark UI and logs, and the terminate, clone, and permissions actions.

Cluster actions

Filter cluster list

You can filter the cluster list using the buttons and Filter field at the top right:

Filter clusters

  • To display only clusters that you created, click Created by me.
  • To display only clusters that are accessible to you (if cluster access control is enabled), click Accessible by me.
  • To filter by a string that appears in any field, type the string in the Filter text box.

Pin a cluster

Thirty days after a cluster is terminated, it is permanently deleted. To keep an all-purpose cluster configuration even after the cluster has been terminated for more than 30 days, an administrator can pin the cluster. Up to 20 clusters can be pinned.

You can pin a cluster from the:

  • Cluster list

    To pin or unpin a cluster, click the pin icon to the left of the cluster name.

    Pin cluster in cluster list

  • Cluster detail page

    To pin or unpin a cluster, click the pin icon to the right of the cluster name.

    Pin cluster in cluster detail

You can also invoke the Pin API endpoint to programmatically pin a cluster.
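As a sketch of that call: the Pin endpoint needs only the cluster ID. The workspace URL, token, and the `/api/2.0/clusters/pin` path below are assumptions based on the public Clusters REST API, not details from this page.

```python
import json
import urllib.request

def pin_cluster_request(host: str, token: str, cluster_id: str) -> urllib.request.Request:
    """Build (but do not send) a request for the assumed
    POST /api/2.0/clusters/pin endpoint."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/pin",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # personal access token
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is a one-liner once the placeholders are real values:
# urllib.request.urlopen(pin_cluster_request(host, token, cluster_id))
req = pin_cluster_request("https://adb-example.azuredatabricks.net",
                          "<personal-access-token>", "0123-456789-example")
print(req.full_url)
```

The unpin call is symmetric (an assumed `/api/2.0/clusters/unpin` path with the same body).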

View a cluster configuration as a JSON file

Sometimes it can be helpful to view your cluster configuration as JSON. This is especially useful when you want to create similar clusters using the Clusters API. When you view an existing cluster, go to the Configuration tab, click JSON in the top right of the tab, copy the JSON, and paste it into your API call. The JSON view is read-only.

Cluster configuration JSON
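A sketch of that workflow: JSON copied from the Configuration tab can be turned into a create payload. The `/api/2.0/clusters/create` path and the list of output-only fields to strip are assumptions, not details from this page.

```python
import json

# Hypothetical JSON copied from an existing cluster's Configuration tab.
copied_json = """{
    "cluster_id": "0123-456789-example",
    "cluster_name": "dev",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2
}"""

def create_payload_from_json(config_json: str, new_name: str) -> dict:
    """Prepare a payload for the assumed POST /api/2.0/clusters/create
    endpoint from JSON copied out of the UI."""
    cfg = json.loads(config_json)
    # Fields assumed to be server-assigned; they must not be sent back.
    for output_only in ("cluster_id", "state", "start_time"):
        cfg.pop(output_only, None)
    cfg["cluster_name"] = new_name
    return cfg

payload = create_payload_from_json(copied_json, "dev-similar")
print(json.dumps(payload, indent=2))
```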

Edit a cluster

You edit a cluster configuration from the cluster detail page.

Cluster detail

You can also invoke the Edit API endpoint to programmatically edit the cluster.
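A sketch of a programmatic edit: in the public Clusters API the edit endpoint takes a full cluster specification rather than a patch, so the usual pattern is to merge changes into the current configuration. The field names here are illustrative.

```python
def edit_payload(current_config: dict, **changes) -> dict:
    """Merge changes into a full cluster spec for the assumed
    POST /api/2.0/clusters/edit endpoint, which replaces the whole
    configuration rather than patching individual fields."""
    payload = dict(current_config)  # don't mutate the caller's dict
    payload.update(changes)
    return payload

current = {"cluster_id": "0123-456789-example",
           "spark_version": "7.3.x-scala2.12",
           "num_workers": 2}
# Resizing is one of the edits that does not force a restart.
resized = edit_payload(current, num_workers=8)
print(resized["num_workers"])  # → 8
```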

Note

  • Notebooks and jobs that were attached to the cluster remain attached after editing.
  • Libraries installed on the cluster remain installed after editing.
  • If you edit any attribute of a running cluster (other than the cluster size and permissions), you must restart it. This can disrupt users who are currently using the cluster.
  • You can edit only running or terminated clusters. You can, however, update permissions for clusters in other states on the cluster details page.

For detailed information about the cluster configuration properties you can edit, see Configure clusters.

Clone a cluster

You can create a new cluster by cloning an existing cluster.

  • Cluster list

    Clone cluster in cluster list

  • Cluster detail page

    Clone cluster in cluster detail

The cluster creation form opens, prepopulated with the cluster configuration. The following attributes from the existing cluster are not included in the clone:

  • Cluster permissions
  • Installed libraries
  • Attached notebooks

Control access to clusters

Cluster access control allows admins and delegated users to give fine-grained cluster access to other users. Broadly, there are two types of cluster access control:

  1. Cluster creation permission: Admins can choose which users are allowed to create clusters.

    Cluster creation permission

  2. Cluster-level permissions: A user who has the Can Manage permission for a cluster can configure whether other users can attach to, restart, resize, and manage that cluster by clicking the permissions icon in the cluster actions.

    Cluster permissions

To learn how to configure cluster access control and cluster-level permissions, see Cluster access control.

Start a cluster

Apart from creating a new cluster, you can also start a previously terminated cluster. This lets you re-create a previously terminated cluster with its original configuration.

You can start a cluster from the:

  • Cluster list:

    Start cluster from cluster list

  • Cluster detail page:

    Start cluster from cluster detail

  • Notebook cluster attach drop-down:

    Start cluster from notebook attach drop-down

You can also invoke the Start API endpoint to programmatically start a cluster.
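Start, restart, and terminate all take just the cluster ID, so one helper can cover them. The endpoint paths below are assumptions based on the public 2.0 Clusters API (where "delete" means terminate, not permanent deletion):

```python
import json

# Assumed endpoint paths for the simple lifecycle calls.
LIFECYCLE_PATHS = {
    "start": "/api/2.0/clusters/start",
    "restart": "/api/2.0/clusters/restart",
    "terminate": "/api/2.0/clusters/delete",  # "delete" terminates; config is kept
}

def lifecycle_call(host: str, action: str, cluster_id: str) -> tuple:
    """Return the (url, json_body) pair for a POST lifecycle call."""
    path = LIFECYCLE_PATHS[action]
    return f"{host}{path}", json.dumps({"cluster_id": cluster_id})

url, body = lifecycle_call("https://adb-example.azuredatabricks.net",
                           "start", "0123-456789-example")
print(url)
```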

Azure Databricks identifies a cluster with a unique cluster ID. When you start a terminated cluster, Databricks re-creates the cluster with the same ID, automatically installs all the libraries, and reattaches the notebooks.

Note

If you are using a Trial workspace and the trial has expired, you will not be able to start a cluster.

Cluster autostart for jobs

When a job assigned to an existing terminated cluster is scheduled to run, or when you connect to a terminated cluster from a JDBC/ODBC interface, the cluster is automatically restarted. See Create a job and JDBC connect.

Cluster autostart allows you to configure clusters to autoterminate without requiring manual intervention to restart them for scheduled jobs. Furthermore, you can schedule cluster initialization by scheduling a job to run on a terminated cluster.

Before a cluster is restarted automatically, cluster and job access control permissions are checked.

Note

If your cluster was created in Azure Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to run on terminated clusters will fail.

Terminate a cluster

To conserve cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its configuration is stored so that it can be reused (or, for some types of jobs, autostarted) at a later time. You can manually terminate a cluster or configure it to terminate automatically after a specified period of inactivity. Azure Databricks records information whenever a cluster is terminated.

Termination reason

Note

When you run a job on a New Job Cluster (which is recommended), the cluster terminates and is unavailable for restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing All-Purpose Cluster that has been terminated, that cluster will autostart.

Important

If you are using a Trial Premium workspace, all running clusters are terminated:

  • When you upgrade the workspace to full Premium.
  • If the workspace is not upgraded and the trial expires.

Manual termination

You can manually terminate a cluster from the:

  • Cluster list

    Terminate cluster in cluster list

  • Cluster detail page

    Terminate cluster in cluster detail

Automatic termination

You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. If the difference between the current time and the last command run on the cluster is greater than the specified inactivity period, Azure Databricks automatically terminates the cluster.

A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing.

Warning

  • Clusters do not report activity resulting from the use of DStreams. This means that an autoterminating cluster may be terminated while it is running DStreams. Turn off auto termination for clusters running DStreams, or consider using Structured Streaming.
  • The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs have completed, a cluster may be terminated even while local processes are running.

Configure automatic termination

You configure automatic termination in the Auto Termination field in the Autopilot Options box on the cluster creation page:

Auto termination

Important

The default value of the auto termination setting depends on whether you create a standard or a high concurrency cluster:

  • Standard clusters are configured to terminate automatically after 120 minutes.
  • High concurrency clusters are configured to not terminate automatically.

You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of 0.
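The same setting can be made programmatically in a create or edit payload. `autotermination_minutes` is the field name assumed from the public Clusters API; as in the UI, 0 opts out:

```python
def with_auto_termination(cluster_spec: dict, minutes: int) -> dict:
    """Return a copy of a cluster spec with auto termination set.
    minutes=0 opts out, mirroring the UI behavior described above."""
    if minutes < 0:
        raise ValueError("inactivity period cannot be negative")
    spec = dict(cluster_spec)
    spec["autotermination_minutes"] = minutes  # assumed field name
    return spec

base = {"cluster_name": "etl", "num_workers": 2}
print(with_auto_termination(base, 120))  # the standard-cluster default
print(with_auto_termination(base, 0))    # opted out of auto termination
```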

Note

Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations that can result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report a stale activity time that leads to premature cluster termination. Upgrade to the most recent Spark version to benefit from bug fixes and improvements to auto termination.

Unexpected termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. For a list of termination reasons and remediation steps, see the Knowledge Base.

Delete a cluster

Deleting a cluster terminates the cluster and removes its configuration.

Warning

You cannot undo this action.

You cannot delete a pinned cluster. To delete a pinned cluster, it must first be unpinned by an administrator.

To delete a cluster, click the delete icon in the cluster actions on the Job Clusters or All-Purpose Clusters tab.

Delete cluster

You can also invoke the Permanent delete API endpoint to programmatically delete a cluster.
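A sketch of the programmatic call that also mirrors the pinned-cluster rule above. The `/api/2.0/clusters/permanent-delete` path is assumed from the public Clusters API, and the pinned check is a local guard supplied by the caller, not something the service returns here:

```python
import json

def permanent_delete_request(host: str, cluster_id: str, is_pinned: bool) -> tuple:
    """Return (url, json_body) for the assumed permanent-delete endpoint.
    Refuses pinned clusters up front, mirroring the UI rule that a
    pinned cluster must be unpinned before it can be deleted."""
    if is_pinned:
        raise ValueError("unpin the cluster before deleting it")
    url = f"{host}/api/2.0/clusters/permanent-delete"
    return url, json.dumps({"cluster_id": cluster_id})

url, body = permanent_delete_request("https://adb-example.azuredatabricks.net",
                                     "0123-456789-example", is_pinned=False)
print(url)
```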

View cluster information in the Apache Spark UI

Detailed information about Spark jobs is displayed in the Spark UI, which you can access from:

  • The cluster list: click the Spark UI link on the cluster row.
  • The cluster details page: click the Spark UI tab.

The Spark UI displays cluster history for both active and terminated clusters.

Spark UI

Note

If a terminated cluster is restarted, the Spark UI displays information for the restarted cluster, not the historical information for the terminated cluster.

View cluster logs

Azure Databricks provides three kinds of logging of cluster-related activity:

  • Cluster event logs
  • Apache Spark driver and worker logs
  • Cluster init-script logs

This section discusses cluster event logs and driver and worker logs. For details about init-script logs, see Init script logs.

Cluster event logs

The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Azure Databricks. Such events affect the operation of the cluster as a whole and the jobs running in the cluster.

For the supported event types, see the REST API ClusterEventType data structure.

Events are stored for 60 days, which is comparable to other data retention times in Azure Databricks.
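The event log can also be queried programmatically. The `/api/2.0/clusters/events` endpoint and its `limit`/`offset` paging fields are assumptions based on the public Clusters API; the event type names come from the ClusterEventType structure mentioned above:

```python
def events_query(cluster_id: str, event_types=None,
                 limit: int = 50, offset: int = 0) -> dict:
    """Payload for the assumed POST /api/2.0/clusters/events endpoint.
    event_types filters by ClusterEventType names; omit it to get all
    event types."""
    query = {"cluster_id": cluster_id, "limit": limit, "offset": offset}
    if event_types:
        query["event_types"] = list(event_types)
    return query

# Only termination-related events, e.g. while debugging unexpected shutdowns.
q = events_query("0123-456789-example", event_types=["TERMINATING"])
print(q)
```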

View a cluster event log

  1. Click the clusters icon in the sidebar.

  2. Click a cluster name.

  3. Click the Event Log tab.

    Event log

To filter the events, click the drop-down in the Filter by Event Type… field and select one or more event type checkboxes.

Use Select all to make it easier to filter by excluding particular event types.

Filter event log

View event details

For more information about an event, click its row in the log, and then click the JSON tab for details.

Event details

Cluster driver and worker logs

The direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. These logs have three outputs:

  • Standard output
  • Standard error
  • Log4j logs

To access these driver log files from the UI, go to the Driver Logs tab on the cluster details page.

Driver logs

Log files are rotated periodically. Older log files appear at the top of the page, listed with timestamp information. You can download any of the logs for troubleshooting.

To view Spark worker logs, you can use the Spark UI. You can also configure a log delivery location for the cluster. Both worker and cluster logs are delivered to the location you specify.
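The log delivery location is part of the cluster specification. The `cluster_log_conf` field shape below (a DBFS destination) is assumed from the public Clusters API:

```python
def with_log_delivery(cluster_spec: dict, dbfs_destination: str) -> dict:
    """Return a copy of a cluster spec that delivers driver and worker
    logs to a DBFS path (field shape assumed from the Clusters API)."""
    if not dbfs_destination.startswith("dbfs:/"):
        raise ValueError("expected a dbfs:/ destination")
    spec = dict(cluster_spec)
    spec["cluster_log_conf"] = {"dbfs": {"destination": dbfs_destination}}
    return spec

spec = with_log_delivery({"cluster_name": "etl"}, "dbfs:/cluster-logs")
print(spec["cluster_log_conf"])
```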

Monitor performance

To help you monitor the performance of Azure Databricks clusters, Azure Databricks provides access to Ganglia metrics from the cluster details page.

In addition, you can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure.

You can also install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.

Ganglia metrics

To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters.

Ganglia metrics

To view live metrics, click the Ganglia UI link.

To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time.

Configure metrics collection

By default, Azure Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES environment variable using an init script or in the spark_env_vars field in the Cluster Create API.
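In a create payload, that looks roughly like the sketch below. The `spark_env_vars` field name and the environment variable come from the text above; the rest of the spec is illustrative, and note that environment variable values are strings:

```python
def with_ganglia_period(cluster_spec: dict, minutes: int) -> dict:
    """Return a copy of a cluster spec that sets the Ganglia snapshot
    period via spark_env_vars, as described above. Environment variable
    values are strings, so the integer is converted."""
    spec = dict(cluster_spec)
    env = dict(spec.get("spark_env_vars", {}))
    env["DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES"] = str(minutes)
    spec["spark_env_vars"] = env
    return spec

spec = with_ganglia_period({"cluster_name": "etl"}, 5)
print(spec["spark_env_vars"])
```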

Azure Monitor

You can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure. For complete instructions, see Monitoring Azure Databricks.

Note

If you have deployed the Azure Databricks workspace in your own virtual network and have configured network security groups (NSGs) to deny all outbound traffic that is not required by Azure Databricks, you must configure an additional outbound rule for the AzureMonitor service tag.

Datadog metrics

Datadog metrics

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.

To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.

Install Datadog agent init script notebook

Get notebook