Manage clusters

This article describes how to manage Azure Databricks clusters, including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs.

Display clusters

To display the clusters in your workspace, click compute icon Compute in the sidebar.

The Compute page displays clusters in two tabs: All-purpose clusters and Job clusters.

all-purpose clusters

job clusters

At the left side are two columns indicating whether the cluster has been pinned and the status of the cluster.

At the far right of the All-purpose clusters tab is an icon you can use to terminate the cluster.

You can use the three-button menu three-button menu to restart, clone, delete, or edit permissions for the cluster. Menu options that are not available are grayed out.

vertical 3-button icon

The All-purpose clusters tab shows the number of notebooks notebooks icon attached to each cluster.

Filter cluster list

You can filter the cluster lists using the buttons and search box at the top right:

Filter clusters

Pin a cluster

A cluster is permanently deleted 30 days after it is terminated. To keep an all-purpose cluster configuration beyond this 30-day window, an administrator can pin the cluster. Up to 100 clusters can be pinned.

You can pin a cluster from the cluster list or the cluster detail page:

Pin cluster from cluster list

To pin or unpin a cluster, click the pin icon to the left of the cluster name.

Pin cluster in cluster list

Pin cluster from cluster detail page

To pin or unpin a cluster, click the pin icon to the right of the cluster name.

Pin cluster in cluster detail

You can also invoke the Pin API endpoint to programmatically pin a cluster.
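A pin call can be sketched with Python's standard library. The workspace URL, token, and cluster ID below are placeholders; the endpoint path follows the Clusters API:

```python
import json
import urllib.request

def pin_cluster_request(host: str, token: str, cluster_id: str) -> urllib.request.Request:
    """Build a POST request for the clusters Pin endpoint."""
    payload = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/pin",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Hypothetical workspace URL, token, and cluster ID:
req = pin_cluster_request(
    "https://adb-1234.5.azuredatabricks.net", "dapi-example-token", "0123-456789-abcde"
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```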

View a cluster configuration as a JSON file

Sometimes it can be helpful to view your cluster configuration as JSON, especially when you want to create similar clusters using the Clusters API 2.0. When you view an existing cluster, go to the Configuration tab, click JSON in the top right of the tab, copy the JSON, and paste it into your API call. The JSON view is read-only.

Cluster configuration JSON
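The copied JSON can serve as the body of a Clusters API 2.0 create call. A minimal sketch in Python; the field values here are illustrative examples, not values from your cluster:

```python
import json

# A minimal cluster spec in the shape shown by the JSON view
# (node type, Spark version, and autoscale bounds are example values).
cluster_spec = {
    "cluster_name": "cloned-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,
}

# Serialize it as the body of a POST to /api/2.0/clusters/create.
body = json.dumps(cluster_spec)
```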

Edit a cluster

You edit a cluster configuration from the cluster detail page. To display the cluster detail page, click the cluster name on the Compute page.

Cluster detail

You can also invoke the Edit API endpoint to programmatically edit the cluster.

Note

  • Notebooks and jobs that were attached to the cluster remain attached after editing.
  • Libraries installed on the cluster remain installed after editing.
  • If you edit any attribute of a running cluster (except for the cluster size and permissions), you must restart it. This can disrupt users who are currently using the cluster.
  • You can edit only running or terminated clusters. You can, however, update permissions for clusters that are not in those states on the cluster details page.

For detailed information about cluster configuration properties you can edit, see Configure clusters.

Clone a cluster

You can create a new cluster by cloning an existing cluster.

From the cluster list, click the three-button menu three-button menu and select Clone from the drop down.

Cluster list menu

From the cluster detail page, click more button and select Clone from the drop down.

Cluster detail menu

The cluster creation form opens prepopulated with the existing cluster's configuration. The following attributes from the existing cluster are not included in the clone:

  • Cluster permissions
  • Installed libraries
  • Attached notebooks

Control access to clusters

Cluster access control within the Admin Console allows admins and delegated users to give fine-grained cluster access to other users. There are two types of cluster access control:

  • Cluster creation permission: Admins can choose which users are allowed to create clusters.

    Cluster create permission

  • Cluster-level permissions: A user who has the Can manage permission for a cluster can configure whether other users can attach to, restart, resize, and manage that cluster from the cluster list or the cluster details page.

    From the cluster list, click the three-button menu three-button menu and select Edit Permissions.

    Cluster list menu

    From the cluster detail page, click more button and select Permissions.

    Cluster detail menu

To learn how to configure cluster access control and cluster-level permissions, see Cluster access control.

Start a cluster

Apart from creating a new cluster, you can also start a previously terminated cluster, re-creating it with its original configuration.

You can start a cluster from the cluster list, the cluster detail page, or a notebook.

  • To start a cluster from the cluster list, click the arrow:

    Start cluster from cluster list

  • To start a cluster from the cluster detail page, click Start:

    Start cluster from cluster detail

  • To start a cluster from a notebook, select it in the notebook's Attach cluster drop-down:

    Start cluster from notebook attach drop-down

You can also invoke the Start API endpoint to programmatically start a cluster.

Azure Databricks identifies a cluster with a unique cluster ID. When you start a terminated cluster, Databricks re-creates the cluster with the same ID, automatically installs all the libraries, and re-attaches the notebooks.
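Because a cluster keeps its ID across restarts, a script can start a cluster and then poll its state until it is ready. The sketch below uses a state-returning callable as a stand-in for a real clusters Get API call:

```python
import time
from typing import Callable

def wait_until_running(get_state: Callable[[], str],
                       timeout_s: int = 600, poll_s: int = 1) -> bool:
    """Poll a state-returning callable (e.g. a wrapper around the clusters
    Get endpoint) until the cluster reports RUNNING or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_state() == "RUNNING":
            return True
        time.sleep(poll_s)
    return False

# Example with a stand-in for the real API call:
states = iter(["PENDING", "PENDING", "RUNNING"])
ok = wait_until_running(lambda: next(states), timeout_s=10, poll_s=0)
```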

Note

If you are using a Trial workspace and the trial has expired, you will not be able to start a cluster.

Cluster autostart for jobs

When a job assigned to an existing terminated cluster is scheduled to run or you connect to a terminated cluster from a JDBC/ODBC interface, the cluster is automatically restarted. See Create a job and JDBC connect.

With cluster autostart, you can configure clusters to autoterminate without manual intervention, because scheduled jobs restart them automatically. You can also schedule cluster initialization by scheduling a job to run on a terminated cluster.

Before a cluster is restarted automatically, cluster and job access control permissions are checked.

Note

If your cluster was created in Azure Databricks platform version 2.70 or earlier, there is no autostart: jobs scheduled to run on terminated clusters will fail.

Terminate a cluster

To save cluster resources, you can terminate a cluster. A terminated cluster cannot run notebooks or jobs, but its configuration is stored so that it can be reused (or—in the case of some types of jobs—autostarted) at a later time. You can manually terminate a cluster or configure the cluster to automatically terminate after a specified period of inactivity. Azure Databricks records information whenever a cluster is terminated. When the number of terminated clusters exceeds 150, the oldest clusters are deleted.

Unless a cluster is pinned, 30 days after the cluster is terminated, it is automatically and permanently deleted.

Terminated clusters appear in the cluster list with a gray circle at the left of the cluster name.

Terminated cluster icon

Note

When you run a job on a New Job Cluster (which is usually recommended), the cluster terminates and is unavailable for restarting when the job is complete. On the other hand, if you schedule a job to run on an Existing All-Purpose Cluster that has been terminated, that cluster will autostart.

Important

If you are using a Trial Premium workspace, all running clusters are terminated:

  • When you upgrade a workspace to full Premium.
  • If the workspace is not upgraded and the trial expires.

Manual termination

You can manually terminate a cluster from the cluster list or the cluster detail page.

  • To terminate a cluster from the cluster list, click the square:

    Terminate cluster in cluster list

  • To terminate a cluster from the cluster detail page, click Terminate:

    Terminate cluster in cluster detail

Automatic termination

You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. If the difference between the current time and the last command run on the cluster is more than the inactivity period specified, Azure Databricks automatically terminates that cluster.

A cluster is considered inactive when all commands on the cluster, including Spark jobs, Structured Streaming, and JDBC calls, have finished executing.
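The inactivity rule can be modeled in a few lines. This is an illustrative model of the behavior described above, not Databricks' implementation; it also reflects that an inactivity period of 0 disables auto termination:

```python
from datetime import datetime, timedelta

def should_auto_terminate(last_command_finished: datetime,
                          now: datetime,
                          inactivity_minutes: int) -> bool:
    """Terminate when the time since the last command finished exceeds the
    configured inactivity period. A period of 0 disables auto termination."""
    if inactivity_minutes == 0:
        return False
    return now - last_command_finished > timedelta(minutes=inactivity_minutes)

now = datetime(2023, 1, 1, 12, 0)
idle = should_auto_terminate(now - timedelta(minutes=121), now, 120)
```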

Warning

  • Clusters do not report activity resulting from the use of DStreams. This means that an autoterminating cluster may be terminated while it is running DStreams. Turn off auto termination for clusters running DStreams or consider using Structured Streaming.
  • The auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs have completed, a cluster may be terminated even if local processes are running.
  • Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.

Configure automatic termination

You configure automatic termination in the Auto Termination field in the Autopilot Options box on the cluster creation page:

Auto termination

Important

The default value of the auto terminate setting depends on whether you choose to create a standard or high concurrency cluster:

  • Standard clusters are configured to terminate automatically after 120 minutes.
  • High concurrency clusters are configured to not terminate automatically.

You can opt out of auto termination by clearing the Auto Termination checkbox or by specifying an inactivity period of 0.

Note

Auto termination is best supported in the latest Spark versions. Older Spark versions have known limitations that can result in inaccurate reporting of cluster activity. For example, clusters running JDBC, R, or streaming commands can report a stale activity time, leading to premature cluster termination. Upgrade to the most recent Spark version to benefit from bug fixes and improvements to auto termination.

Unexpected termination

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination.

For a list of termination reasons and remediation steps, see the Knowledge Base.

Delete a cluster

Deleting a cluster terminates the cluster and removes its configuration.

Warning

You cannot undo this action.

You cannot delete a pinned cluster. To delete a pinned cluster, an administrator must first unpin it.

From the cluster list, click the three-button menu three-button menu and select Delete from the drop down.

Cluster list menu

From the cluster detail page, click more button and select Delete from the drop down.

Cluster detail menu

You can also invoke the Permanent delete API endpoint to programmatically delete a cluster.

Restart a cluster to update it with the latest images

When you restart a cluster, it gets the latest images for the compute resource containers and the VM hosts. It is particularly important to schedule regular restarts for long-running clusters, such as those used to process streaming data.

It is your responsibility to restart all compute resources regularly to keep them up to date with the latest image version.

Important

If you enable the compliance security profile for your account or your workspace, long-running clusters are automatically restarted after 25 days. Databricks recommends that admins restart clusters before they run for 25 days and do so during a scheduled maintenance window. This reduces the risk of an auto-restart disrupting a scheduled job.

You can restart a cluster in multiple ways:

Run a script that determines how many days your clusters have been running, and optionally restart them

If you are a workspace admin, you can run a script that determines how long each of your clusters has been running, and optionally restart them if they are older than a specified number of days. Azure Databricks provides this script as a notebook.

The first lines of the script define configuration parameters:

  • min_age_output: The maximum number of days that a cluster can run. Default is 1.
  • perform_restart: If True, the script restarts clusters with age greater than the number of days specified by min_age_output. The default is False, which identifies the long running clusters but does not restart them.
  • secret_configuration: Replace REPLACE_WITH_SCOPE and REPLACE_WITH_KEY with a secret scope and key name. For more details of setting up the secrets, see the notebook.

Warning

If you set perform_restart to True, the script automatically restarts eligible clusters, which can cause active jobs to fail and reset open notebooks. To reduce the risk of disrupting your workspace’s business critical jobs, plan a scheduled maintenance window and be sure to notify workspace users.
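The notebook's core decision can be modeled as follows. The parameter names come from the list above; the age computation itself is an illustrative sketch, not the notebook's actual code:

```python
from datetime import datetime, timedelta

# Parameter names mirror the notebook's configuration described above.
min_age_output = 1       # maximum number of days a cluster may run
perform_restart = False  # False: only report long-running clusters

def clusters_older_than(start_times: dict, now: datetime, max_age_days: int) -> list:
    """Return the IDs of clusters that have run longer than max_age_days."""
    cutoff = timedelta(days=max_age_days)
    return [cid for cid, started in start_times.items() if now - started > cutoff]

now = datetime(2023, 6, 15, 9, 0)
starts = {
    "0101-old-cluster": now - timedelta(days=3),   # hypothetical cluster IDs
    "0202-new-cluster": now - timedelta(hours=2),
}
stale = clusters_older_than(starts, now, min_age_output)
# With perform_restart=False, a script would only report `stale`, not restart them.
```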

Identify and optionally restart long-running clusters notebook

Get notebook

View cluster information in the Apache Spark UI

You can view detailed information about Spark jobs in the Spark UI, which you can access from the Spark UI tab on the cluster details page.

Spark UI

You can get details about active and terminated clusters.

If you restart a terminated cluster, the Spark UI displays information for the restarted cluster, not the historical information for the terminated cluster.

View cluster logs

Azure Databricks provides three kinds of logging of cluster-related activity:

  • Cluster event logs, which capture cluster lifecycle events
  • Apache Spark driver and worker logs
  • Init-script logs

This section discusses cluster event logs and driver and worker logs. For details about init-script logs, see Init script logs.

Cluster event logs

The cluster event log displays important cluster lifecycle events that are triggered manually by user actions or automatically by Azure Databricks. Such events affect the operation of a cluster as a whole and the jobs running in the cluster.

For supported event types, see the REST API ClusterEventType data structure.

Events are stored for 60 days, which is comparable to other data retention times in Azure Databricks.

View a cluster event log

  1. Click compute icon Compute in the sidebar.

  2. Click a cluster name.

  3. Click the Event Log tab.

    Event log

To filter the events, click the Menu Dropdown in the Filter by Event Type… field and select one or more event type checkboxes.

Use Select all to make it easier to filter by excluding particular event types.

Filter event log

View event details

For more information about an event, click its row in the log and then click the JSON tab for details.

Event details

Cluster driver and worker logs

Direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs. These logs have three outputs:

  • Standard output
  • Standard error
  • Log4j logs

You can access these files from the Driver logs tab on the cluster details page. Click the name of a log file to download it.

To view Spark worker logs, you can use the Spark UI. You can also configure a log delivery location for the cluster. Both worker and cluster logs are delivered to the location you specify.

Monitor performance

To help you monitor the performance of Azure Databricks clusters, Azure Databricks provides access to Ganglia metrics from the cluster details page.

In addition, you can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure.

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account.

Ganglia metrics

To access the Ganglia UI, navigate to the Metrics tab on the cluster details page. CPU metrics are available in the Ganglia UI for all Databricks runtimes. GPU metrics are available for GPU-enabled clusters.

Ganglia metrics

To view live metrics, click the Ganglia UI link.

To view historical metrics, click a snapshot file. The snapshot contains aggregated metrics for the hour preceding the selected time.

Configure metrics collection

By default, Azure Databricks collects Ganglia metrics every 15 minutes. To configure the collection period, set the DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES environment variable using an init script or in the spark_env_vars field in the Cluster Create API.
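For example, to collect snapshots every 5 minutes via the Cluster Create API, the spark_env_vars field of the request body might look like this (other required fields of the create request are omitted):

```python
import json

# Collect Ganglia snapshots every 5 minutes instead of the 15-minute default.
create_request_fragment = {
    "spark_env_vars": {
        "DATABRICKS_GANGLIA_SNAPSHOT_PERIOD_MINUTES": "5",
    }
}

# Merge this fragment into the full /api/2.0/clusters/create request body.
body = json.dumps(create_request_fragment)
```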

Azure Monitor

You can configure an Azure Databricks cluster to send metrics to a Log Analytics workspace in Azure Monitor, the monitoring platform for Azure. For complete instructions, see Monitoring Azure Databricks.

Note

If you have deployed the Azure Databricks workspace in your own virtual network and you have configured network security groups (NSG) to deny all outbound traffic that is not required by Azure Databricks, then you must configure an additional outbound rule for the “AzureMonitor” service tag.

Datadog metrics

Datadog metrics

You can install Datadog agents on cluster nodes to send Datadog metrics to your Datadog account. The following notebook demonstrates how to install a Datadog agent on a cluster using a cluster-scoped init script.

To install the Datadog agent on all clusters, use a global init script after testing the cluster-scoped init script.

Install Datadog agent init script notebook

Get notebook

Decommission spot instances

Note

This feature is available on Databricks Runtime 8.0 and above.

Because spot instances can reduce costs, creating clusters using spot instances rather than on-demand instances is a common way to run jobs. However, spot instances can be preempted by cloud provider scheduling mechanisms. Preemption of spot instances can cause issues with jobs that are running, including:

  • Shuffle fetch failures
  • Shuffle data loss
  • RDD data loss
  • Job failures

You can enable decommissioning to help address these issues. Decommissioning takes advantage of the notification that the cloud provider usually sends before a spot instance is decommissioned. When a spot instance containing an executor receives a preemption notification, the decommissioning process will attempt to migrate shuffle and RDD data to healthy executors. The duration before the final preemption is typically 30 seconds to 2 minutes, depending on the cloud provider.

Databricks recommends enabling data migration when decommissioning is also enabled. Generally, as more data is migrated, the likelihood of errors decreases, including shuffle fetch failures, shuffle data loss, and RDD data loss. Data migration can also reduce re-computation and save cost.

Decommissioning is best effort and does not guarantee that all data can be migrated before final preemption. Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from the executor.

With decommissioning enabled, task failures caused by spot instance preemption are not added to the total number of failed attempts, because the cause of the failure is external to the task. Such failures therefore do not result in job failure.

To enable decommissioning, you set Spark configuration settings and environment variables when you create a cluster:

  • To enable decommissioning for applications:

    spark.decommission.enabled true
    
  • To enable shuffle data migration during decommissioning:

    spark.storage.decommission.enabled true
    spark.storage.decommission.shuffleBlocks.enabled true
    
  • To enable RDD cache data migration during decommissioning:

    Note

    When RDD StorageLevel replication is set to more than 1, Databricks does not recommend enabling RDD data migration since the replicas ensure RDDs will not lose data.

    spark.storage.decommission.enabled true
    spark.storage.decommission.rddBlocks.enabled true
    
  • To enable decommissioning for workers:

    SPARK_WORKER_OPTS="-Dspark.decommission.enabled=true"
    

To set these custom Spark configuration properties:

  1. On the New Cluster page, click the Advanced Options toggle.

  2. Click the Spark tab.

    Decommission Config
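For programmatic cluster creation, the same settings can be collected into the spark_conf and spark_env_vars fields of a Cluster Create API request. A sketch, assuming all migration options from the section above are wanted:

```python
# All decommissioning settings from this section, in the shape expected by
# the Cluster Create API's spark_conf and spark_env_vars fields.
decommission_spark_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
}

# Worker-side setting, passed as an environment variable.
decommission_env_vars = {
    "SPARK_WORKER_OPTS": "-Dspark.decommission.enabled=true",
}
```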

To access a worker’s decommission status from the UI, navigate to the Spark Cluster UI - Master tab:

Decommission Worker In UI

When the decommissioning finishes, the executor that decommissioned shows the loss reason in the Spark UI > Executors tab on the cluster’s details page:

Decommission Executor In UI

Decommission Executor In Timeline