Cluster Configurations

This article explains the configuration options available when you create and edit Azure Databricks clusters.

Note

This article focuses on creating and editing clusters using the UI. For information about setting cluster configurations using the Databricks CLI and REST API, see Clusters CLI and Clusters API.

Cluster mode

Azure Databricks supports two cluster modes: standard and high concurrency. The default cluster mode is standard.

Note

The cluster configuration includes an auto terminate setting whose default value depends on whether you are creating a standard or high concurrency cluster:

  • Standard clusters are configured to terminate automatically after 120 minutes.
  • High concurrency clusters are configured to not terminate automatically.

Standard clusters

Standard clusters are recommended for a single user. Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL.

High concurrency clusters

A high concurrency cluster is a managed cloud resource. The key benefit of high concurrency clusters is that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. This sharing is accomplished with:

  • Preemption: Proactively preempts Spark tasks from over-committed users to ensure all users get their fair share of cluster time and their jobs complete in a timely manner even when contending with dozens of other users. This uses Spark Task Preemption for High Concurrency.
  • Fault isolation: Creates an environment for each notebook, effectively isolating them from one another.

Note

  • High concurrency clusters work only for SQL, Python, and R. The performance, security, and fault isolation of high concurrency clusters are provided by running user code in separate processes, which is not possible in Scala.
  • The Table Access Control checkbox is available only for high concurrency clusters.

To create a high concurrency cluster, in the Cluster Mode drop-down select High Concurrency.

For an example of how to create a high concurrency cluster using the Clusters API, see High concurrency cluster example.
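
For reference, here is a minimal sketch of such a request made with Python and the requests library. The workspace URL, token, runtime version, and node type are placeholders you must replace, and the spark.databricks.cluster.profile property is assumed to be the setting that selects high concurrency mode; confirm the exact fields against the linked example.

import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                                # placeholder

# Sketch of a high concurrency cluster request; adjust the runtime version and
# node type to values available in your workspace.
payload = {
    "cluster_name": "high-concurrency-demo",
    "spark_version": "5.5.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {
        "spark.databricks.cluster.profile": "serverless"
    },
}

response = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=payload,
)
print(response.json())   # returns the new cluster_id on success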

Pool

Important

This feature is in Public Preview.

To reduce cluster start time, you can attach a cluster to a predefined pool of idle instances. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.

See Use a Pool to learn more about working with pools in Azure Databricks.
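
As a rough illustration, a cluster create request that draws its nodes from a pool is assumed to reference the pool through the instance_pool_id field of the Clusters API; the values below are placeholders.

# Fragment of a Clusters API create request for a pool-backed cluster (sketch).
payload = {
    "cluster_name": "pool-backed-cluster",
    "spark_version": "5.5.x-scala2.11",      # placeholder runtime version
    "instance_pool_id": "<pool-id>",         # ID of the predefined pool of idle instances
    "num_workers": 4,
    # node_type_id is omitted here on the assumption that the node type
    # comes from the pool definition.
}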

Databricks Runtime

Databricks runtimes are the set of core components that run on your clusters. All Databricks runtimes include Apache Spark and add components and updates that improve usability, performance, and security.

Azure Databricks offers several types of runtimes and several versions of those runtime types in the Databricks Runtime Version drop-down when you create or edit a cluster.

For details, see Databricks Runtimes.

Python version

Important

In anticipation of the upcoming end of life of Python 2, announced for 2020, Python 2 is not supported in Databricks Runtime 6.0 and above. Databricks Runtime 5.5 and below continue to support Python 2.

Python clusters running Databricks Runtime 6.0 and above

Databricks Runtime 6.0 and above supports only Python 3. For major changes related to the Python environment introduced by Databricks Runtime 6.0, see Python environment in the release notes.

Python clusters running Databricks Runtime with Conda (Beta)

Databricks Runtime with Conda supports only Python 3. For information about how to use Databricks Runtime with Conda, see Databricks Runtime with Conda.

Python clusters running Databricks Runtime 5.5 and below

For Databricks Runtime 5.5 and below, Spark jobs, Python notebook cells, and library installation all support both Python 2 and 3.

The default Python version for clusters created using the UI is Python 3. In Databricks Runtime 5.5 and below the default version for clusters created using the REST API is Python 2.

Specify Python version

To specify the Python version when you create a cluster using the UI, select it from the Python Version drop-down.

To specify the Python version when you create a cluster using the API, set the environment variable PYSPARK_PYTHON to /databricks/python/bin/python or /databricks/python3/bin/python3. For an example, see the REST API example Create a Python 3 cluster (Databricks Runtime 5.5 and below).
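
A minimal sketch of the relevant request fragment, assuming you pass the variable in the spark_env_vars field of a Create cluster request:

# Pin the cluster to Python 3 on Databricks Runtime 5.5 and below (sketch).
payload_fragment = {
    "spark_env_vars": {
        # Use /databricks/python/bin/python instead to select Python 2.
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
}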

To validate that the PYSPARK_PYTHON configuration took effect, in a Python notebook (or %python cell) run:

import sys
print(sys.version)

If you specified /databricks/python3/bin/python3, it should print something like:

3.5.2 (default, Sep 10 2016, 08:21:44)
[GCC 5.4.0 20160609]

Important

For Databricks Runtime 5.5 and below, when you run %sh python --version in a notebook, python refers to the Ubuntu system Python version, which is Python 2. Use /databricks/python/bin/python to refer to the version of Python used by Databricks notebooks and Spark: this path is automatically configured to point to the correct Python executable.

Frequently asked questions (FAQ)

Can I use both Python 2 and Python 3 notebooks on the same cluster?

No. The Python version is a cluster-wide setting and is not configurable on a per-notebook basis.

What libraries are installed on Python clusters?

For details on the specific libraries that are installed, see the Databricks Runtime Release Notes.

Will my existing PyPI libraries work with Python 3?

It depends on whether the library version supports the Python 3 version of your Databricks Runtime. Databricks Runtime 5.5 and below use Python 3.5. Databricks Runtime 6.0 and above and Databricks Runtime with Conda use Python 3.7. It is possible that a specific old version of a Python library is not forward compatible with Python 3.7. In that case, you need to use a newer version of the library.

Will my existing .egg libraries work with Python 3?

It depends on whether your existing egg library is cross-compatible with both Python 2 and 3. If the library does not support Python 3, either library attachment will fail or runtime errors will occur.

For a comprehensive guide on porting code to Python 3 and writing code compatible with both Python 2 and 3, see Supporting Python 3.

Can I still install Python libraries using init scripts?

A common use case for Cluster Node Initialization Scripts is to install packages. For Databricks Runtime 5.5 and below, use /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For Databricks Runtime 6.0 and above, and Databricks Runtime with Conda, the pip command refers to the pip in the correct Python virtual environment. However, if you use an init script to create the Python virtual environment, always use the absolute path to access python and pip.
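
As an illustration only, the following Python cell writes such a script to DBFS; the script path and the package it installs are hypothetical, and the script still needs to be attached to the cluster on the Init Scripts tab.

# Sketch: write an init script that installs a package with the absolute pip path
# (Databricks Runtime 5.5 and below). The path and package are placeholders.
dbutils.fs.put("dbfs:/databricks/scripts/install-packages.sh", """#!/bin/bash
# Install into the Databricks Python virtual environment, not the system Python.
/databricks/python/bin/pip install simplejson
""", True)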

Cluster node type

A cluster consists of one driver node and worker nodes. You can pick separate cloud provider instance types for the driver and worker nodes, although by default the driver node uses the same instance type as the worker node. Different families of instance types fit different use cases, such as memory-intensive or compute-intensive workloads.

Driver node

The driver maintains state information of all notebooks attached to the cluster. The driver node is also responsible for maintaining the SparkContext and interpreting all the commands you run from a notebook or a library on the cluster. The driver node also runs the Apache Spark master that coordinates with the Spark executors.

The default value of the driver node type is the same as the worker node type. You can choose a larger driver node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze it in the notebook.

Tip

Since the driver node maintains all of the state information of the notebooks attached, make sure to detach unused notebooks from the driver.

Worker node

Azure Databricks workers run the Spark executors and other services required for the proper functioning of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on workers.

Tip

To run a Spark job, you need at least one worker. If a cluster has zero workers, you can run non-Spark commands on the driver, but Spark commands will fail.

GPU instance types

For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This support is in Beta. For more information, see GPU-enabled Clusters.

Cluster size and autoscaling

When you create an Azure Databricks cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers.

When you provide a fixed-size cluster, Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Azure Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.

With autoscaling, Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.

Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Note

  • Autoscaling works best with Databricks Runtime 3.4 and above.
  • Autoscaling is not available for spark-submit jobs.

Autoscaling types

Azure Databricks offers two types of cluster node autoscaling: standard and optimized. For a discussion of the benefits of optimized autoscaling, see the blog post on Optimized Autoscaling.

Automated (job) clusters always use optimized autoscaling. The type of autoscaling performed on interactive clusters depends on the workspace configuration.

Standard autoscaling is used by interactive clusters in workspaces in the Standard pricing tier. Optimized autoscaling is used by interactive clusters in workspaces in the Premium pricing tier.

How autoscaling behaves

Autoscaling behaves differently depending on whether it is optimized or standard and whether applied to an interactive or an automated cluster.

Optimized

  • Scales up from min to max in 2 steps.
  • Can scale down even if the cluster is not idle by looking at shuffle file state.
  • Scales down based on a percentage of current nodes.
  • On automated clusters, scales down if the cluster is underutilized over the last 40 seconds.
  • On interactive clusters, scales down if the cluster is underutilized over the last 150 seconds.

Standard

  • Starts with adding 4 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max.
  • Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes.
  • Scales down exponentially, starting with 1 node.

Enable and configure autoscaling

To allow Azure Databricks to resize your cluster automatically, you enable autoscaling for the cluster and provide the min and max range of workers.

  1. Enable autoscaling.

    • Interactive cluster - On the Create Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.

    • Automated cluster - On the Configure Cluster page, select the Enable autoscaling checkbox in the Autopilot Options box.

  2. Configure the min and max workers.

Important

If you are using an instance pool:

  • Make sure the cluster size requested is less than or equal to the minimum number of idle instances in the pool. If it is larger, cluster startup time will be equivalent to a cluster that doesn’t use a pool.
  • Make sure the maximum cluster size is less than or equal to the maximum capacity of the pool. If it is larger, the cluster creation will fail.
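
If you configure the cluster through the Clusters API instead of the UI, the equivalent settings are assumed to go in the autoscale field in place of num_workers; the bounds below mirror the example in the next section.

# Fragment of a Clusters API request with autoscaling enabled (sketch).
payload_fragment = {
    "autoscale": {
        "min_workers": 5,
        "max_workers": 10
    }
}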

Autoscaling example

If you reconfigure a static cluster to be an autoscaling cluster, Azure Databricks immediately resizes the cluster within the minimum and maximum bounds and then starts autoscaling. As an example, the table below demonstrates what happens to clusters with a certain initial size if you reconfigure a cluster to autoscale between 5 and 10 nodes.

Initial size    Size after reconfiguration
6               6
12              10
3               5

Autoscaling local storage

It can often be difficult to estimate how much disk space a particular job will take. To save you from having to estimate how many gigabytes of managed disk to attach to your cluster at creation time, Azure Databricks automatically enables autoscaling local storage on all Azure Databricks clusters.

With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run too low on disk, Databricks automatically attaches a new managed disk to the worker before it runs out of disk space. Disks are attached up to a limit of 5 TB of total disk space per virtual machine (including the virtual machine’s initial local storage).

The managed disks attached to a virtual machine are detached only when the virtual machine is returned to Azure. That is, managed disks are never detached from a virtual machine as long as it is part of a running cluster. To scale down managed disk usage, Azure Databricks recommends using this feature in a cluster configured with Cluster size and autoscaling or Automatic termination.

Spark configuration

To fine-tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration.

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. Click the Spark tab.

When you configure a cluster using the Clusters API, set Spark properties in the spark_conf field in the Create cluster request or Edit cluster request.
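
For illustration, a request fragment might look like the following sketch; the properties shown are examples, not recommendations.

# Custom Spark properties in a Clusters API create or edit request (sketch).
payload_fragment = {
    "spark_conf": {
        "spark.sql.sources.partitionOverwriteMode": "DYNAMIC",
        "spark.speculation": "true"
    }
}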

To set Spark properties for all clusters, create a global init script:

%scala
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
  |#!/bin/bash
  |
  |cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
  |[driver] {
  |  "spark.sql.sources.partitionOverwriteMode" = "DYNAMIC"
  |}
  |EOF
  """.stripMargin, true)

Environment variables

You can set environment variables that you can access from scripts running on a cluster.

  1. On the cluster configuration page, click the Advanced Options toggle.
  2. Click the Spark tab.
  3. Set the environment variables in the Environment Variables field.

You can also set environment variables using the spark_env_vars field in the Create cluster request or Edit cluster request Clusters API endpoints.
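
A variable set either way can then be read from notebook or script code running on the cluster; MY_ETL_ENV below is a hypothetical variable name used only for illustration.

import os

# Read a cluster-level environment variable (sketch); falls back to "dev" if it is unset.
env_name = os.environ.get("MY_ETL_ENV", "dev")
print("Running against the " + env_name + " environment")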

Note

The environment variables you set in this field are not available in Cluster Node Initialization Scripts. Init scripts support only a limited set of predefined Environment variables.

Cluster tags

Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes.

For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId. You can add custom tags when you create a cluster. To configure cluster tags:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Tags tab.

  3. Add a key-value pair for each custom tag. You can add up to 43 custom tags.

Custom tags are displayed on Azure bills and updated whenever you add, edit, or delete a custom tag.
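
When you create clusters through the Clusters API, tags are assumed to be passed as key-value pairs in the custom_tags field; the tag names and values below are examples only.

# Custom cluster tags in a Clusters API create request (sketch).
payload_fragment = {
    "custom_tags": {
        "CostCenter": "data-engineering",
        "Project": "churn-model"
    }
}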

SSH access to clusters

SSH allows you to log into Apache Spark clusters remotely for advanced troubleshooting and installing custom software.

For security reasons, in Azure Databricks the SSH port is closed by default. If you want to enable SSH access to your Spark clusters, contact Azure Databricks support.

Cluster log delivery

When you create a cluster, you can specify a location to deliver Spark driver and worker logs. Logs are delivered every five minutes to your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated.

The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery, cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.

To configure the log delivery location:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Logging tab.

  3. Select a destination type.

  4. Enter the cluster log path.

Note

This feature is also available in the REST API. See Clusters API and Cluster log delivery examples.
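
As a rough sketch, a DBFS log destination in a Create cluster request is assumed to be expressed with the cluster_log_conf field; the path below is a placeholder.

# Cluster log delivery to DBFS in a Clusters API create request (sketch).
payload_fragment = {
    "cluster_log_conf": {
        "dbfs": {
            "destination": "dbfs:/cluster-log-delivery"
        }
    }
}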

Init scripts

A cluster node initialization—or init—script is a shell script that runs during startup for each cluster node before the Spark driver or worker JVM starts. You can use init scripts to install packages and libraries not included in the Databricks runtime, modify the JVM system classpath, set system properties and environment variables used by the JVM, or modify Spark configuration parameters, among other configuration tasks.

You can attach init scripts to a cluster by expanding the Advanced Options section and clicking the Init Scripts tab.

For detailed instructions, see Cluster Node Initialization Scripts.
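
If you configure clusters through the Clusters API, init scripts are assumed to be referenced in the init_scripts field; the DBFS path below is a placeholder for a script you have already uploaded.

# Attaching a cluster-scoped init script via the Clusters API (sketch).
payload_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/install-packages.sh"}}
    ]
}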