Best practices: Cluster configuration

Azure Databricks provides a number of options when you create and configure clusters to help you get the best performance at the lowest cost. This flexibility, however, can create challenges when you’re trying to determine optimal configurations for your workloads. Carefully considering how users will utilize clusters will help guide configuration options when you create new clusters or configure existing clusters. Some of the things to consider when determining configuration options are:

  • What type of user will be using the cluster? A data scientist may be running different job types with different requirements than a data engineer or data analyst.
  • What types of workloads will users run on the cluster? For example, batch extract, transform, and load (ETL) jobs will likely have different requirements than analytical workloads.
  • What service level agreement (SLA) do you need to meet?
  • What budget constraints do you have?

This article provides cluster configuration recommendations for different scenarios based on these considerations. This article also discusses specific features of Azure Databricks clusters and the considerations to keep in mind for those features.

Your configuration decisions will require a tradeoff between cost and performance. The primary cost of a cluster includes the Databricks Units (DBUs) consumed by the cluster and the cost of the underlying resources needed to run the cluster. What may not be obvious are the secondary costs such as the cost to your business of not meeting an SLA, decreased employee efficiency, or possible waste of resources because of poor controls.
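As a rough illustration of the primary cost described above, the following Python sketch adds the DBU charges to the cost of the underlying instances. The function name, the DBU rate per node, and all prices are hypothetical placeholders, not published rates. Substitute the figures that apply to your own subscription.

```python
def estimate_primary_cost(
    num_nodes: int,
    hours: float,
    dbu_per_node_hour: float,
    price_per_dbu: float,
    vm_price_per_hour: float,
) -> float:
    """Estimate the primary cost of a cluster run.

    Primary cost = DBU charges + cost of the underlying instances.
    Every rate is a caller-supplied assumption.
    """
    dbu_cost = num_nodes * hours * dbu_per_node_hour * price_per_dbu
    vm_cost = num_nodes * hours * vm_price_per_hour
    return dbu_cost + vm_cost


# Hypothetical example: 5 nodes (1 driver + 4 workers) running for 2 hours.
cost = estimate_primary_cost(
    num_nodes=5,
    hours=2.0,
    dbu_per_node_hour=1.5,
    price_per_dbu=0.40,
    vm_price_per_hour=0.60,
)
print(f"Estimated primary cost: {cost:.2f}")
```

Secondary costs, such as a missed SLA, are not captured by a calculation like this and have to be weighed separately.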

Cluster features

Before discussing more detailed cluster configuration scenarios, it’s important to understand some features of Azure Databricks clusters and how best to use those features.

All-purpose clusters and job clusters

When you create a cluster you select a cluster type: an all-purpose cluster or a job cluster. All-purpose clusters can be shared by multiple users and are best for performing ad-hoc analysis, data exploration, or development. Once you’ve completed implementing your processing and are ready to operationalize your code, switch to running it on a job cluster. Job clusters terminate when your job ends, reducing resource usage and cost.

Cluster mode

Azure Databricks supports three cluster modes: Standard, High Concurrency, and Single Node. Most regular users use Standard or Single Node clusters.

  • Standard clusters are ideal for processing large amounts of data with Apache Spark.
  • Single Node clusters are intended for jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries.
  • High Concurrency clusters are ideal for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters. Databricks recommends enabling autoscaling for High Concurrency clusters.
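To make the Single Node mode more concrete, here is a minimal sketch of a cluster specification with a driver and no workers. The field names and Spark configuration keys reflect the Clusters API as commonly documented, but treat them as assumptions and verify them against the API reference for your workspace. The helper name and the placeholder values are hypothetical.

```python
def single_node_cluster_spec(
    cluster_name: str, spark_version: str, node_type_id: str
) -> dict:
    """Build a request body for a Single Node cluster (driver only)."""
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # A Single Node cluster has no worker nodes.
        "num_workers": 0,
        # Assumed settings that mark the cluster as Single Node and run
        # Spark locally on the driver.
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


# Placeholder values; replace with a runtime version and VM size
# that exist in your workspace.
spec = single_node_cluster_spec(
    "small-ml-experiments", "<runtime-version>", "<node-type>"
)
print(spec["num_workers"], spec["spark_conf"]["spark.master"])
```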

On-demand and spot instances

To save cost, Azure Databricks supports creating clusters using a combination of on-demand and spot instances. You can use spot instances to take advantage of unused capacity on Azure to reduce the cost of running your applications, grow your application’s compute capacity, and increase throughput.
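One way to express a mix of on-demand and spot instances is through the `azure_attributes` block of a cluster specification. The sketch below assumes the field names `first_on_demand`, `availability`, and `spot_bid_max_price` and the availability values shown. Confirm them against the Clusters API reference before relying on them. The helper function itself is hypothetical.

```python
def hybrid_azure_attributes(
    first_on_demand: int, fall_back_to_on_demand: bool = True
) -> dict:
    """Return Azure attributes for a cluster mixing on-demand and spot nodes.

    The first `first_on_demand` nodes (starting with the driver) use
    on-demand instances; the remaining nodes use spot instances.
    """
    if first_on_demand < 1:
        # Keep at least the driver on an on-demand instance.
        raise ValueError("first_on_demand must be at least 1")
    availability = (
        "SPOT_WITH_FALLBACK_AZURE" if fall_back_to_on_demand else "SPOT_AZURE"
    )
    return {
        "first_on_demand": first_on_demand,
        "availability": availability,
        # Assumed convention: -1 means "do not evict based on price".
        "spot_bid_max_price": -1,
    }


attrs = hybrid_azure_attributes(first_on_demand=2)
print(attrs)
```

Falling back to on-demand capacity trades some of the savings for a lower risk of the cluster being short of nodes when spot capacity is unavailable.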

Autoscaling

Autoscaling allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective, but it can be challenging to understand when and how to use autoscaling. The following are some considerations for determining whether to use autoscaling and how to get the most benefit:

  • Autoscaling typically reduces costs compared to a fixed-size cluster.
  • Autoscaling workloads can run faster compared to an under-provisioned fixed-size cluster.
  • Some workloads are not compatible with autoscaling clusters, including spark-submit jobs and some Python packages.
  • With single-user all-purpose clusters, users may find autoscaling is slowing down their development or analysis when the minimum number of workers is set too low. This is because the commands or queries they’re running are often several minutes apart, time in which the cluster is idle and may scale down to save on costs. When the next command is executed, the cluster manager will attempt to scale up, taking a few minutes while retrieving instances from the cloud provider. During this time, jobs might run with insufficient resources, slowing the time to retrieve results. While increasing the minimum number of workers helps, it also increases cost. This is another example where cost and performance need to be balanced.
  • If Delta Caching is being used, it’s important to remember that any cached data on a node is lost if that node is terminated. If retaining cached data is important for your workload, consider using a fixed-size cluster.
  • If you have a job cluster running an ETL workload, you can sometimes size your cluster appropriately when tuning if you know your job is unlikely to change. However, autoscaling gives you flexibility if your data sizes increase. It’s also worth noting that optimized autoscaling can reduce expense with long-running jobs if there are long periods when the cluster is underutilized or waiting on results from another process. Once again, though, your job may experience minor delays as the cluster attempts to scale up appropriately. If you have tight SLAs for a job, a fixed-size cluster may be a better choice; alternatively, consider using an Azure Databricks pool to reduce cluster start times.

Azure Databricks also supports autoscaling local storage. With autoscaling local storage, Azure Databricks monitors the amount of free disk space available on your cluster’s Spark workers. If a worker begins to run low on disk, Azure Databricks automatically attaches a new managed volume to the worker before it runs out of disk space.
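The autoscaling options discussed in this section could be captured in a cluster specification along the following lines. This is a sketch only: the `autoscale` and `enable_elastic_disk` field names are assumed from the Clusters API, and the helper function and worker counts are hypothetical.

```python
def autoscaling_settings(
    min_workers: int, max_workers: int, autoscale_local_storage: bool = True
) -> dict:
    """Return the autoscaling portion of a cluster specification."""
    if min_workers < 1:
        raise ValueError("min_workers must be at least 1")
    if max_workers < min_workers:
        raise ValueError("max_workers must be >= min_workers")
    return {
        # The cluster resizes between these bounds based on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Assumed flag for autoscaling local storage.
        "enable_elastic_disk": autoscale_local_storage,
    }


# A higher minimum reduces scale-up delays for interactive users,
# at the price of paying for more idle workers.
settings = autoscaling_settings(min_workers=2, max_workers=8)
print(settings)
```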

Pools

Pools reduce cluster start and scale-up times by maintaining a set of available, ready-to-use instances. Databricks recommends taking advantage of pools to improve processing time while minimizing cost.

Databricks Runtime versions

Databricks recommends using the latest Databricks Runtime version for all-purpose clusters. Using the most current version will ensure you have the latest optimizations and most up-to-date compatibility between your code and preloaded packages.

For job clusters running operational workloads, consider using the Long Term Support (LTS) Databricks Runtime version. Using the LTS version will ensure you don’t run into compatibility issues and can thoroughly test your workload before upgrading. If you have an advanced use case around machine learning or genomics, consider the specialized Databricks Runtime versions.

Cluster policies

Azure Databricks cluster policies allow administrators to enforce controls over the creation and configuration of clusters. Databricks recommends using cluster policies to help apply the recommendations discussed in this guide. Learn more about cluster policies in the cluster policies best practices guide.

Automatic termination

Many users won’t think to terminate their clusters when they’re finished using them. Fortunately, clusters are automatically terminated after a set period, with a default of 120 minutes.

Administrators can change this default setting when creating cluster policies. Decreasing this setting can lower cost by reducing the time that clusters are idle. It’s important to remember that when a cluster is terminated all state is lost, including all variables, temp tables, caches, functions, objects, and so forth. All of this state will need to be restored when the cluster starts again. If a developer steps out for a 30-minute lunch break, it would be wasteful to spend that same amount of time to get a notebook back to the same state as before.

Important

Idle clusters continue to accumulate DBU and cloud instance charges during the inactivity period before termination.
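The tradeoff between idle cost and lost state can be sketched with simple arithmetic. The function below is a hypothetical helper, and the hourly cost is a made-up figure that stands for DBU plus instance charges combined.

```python
def idle_break_outcome(
    break_minutes: float,
    autotermination_minutes: float,
    cluster_cost_per_hour: float,
) -> tuple:
    """Return (idle_cost, state_lost) for a period of inactivity.

    The cluster is billed while idle until either the user returns or
    the auto-termination timeout is reached, whichever comes first.
    """
    idle_minutes = min(break_minutes, autotermination_minutes)
    state_lost = break_minutes >= autotermination_minutes
    idle_cost = idle_minutes * cluster_cost_per_hour / 60
    return idle_cost, state_lost


# A 30-minute break with the default 120-minute timeout: the cluster
# stays up, so the idle time is billed but no state is lost.
print(idle_break_outcome(30, 120, cluster_cost_per_hour=12.0))

# The same break with a 20-minute timeout: less idle time is billed,
# but the notebook state has to be rebuilt afterwards.
print(idle_break_outcome(30, 20, cluster_cost_per_hour=12.0))
```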

Garbage collection

While it may be less obvious than other considerations discussed in this article, paying attention to garbage collection can help optimize job performance on your clusters. Providing a large amount of RAM can help jobs perform more efficiently but can also lead to delays during garbage collection.

To minimize the impact of long garbage collection sweeps, avoid deploying clusters with large amounts of RAM configured for each instance. Having more RAM allocated to the executor will lead to longer garbage collection times. Instead, configure instances with smaller RAM sizes, and deploy more instances if you need more memory for your jobs. However, there are cases where fewer nodes with more RAM are recommended, for example, workloads that require a lot of shuffles, as discussed in Cluster sizing considerations.

Cluster access control

You can configure two types of cluster permissions:

  • The Allow Cluster Creation permission controls the ability of users to create clusters.
  • Cluster-level permissions control the ability to use and modify a specific cluster.

To learn more about configuring cluster permissions, see cluster access control.

You can create a cluster if you have either cluster create permissions or access to a cluster policy, which allows you to create any cluster within the policy’s specifications. The cluster creator is the owner and has Can Manage permissions, which will enable them to share it with any other user within the constraints of the data access permissions of the cluster.
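As an illustration of cluster-level permissions, the sketch below builds an access control list of the kind a cluster owner with Can Manage permission might apply when sharing a cluster. The payload shape and the permission level names are assumptions based on the Permissions API and should be checked against its reference. The helper function and the user names are hypothetical.

```python
# Assumed cluster-level permission levels, from least to most privileged.
CLUSTER_PERMISSION_LEVELS = ("CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE")


def cluster_acl(grants: dict) -> dict:
    """Build an access control list from {user_name: permission_level}."""
    entries = []
    for user_name, level in grants.items():
        if level not in CLUSTER_PERMISSION_LEVELS:
            raise ValueError(f"unknown permission level: {level}")
        entries.append({"user_name": user_name, "permission_level": level})
    return {"access_control_list": entries}


acl = cluster_acl(
    {
        "analyst@example.com": "CAN_ATTACH_TO",
        "engineer@example.com": "CAN_RESTART",
    }
)
print(acl)
```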

Understanding cluster permissions and cluster policies is important when deciding on cluster configurations for common scenarios.

Cluster tags

Cluster tags allow you to easily monitor the cost of cloud resources used by different groups in your organization. You can specify tags as key-value strings when creating a cluster, and Azure Databricks applies these tags to cloud resources, such as VMs and disk volumes. Learn more about tag enforcement in the cluster policies best practices guide.
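A small sketch of how tags might appear in a cluster specification, together with a check for tags that a team has agreed to require. The `custom_tags` field name is assumed from the Clusters API; the required keys and the helper function are hypothetical.

```python
# Hypothetical tagging convention for cost attribution.
REQUIRED_TAG_KEYS = frozenset({"team", "cost_center"})


def missing_required_tags(cluster_spec: dict) -> set:
    """Return the required tag keys that the specification does not set."""
    tags = cluster_spec.get("custom_tags") or {}
    return set(REQUIRED_TAG_KEYS) - set(tags)


spec = {
    "cluster_name": "analytics-shared",
    "custom_tags": {"team": "analytics"},
}
print(sorted(missing_required_tags(spec)))  # prints ['cost_center']
```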

Cluster sizing considerations

Azure Databricks runs one executor per worker node. Therefore, the terms executor and worker are used interchangeably in the context of the Azure Databricks architecture. People often think of cluster size in terms of the number of workers, but there are other important factors to consider:

  • Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of a cluster.
  • Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling it to disk.
  • Executor local storage: The type and amount of local disk storage. Local disk is primarily used in the case of spills during shuffles and caching.

Additional considerations include worker instance type and size, which also influence the factors above. When sizing your cluster, consider:

  • How much data will your workload consume?
  • What’s the computational complexity of your workload?
  • Where are you reading data from?
  • How is the data partitioned in external storage?
  • How much parallelism do you need?

Answering these questions will help you determine optimal cluster configurations based on workloads. For simple ETL style workloads that use narrow transformations only (transformations where each input partition will contribute to only one output partition), focus on a compute-optimized configuration. If you expect a lot of shuffles, then the amount of memory is important, as well as storage to account for data spills. Fewer large instances can reduce network I/O when transferring data between machines during shuffle-heavy workloads.
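The effect of worker count on shuffle traffic can be illustrated with a toy model. This is a simplified simulation, not a description of how Spark schedules shuffles: it assumes records start evenly spread across workers and that shuffle keys hash uniformly, in which case roughly 1 - 1/N of the records must move to a different worker on an N-worker cluster. The function name and record counts are hypothetical.

```python
def cross_worker_fraction(num_records: int, num_workers: int) -> float:
    """Fraction of records that change worker in a simulated hash shuffle.

    Record i starts on worker (i % num_workers). Its synthetic key,
    (i // num_workers), sends it to worker (key % num_workers).
    """
    moved = 0
    for i in range(num_records):
        source = i % num_workers
        key = i // num_workers
        target = key % num_workers
        if source != target:
            moved += 1
    return moved / num_records


# Same data volume, different cluster shapes.
print(cross_worker_fraction(6400, num_workers=2))  # prints 0.5
print(cross_worker_fraction(6400, num_workers=8))  # prints 0.875
```

Under these assumptions, the two-worker cluster moves half of the records over the network while the eight-worker cluster moves seven-eighths of them. A narrow transformation, by contrast, moves none, because each partition is processed where it already resides.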

There’s a balancing act between the number of workers and the size of worker instance types. A cluster with two workers, each with 40 cores and 100 GB of RAM, has the same compute and memory as an eight-worker cluster in which each worker has 10 cores and 25 GB of RAM.
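The arithmetic behind this comparison can be checked with a few lines of Python. The class below is a hypothetical helper that multiplies per-worker resources by the worker count; it ignores the driver node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    """Worker count and per-worker resources (driver not included)."""

    workers: int
    cores_per_worker: int
    ram_gb_per_worker: int

    @property
    def total_cores(self) -> int:
        # Total executor cores determine the maximum parallelism.
        return self.workers * self.cores_per_worker

    @property
    def total_ram_gb(self) -> int:
        # Total executor memory determines how much data fits in RAM.
        return self.workers * self.ram_gb_per_worker


few_large = ClusterShape(workers=2, cores_per_worker=40, ram_gb_per_worker=100)
many_small = ClusterShape(workers=8, cores_per_worker=10, ram_gb_per_worker=25)

for name, shape in (("few_large", few_large), ("many_small", many_small)):
    print(name, shape.total_cores, "cores,", shape.total_ram_gb, "GB RAM")
```

Both shapes total 80 cores and 200 GB of RAM. They differ in how much data has to cross the network during shuffles and in how large each executor's heap is.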

If you expect many re-reads of the same data, then your workloads may benefit from caching. Consider a storage optimized configuration with Delta Cache.

Cluster sizing examples

The following examples show cluster recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.

Data analysis

Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. A cluster with a smaller number of nodes can reduce the network and disk I/O needed to perform these shuffles. Cluster A in the following diagram is likely the best choice, particularly for clusters supporting a single analyst.

Cluster D will likely provide the worst performance since a larger number of nodes with less memory and storage will require more shuffling of data to complete the processing.

Figure: Data analysis cluster sizing

Analytical workloads will likely require reading the same data repeatedly, so recommended worker types are storage optimized with Delta Cache enabled.

Additional features recommended for analytical workloads include:

  • Enable auto termination to ensure clusters are terminated after a period of inactivity.
  • Consider enabling autoscaling based on the analyst’s typical workload.
  • Consider using pools, which will allow restricting clusters to pre-approved instance types and ensure consistent cluster configurations.

Features that are probably not useful:

  • Storage autoscaling, since this user will probably not produce a lot of data.
  • High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best suited for shared use.

Basic batch ETL

Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit from clusters that are compute-optimized. For these types of workloads, any of the clusters in the following diagram are likely acceptable.

Figure: Basic batch ETL cluster sizing

Compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not require significant memory or storage.

Using a pool might provide a benefit for clusters supporting simple ETL jobs by decreasing cluster launch times and reducing total runtime when running job pipelines. However, since these types of workloads typically run as scheduled jobs where the cluster runs only long enough to complete the job, using a pool might not provide a benefit.

The following features probably aren’t useful:

  • Delta Caching, since re-reading data is not expected.
  • Auto termination probably isn’t required since these are likely scheduled jobs.
  • Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
  • High Concurrency clusters are intended for multiple users and won’t benefit a cluster running a single job.

Complex batch ETL

More complex ETL jobs, such as processing that requires unions and joins across multiple tables, will probably work best when you can minimize the amount of data shuffled. Since reducing the number of workers in a cluster will help minimize shuffles, you should consider a smaller cluster like cluster A in the following diagram over a larger cluster like cluster D.

Figure: Complex ETL cluster sizing

Complex transformations can be compute-intensive, so for some workloads reaching an optimal number of cores may require adding additional nodes to the cluster.

Like simple ETL jobs, compute-optimized worker types are recommended; these will be cheaper, and these workloads will likely not require significant memory or storage. Also, like simple ETL jobs, the main cluster feature to consider is pools to decrease cluster launch times and reduce total runtime when running job pipelines.

The following features probably aren’t useful:

  • Delta Caching, since re-reading data is not expected.
  • Auto termination probably isn’t required since these are likely scheduled jobs.
  • Autoscaling is not recommended since compute and storage should be pre-configured for the use case.
  • High Concurrency clusters are intended for multiple users and won’t benefit a cluster running a single job.

Training machine learning models

Since initial iterations of training a machine learning model are often experimental, a smaller cluster such as cluster A is a good choice. A smaller cluster will also reduce the impact of shuffles.

If stability is a concern, or for more advanced stages, a larger cluster such as cluster B or C may be a good choice.

A large cluster such as cluster D is not recommended due to the overhead of shuffling data between nodes.

Figure: Machine learning cluster sizing

Recommended worker types are storage optimized with Delta Caching enabled to account for repeated reads of the same data and to enable caching of training data. If the compute and storage options provided by storage optimized nodes are not sufficient, consider GPU optimized nodes. A possible downside is the lack of Delta Caching support with these nodes.

Additional features recommended for machine learning workloads include:

  • Enable auto termination to ensure clusters are terminated after a period of inactivity.
  • Use pools, which will allow restricting clusters to pre-approved instance types and ensure consistent cluster configurations.

Features that are probably not useful:

  • Autoscaling, since cached data can be lost when nodes are removed as a cluster scales down. Additionally, typical machine learning jobs will often consume all available nodes, in which case autoscaling will provide no benefit.
  • Storage autoscaling, since this user will probably not produce a lot of data.
  • High Concurrency clusters, since this cluster is for a single user, and High Concurrency clusters are best suited for shared use.

Common scenarios

The following sections provide additional recommendations for configuring clusters for common cluster usage patterns:

  • Multiple users running data analysis and ad-hoc processing.
  • Specialized use cases like machine learning.
  • Scheduled batch jobs.

Multi-user clusters

Scenario

You need to provide multiple users access to data for running data analysis and ad-hoc queries. Cluster usage might fluctuate over time, and most jobs are not very resource-intensive. The users mostly require read-only access to the data and want to perform analyses or create dashboards through a simple user interface.

The recommended approach is to provision the cluster's nodes in a hybrid fashion and enable autoscaling. A hybrid approach involves defining the number of on-demand instances and spot instances for the cluster and enabling autoscaling between the minimum and the maximum number of instances.

Figure: Multi-user scenario

This cluster is always available and shared by the users belonging to a group by default. Enabling autoscaling allows the cluster to scale up and down depending upon the load.

Users do not have permission to start or stop the cluster, but the initial on-demand instances are immediately available to respond to user queries. If a user query requires more capacity, autoscaling automatically provisions more nodes (mostly spot instances) to accommodate the workload.

Azure Databricks has other features to further improve multi-tenancy use cases.

This approach keeps the overall cost down by:

  • Using a shared cluster model.
  • Using a mix of on-demand and spot instances.
  • Using autoscaling to avoid paying for underutilized clusters.

Specialized workloads

Scenario

You need to provide clusters for specialized use cases or teams within your organization, for example, data scientists running complex data exploration and machine learning algorithms. A typical pattern is that a user needs a cluster for a short period to run their analysis.

The best approach for this kind of workload is to create cluster policies that predefine default values, fixed values, and allowed ranges for settings. These settings might include the number of instances, instance types, spot versus on-demand instances, roles, libraries to be installed, and so forth. Cluster policies let users with more advanced requirements quickly spin up clusters that they can configure as needed for their use case, while the policies enforce cost and compliance constraints.

Figure: Specialized workloads

This approach provides more control to users while maintaining the ability to keep cost under control by pre-defining cluster configurations. This also allows you to configure clusters for different groups of users with permissions to access different data sets.

One downside to this approach is that users have to work with administrators for any changes to clusters, such as configuration, installed libraries, and so forth.
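A cluster policy of the kind described in this scenario might look like the following sketch, which combines fixed values, an allowed range with a default, and an allowlist. The rule types (`fixed`, `range`, `allowlist`) and attribute paths are assumed from the cluster policy definition format. The specific limits, VM sizes, and tag value are illustrative, and `policy_violations` is a hypothetical, simplified checker rather than the service's own validation logic.

```python
policy_definition = {
    # Fixed: users cannot change the idle timeout.
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": True},
    # Range with a default: users choose a size within limits.
    "num_workers": {
        "type": "range", "minValue": 1, "maxValue": 8, "defaultValue": 2,
    },
    # Allowlist: only pre-approved instance types.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
        "defaultValue": "Standard_DS3_v2",
    },
    # Fixed tag for cost attribution.
    "custom_tags.team": {"type": "fixed", "value": "data-science"},
}


def _lookup(spec: dict, dotted_key: str):
    """Resolve a dotted attribute path such as 'custom_tags.team'."""
    value = spec
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def policy_violations(spec: dict, policy: dict) -> list:
    """Return the policy keys that a cluster specification violates."""
    problems = []
    for key, rule in policy.items():
        value = _lookup(spec, key)
        if rule["type"] == "fixed":
            ok = value == rule["value"]
        elif rule["type"] == "allowlist":
            ok = value in rule["values"]
        elif rule["type"] == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            ok = value is not None and low <= value <= high
        else:
            ok = True  # Rule types not modeled in this sketch.
        if not ok:
            problems.append(key)
    return problems


requested = {
    "num_workers": 20,
    "node_type_id": "Standard_DS3_v2",
    "autotermination_minutes": 60,
    "custom_tags": {"team": "data-science"},
}
print(policy_violations(requested, policy_definition))  # prints ['num_workers']
```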

Batch workloads

Scenario

You need to provide clusters for scheduled batch jobs, such as production ETL jobs that perform data preparation. The suggested best practice is to launch a new cluster for each job run. Running each job on a new cluster helps avoid failures and missed SLAs caused by other workloads running on a shared cluster. Depending on the level of criticality for the job, you could use all on-demand instances to meet SLAs or balance between spot and on-demand instances for cost savings.
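A scheduled job that launches a new cluster for each run could be described with a payload like the one sketched below. The field names (`tasks`, `new_cluster`, `notebook_task`, `schedule`) are assumed from the Jobs API and should be verified against its reference. The helper function, job name, notebook path, and placeholder values are hypothetical.

```python
def scheduled_job_spec(
    job_name: str,
    notebook_path: str,
    new_cluster: dict,
    cron: str,
    timezone_id: str = "UTC",
) -> dict:
    """Build a job definition whose single task runs on its own job cluster."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # A fresh cluster is created for each run and terminated
                # when the run ends.
                "new_cluster": new_cluster,
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone_id,
        },
    }


# Placeholder runtime version and node type; a critical job might use
# all on-demand instances, as assumed here.
job = scheduled_job_spec(
    job_name="nightly-data-prep",
    notebook_path="/Jobs/nightly_data_prep",
    new_cluster={
        "spark_version": "<lts-runtime-version>",
        "node_type_id": "<node-type>",
        "num_workers": 4,
        "azure_attributes": {"availability": "ON_DEMAND_AZURE"},
    },
    cron="0 0 2 * * ?",  # every day at 02:00
)
print(job["tasks"][0]["new_cluster"]["num_workers"])
```

For less critical jobs, the `azure_attributes` block could instead request a mix of spot and on-demand instances to reduce cost.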

Figure: Scheduled batch workloads