Compute configuration best practices

This article describes recommendations for setting optional compute configurations. To reduce configuration decisions, Azure Databricks recommends taking advantage of both serverless compute and compute policies.

  • Serverless compute does not require configuring compute settings. Serverless compute is always available and scales according to your workload. See Types of compute.

  • Compute policies let you create preconfigured compute designed for specific use cases like personal compute, shared compute, power users, and jobs. If you don’t have access to the policies, contact your workspace admin. See Default policies and policy families.

If you choose to create compute with your own configurations, the sections below provide recommendations for typical use cases.

Note

This article assumes that you have unrestricted cluster creation. Workspace admins should only grant this privilege to advanced users.

Compute sizing considerations

People often think of compute size in terms of the number of workers, but there are other important factors to consider:

  • Total executor cores (compute): The total number of cores across all executors. This determines the maximum parallelism of the compute.
  • Total executor memory: The total amount of RAM across all executors. This determines how much data can be stored in memory before spilling to disk.
  • Executor local storage: The type and amount of local disk storage. Local disk is primarily used for spills during shuffles and for caching.

Additional considerations include worker instance type and size, which also influence the factors above. When sizing your compute, consider:

  • How much data will your workload consume?
  • What’s the computational complexity of your workload?
  • Where are you reading data from?
  • How is the data partitioned in external storage?
  • How much parallelism do you need?

Answering these questions will help you determine the optimal compute configuration for your workload.

There’s a balancing act between the number of workers and the size of worker instance types. Configuring compute with two workers, each with 40 cores and 100 GB of RAM, has the same compute and memory as configuring compute with eight workers, each with 10 cores and 25 GB of RAM: both provide 80 total cores and 200 GB of total RAM.
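
A quick arithmetic check, as a minimal sketch in Python, confirms the equivalence of the two layouts above:

```python
# Two ways to provision the same total capacity:
configs = [
    {"workers": 2, "cores": 40, "ram_gb": 100},  # few large workers
    {"workers": 8, "cores": 10, "ram_gb": 25},   # many small workers
]

for c in configs:
    total_cores = c["workers"] * c["cores"]
    total_ram = c["workers"] * c["ram_gb"]
    print(f"{c['workers']} workers -> {total_cores} total cores, {total_ram} GB total RAM")
# Both configurations yield 80 total cores and 200 GB of total RAM.
```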

Compute sizing examples

The following examples show compute recommendations based on specific types of workloads. These examples also include configurations to avoid and why those configurations are not suitable for the workload types.

Data analysis

Data analysts typically perform processing requiring data from multiple partitions, leading to many shuffle operations. Compute with a smaller number of nodes can reduce the network and disk I/O needed to perform these shuffles.

If you are writing only SQL, the best option for data analysis is a serverless SQL warehouse.

Note

If your workspace is enabled for the serverless compute public preview, you can use serverless compute to run analysis in Python or SQL. See Serverless compute for notebooks.

If you must configure a new compute, a single-node compute with a large VM type is likely the best choice, particularly for a single analyst.

Analytical workloads will likely require reading the same data repeatedly, so recommended node types are storage-optimized with disk cache enabled.

Additional features recommended for analytical workloads include:

  • Enable auto termination to ensure compute is terminated after a period of inactivity.
  • Consider enabling autoscaling based on the analyst’s typical workload.
  • Consider using pools, which let you restrict compute to pre-approved instance types and ensure consistent compute configurations. A configuration sketch incorporating these features follows this list.
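
As a minimal sketch, this is what such a single-node, storage-optimized configuration might look like with the Databricks SDK for Python. The cluster name, runtime version, and node type are illustrative assumptions; substitute values available in your workspace:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads credentials from the environment or config file

cluster = w.clusters.create(
    cluster_name="analyst-adhoc",              # illustrative name
    spark_version="15.4.x-scala2.12",          # assumption: use a current runtime version
    node_type_id="Standard_L16s_v3",           # assumption: a storage-optimized Azure VM
    num_workers=0,                             # single node: the driver does all the work
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
        "spark.databricks.io.cache.enabled": "true",  # disk cache for repeated reads
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=60,                # auto termination after inactivity
).result()                                     # blocks until the cluster is running
```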

Basic batch ETL

Note

If your workspace is enabled for serverless compute for workflows (Public Preview), you can use serverless compute to run your jobs. See Serverless compute for workflows.

Simple batch ETL jobs that don’t require wide transformations, such as joins or aggregations, typically benefit from compute-optimized worker types.

Compute-optimized workers have lower requirements for memory and storage and might result in cost savings over other worker types.
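
As a minimal sketch with the Databricks SDK for Python, a basic ETL cluster with compute-optimized workers might look like the following. The cluster name, runtime version, node type, and worker counts are illustrative assumptions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # reads credentials from the environment or config file

# Compute-optimized workers suit narrow transformations with little shuffling.
cluster = w.clusters.create(
    cluster_name="basic-etl",                  # illustrative name
    spark_version="15.4.x-scala2.12",          # assumption: use a current runtime version
    node_type_id="Standard_F8s_v2",            # assumption: a compute-optimized Azure VM
    autoscale=AutoScale(min_workers=2, max_workers=6),
    autotermination_minutes=30,
).result()                                     # blocks until the cluster is running
```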

Complex batch ETL

Note

If your workspace is enabled for serverless compute for workflows (Public Preview), you can use serverless compute to run your jobs. See Serverless compute for workflows.

For a complex ETL job, such as one that requires unions and joins across multiple tables, Databricks recommends reducing the number of workers to reduce the amount of data shuffled.

Complex transformations can be compute-intensive. If you observe significant spill to disk or OOM errors, add more nodes.

Databricks recommends compute-optimized worker types. Compute-optimized workers have lower requirements for memory and storage and might result in cost savings over other worker types. Optionally, use pools to decrease compute launch times and reduce total runtime when running job pipelines.
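
A minimal sketch of this pattern, again using the Databricks SDK for Python, with a small number of larger compute-optimized workers. The node type, runtime version, and worker count are assumptions to adjust for your data volume:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Fewer, larger workers keep more of each join or union local to a node,
# reducing the amount of data shuffled across the network.
cluster = w.clusters.create(
    cluster_name="complex-etl",                # illustrative name
    spark_version="15.4.x-scala2.12",          # assumption: use a current runtime version
    node_type_id="Standard_F32s_v2",           # assumption: a larger compute-optimized VM
    num_workers=2,                             # start small; add nodes if you see spill or OOM
    # To cut launch time, draw workers from a pre-warmed pool instead of
    # specifying a node type (hypothetical pool ID):
    # instance_pool_id="<your-pool-id>",
    autotermination_minutes=30,
).result()
```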

Training machine learning models

Databricks recommends single-node compute with a large node type for initial experimentation with training machine learning models. Having fewer nodes reduces the impact of shuffles.

Adding more workers can help with stability, but you should avoid adding too many workers because of the overhead of shuffling data.

Recommended worker types are storage-optimized with disk caching enabled to account for repeated reads of the same data and to enable caching of training data. If the compute and storage options provided by storage-optimized nodes are not sufficient, consider GPU-optimized nodes. A possible downside is that these nodes don’t support disk caching.

Additional features recommended for machine learning workloads include:

  • Enable auto termination to ensure compute is terminated after a period of inactivity.
  • Use pools, which let you restrict compute to pre-approved instance types and ensure consistent compute configurations. A single-node configuration sketch follows this list.
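
To make this concrete, here is a minimal sketch of a single-node ML configuration using the Databricks SDK for Python. The ML runtime version and node type are assumptions; swap in a GPU-optimized node type and a GPU ML runtime if storage-optimized nodes are not sufficient:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="ml-experiments",             # illustrative name
    spark_version="15.4.x-cpu-ml-scala2.12",   # assumption: use a current ML runtime
    node_type_id="Standard_L16s_v3",           # assumption: storage-optimized, disk cache capable
    num_workers=0,                             # single node for initial experimentation
    spark_conf={
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
        "spark.databricks.io.cache.enabled": "true",  # cache training data on local disk
    },
    custom_tags={"ResourceClass": "SingleNode"},
    autotermination_minutes=120,               # auto termination after inactivity
).result()
```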