GPU-enabled clusters

Note

Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during cluster creation.

Overview

Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how to create clusters with GPU-enabled instances, and which GPU drivers and libraries are installed on those instances.

To learn more about deep learning on GPU-enabled clusters, see Deep learning.

Create a GPU cluster

Creating a GPU cluster is similar to creating any Spark cluster (see Clusters). Keep the following in mind:

  • The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 6.6 ML (GPU, Scala 2.11, Spark 2.4.5).
  • The Worker Type and Driver Type must be GPU instance types.
  • For single-machine workflows without Spark, you can set the number of workers to zero.

Azure Databricks supports the NC instance type series (NC12 and NC24) and the NCv3 instance type series (NC6s_v3, NC12s_v3, and NC24s_v3). See Azure Databricks Pricing for an up-to-date list of supported GPU instance types and their availability regions. Your Azure Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
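As a rough illustration of the three requirements above, the following Python sketch assembles a cluster specification as a plain dictionary. The field names (spark_version, node_type_id, driver_node_type_id, num_workers) follow the Clusters API; the runtime key "6.6.x-gpu-ml-scala2.11", the node type name "Standard_NC6s_v3", and the helper name gpu_cluster_spec are assumptions made for illustration, so check the values offered in your own workspace before using them.

```python
# Illustrative only: builds a cluster spec dict, it does not call any API.
# The runtime key and node type strings below are assumed values.

def gpu_cluster_spec(name, runtime_key, node_type, num_workers=0):
    """Return a cluster spec that uses the same GPU instance type for
    the driver and the workers (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": name,
        "spark_version": runtime_key,      # must be a GPU-enabled runtime
        "node_type_id": node_type,         # worker type: GPU instance type
        "driver_node_type_id": node_type,  # driver type: GPU instance type
        "num_workers": num_workers,        # 0 for single-machine workflows
    }


spec = gpu_cluster_spec(
    name="gpu-single-node",
    runtime_key="6.6.x-gpu-ml-scala2.11",  # assumed key for Runtime 6.6 ML (GPU)
    node_type="Standard_NC6s_v3",          # assumed name for NC6s_v3
)
```

Setting num_workers to zero here mirrors the single-machine case described in the list above; a distributed workload would pass a positive worker count instead.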

GPU scheduling

Databricks Runtime 7.0 ML and above support GPU-aware scheduling from Apache Spark 3.0. Azure Databricks preconfigures it on GPU clusters for you.

spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you might need to change. The default configuration uses one GPU per task, which is ideal for distributed inference workloads and for distributed training that uses all GPU nodes. To run distributed training on a subset of nodes, which helps reduce communication overhead, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
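The setting described above can be sketched in Python as follows. The helper names are hypothetical, and the example assumes a worker type with 4 GPUs per node (for instance NC24s_v3); the rendered output uses the space-separated "key value" form accepted by the cluster Spark configuration field.

```python
# Illustrative only: produces the Spark config entry, it does not apply it.

def gpu_task_conf(gpus_per_worker):
    """Return the Spark config that assigns every GPU on a worker node
    to a single task (hypothetical helper)."""
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")
    return {"spark.task.resource.gpu.amount": str(gpus_per_worker)}


def render_spark_conf(conf):
    """Render a config dict as one 'key value' pair per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


# Assumption: each worker node has 4 GPUs.
conf_text = render_spark_conf(gpu_task_conf(4))
print(conf_text)
```

With one task holding all GPUs on a node, each node runs a single training process, which is the arrangement the recommendation above aims for.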

For PySpark tasks, Azure Databricks automatically remaps assigned GPU(s) to indices 0, 1, …. Under the default configuration that uses one GPU per task, your code can simply use the default GPU without checking which GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you do need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
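A minimal sketch of how task code might look up both views of its GPUs, assuming it runs inside a PySpark task on a GPU cluster. The function names are hypothetical. The values from CUDA_VISIBLE_DEVICES are kept as strings because the variable is a comma-separated list whose entries are not guaranteed to be plain integers.

```python
import os


def logical_gpu_indices(gpus_per_task):
    """Indices the task code can use directly: 0, 1, ..., gpus_per_task - 1."""
    return list(range(gpus_per_task))


def physical_gpu_ids(environ=None):
    """Physical GPU identifiers from CUDA_VISIBLE_DEVICES, as strings.

    Returns an empty list when the variable is unset or empty.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example with an explicit mapping instead of the real environment:
# a task that was given 2 GPUs whose physical indices are 2 and 3.
example_env = {"CUDA_VISIBLE_DEVICES": "2,3"}
logical = logical_gpu_indices(2)
physical = physical_gpu_ids(example_env)
```

In this example the task would address its GPUs as 0 and 1, while the environment variable shows that they correspond to physical devices 2 and 3.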

If you use Scala, you can get the indices of the GPUs assigned to the task from TaskContext.resources().get("gpu").

In Databricks Runtime releases below 7.0, to avoid conflicts among multiple Spark tasks trying to use the same GPU, Azure Databricks automatically configures GPU clusters so that at most one task runs per node. In this case, the task can use all GPUs on the node without running into conflicts with other tasks.

NVIDIA GPU driver, CUDA, and cuDNN

Azure Databricks installs the NVIDIA driver and libraries required to use GPUs on Spark driver and worker instances:

  • CUDA Toolkit, installed under /usr/local/cuda.
  • cuDNN: NVIDIA CUDA Deep Neural Network Library.
  • NCCL: NVIDIA Collective Communications Library.

The version of the NVIDIA driver included is 440.64. For the versions of the libraries included, see the release notes for the specific Databricks Runtime version you are using.
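One way to confirm the driver version from a notebook or script is to query nvidia-smi, as in the sketch below. It assumes nvidia-smi is on the PATH of the node where the code runs and returns an empty list otherwise. The helper names are hypothetical, and because the tool may report extra trailing components (for example a third version field), the comparison only checks the leading components of the expected version.

```python
import shutil
import subprocess

EXPECTED_DRIVER = "440.64"  # driver version stated in this article


def parse_driver_versions(output):
    """Split nvidia-smi output into one version string per GPU."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def installed_driver_versions():
    """Query nvidia-smi for the driver version; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_driver_versions(result.stdout)


def matches_expected(version, expected=EXPECTED_DRIVER):
    """True when the leading version components equal the expected ones."""
    want = expected.split(".")
    return version.split(".")[:len(want)] == want


for version in installed_driver_versions():
    print(version, "ok" if matches_expected(version) else "unexpected")
```

On a machine without an NVIDIA driver the loop simply prints nothing, so the snippet is safe to run anywhere.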

Note

This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Azure Databricks includes code from CUDA Samples.

NVIDIA End User License Agreement (EULA)

When you select a GPU-enabled “Databricks Runtime Version” in Azure Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.

Databricks Container Services on GPU clusters

Important

This feature is in Public Preview.

You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments with customized libraries. Refer to Customize containers with Databricks Container Services for instructions.

The Databricks Runtime Docker Hub contains example base images with GPU capability. The Dockerfiles used to generate these images are located in the example containers GitHub repository, which also describes what the example images provide and how to customize them.

When creating custom images for GPU clusters, you cannot change the NVIDIA driver version. The NVIDIA driver version must match the driver version on the host machine, which is 440.64. This version does not support CUDA 11.