GPU-enabled clusters

Note

Some GPU-enabled instance types are in Beta and are marked as such in the drop-down list when you select the driver and worker types during cluster creation.

Overview

Azure Databricks supports clusters accelerated with graphics processing units (GPUs). This article describes how to create clusters with GPU-enabled instances and lists the GPU drivers and libraries installed on those instances.

To learn more about deep learning on GPU-enabled clusters, see Deep learning.

Create a GPU cluster

Creating a GPU cluster is similar to creating any Spark cluster (see Clusters). Keep the following in mind:

  • The Databricks Runtime Version must be a GPU-enabled version, such as Runtime 6.6 ML (GPU, Scala 2.11, Spark 2.4.5).
  • The Worker Type and Driver Type must be GPU instance types.
  • For single-machine workflows without Spark, you can set the number of workers to zero.

Azure Databricks supports the NC instance type series (NC12 and NC24) and the NCv3 instance type series (NC6s_v3, NC12s_v3, and NC24s_v3). See Azure Databricks Pricing for an up-to-date list of supported GPU instance types and the regions where they are available. Your Azure Databricks deployment must reside in a supported region to launch GPU-enabled clusters.
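The requirements above can be collected into a cluster specification. The following is a minimal sketch, not an official template: the field names follow the Clusters API as commonly documented, while the runtime key "6.6.x-gpu-ml-scala2.11", the node type ID "Standard_NC6s_v3", and the helper gpu_cluster_spec are assumptions made for illustration. Check the values offered in your own workspace before using them.

```python
import json


def gpu_cluster_spec(cluster_name, runtime_version, node_type, num_workers=0):
    """Build a cluster-spec dictionary for a GPU cluster (hypothetical helper).

    The driver and the workers use the same GPU instance type here, and
    num_workers defaults to zero for single-machine workflows without Spark.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": cluster_name,
        "spark_version": runtime_version,   # must be a GPU-enabled runtime
        "node_type_id": node_type,          # worker type: a GPU instance type
        "driver_node_type_id": node_type,   # driver type: a GPU instance type
        "num_workers": num_workers,
    }


# Assumed example values; substitute the keys your workspace actually lists.
spec = gpu_cluster_spec("gpu-demo", "6.6.x-gpu-ml-scala2.11", "Standard_NC6s_v3")
print(json.dumps(spec, indent=2))
```

A dictionary like this could be sent as the request body when creating a cluster programmatically. The same choices are available in the cluster creation UI.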

GPU scheduling

Databricks Runtime 7.0 ML and above support GPU-aware scheduling from Apache Spark 3.0. Azure Databricks preconfigures it on GPU clusters for you.

spark.task.resource.gpu.amount is the only Spark config related to GPU-aware scheduling that you might need to change. The default configuration uses one GPU per task, which is ideal for distributed inference workloads and for distributed training that uses all GPU nodes. To run distributed training on a subset of nodes, which helps reduce communication overhead, Databricks recommends setting spark.task.resource.gpu.amount to the number of GPUs per worker node in the cluster Spark configuration.
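As a rough illustration of the two cases described above, the sketch below builds the Spark configuration entry for each one. The helper gpu_scheduling_conf and its parameters are hypothetical; only the config key spark.task.resource.gpu.amount comes from this article.

```python
def gpu_scheduling_conf(gpus_per_worker, one_task_per_node):
    """Return the Spark config entry for GPU-aware scheduling (hypothetical helper).

    one_task_per_node=False keeps the default of one GPU per task.
    one_task_per_node=True gives each task every GPU on its worker node,
    the setting recommended for distributed training on a subset of nodes.
    """
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")
    amount = gpus_per_worker if one_task_per_node else 1
    # Spark config values are passed as strings.
    return {"spark.task.resource.gpu.amount": str(amount)}


# Example: worker nodes with 4 GPUs each, training on a subset of nodes.
print(gpu_scheduling_conf(4, one_task_per_node=True))
```

The resulting key and value would go into the cluster's Spark configuration, not into notebook code, since the setting applies when the cluster starts.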

For PySpark tasks, Azure Databricks automatically remaps the assigned GPUs to indices 0, 1, …. Under the default configuration of one GPU per task, your code can simply use the default GPU without checking which GPU is assigned to the task. If you set multiple GPUs per task, for example 4, your code can assume that the indices of the assigned GPUs are always 0, 1, 2, and 3. If you need the physical indices of the assigned GPUs, you can get them from the CUDA_VISIBLE_DEVICES environment variable.
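The difference between the remapped indices and the physical ones can be sketched in plain Python. This example assumes that CUDA_VISIBLE_DEVICES holds a comma-separated list, which is the usual CUDA convention. The function names are hypothetical, and entries are kept as strings because the variable may also contain device UUIDs.

```python
import os


def logical_gpu_indices(gpus_per_task):
    """Indices a PySpark task can assume after remapping: 0, 1, ..."""
    return list(range(gpus_per_task))


def physical_gpu_ids(environ=None):
    """Read the physical GPU identifiers from CUDA_VISIBLE_DEVICES.

    Returns an empty list when the variable is unset or empty.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# With 4 GPUs per task, the code addresses devices 0..3 regardless of
# which physical GPUs were assigned.
print(logical_gpu_indices(4))
# Inside a task this reads the real environment; a dictionary is passed
# here only to keep the example self-contained.
print(physical_gpu_ids({"CUDA_VISIBLE_DEVICES": "2,3"}))
```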

If you use Scala, you can get the indices of the GPUs assigned to the task from TaskContext.resources().get("gpu").

For Databricks Runtime releases below 7.0, Azure Databricks automatically configures GPU clusters so that at most one task runs per node. This avoids conflicts among multiple Spark tasks trying to use the same GPU. The task can therefore use all GPUs on the node without conflicting with other tasks.

NVIDIA GPU driver, CUDA, and cuDNN

Azure Databricks installs the NVIDIA driver and the libraries required to use GPUs on Spark driver and worker instances:

  • CUDA Toolkit, installed under /usr/local/cuda.
  • cuDNN: NVIDIA CUDA Deep Neural Network Library.
  • NCCL: NVIDIA Collective Communications Library.

The included NVIDIA driver version is 440.64. For the versions of the included libraries, see the release notes for the specific Databricks Runtime version you are using.

Note

This software contains source code provided by NVIDIA Corporation. Specifically, to support GPUs, Azure Databricks includes code from CUDA Samples.

NVIDIA End User License Agreement (EULA)

When you select a GPU-enabled "Databricks Runtime Version" in Azure Databricks, you implicitly agree to the terms and conditions outlined in the NVIDIA EULA with respect to the CUDA, cuDNN, and Tesla libraries, and to the NVIDIA End User License Agreement (with NCCL Supplement) for the NCCL library.

Databricks Container Services on GPU clusters

Important

This feature is in Public Preview.

You can use Databricks Container Services on clusters with GPUs to create portable deep learning environments with customized libraries. For instructions, see Customize containers with Databricks Container Services.

The Databricks Runtime Docker Hub contains example base images with GPU capability. The Dockerfiles used to generate these images are located in the example containers GitHub repository, which also describes what the example images provide and how to customize them.

When you create custom images for GPU clusters, you cannot change the NVIDIA driver version. The NVIDIA driver version must match the driver version on the host machine, which is 440.64. This version does not support CUDA 11.
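One way to confirm the driver version seen from a running container is to query nvidia-smi and compare the result with the host version stated above. The sketch below assumes nvidia-smi is on the PATH of the cluster node. The helper names are hypothetical, and the comparison is demonstrated on a sample string so the example runs without a GPU.

```python
import subprocess

HOST_DRIVER_VERSION = "440.64"  # host driver version stated in this article


def query_driver_versions():
    """Ask nvidia-smi for the driver version, one line per GPU.

    Assumes nvidia-smi is installed and on the PATH; call this on a GPU node.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def driver_matches(output, expected=HOST_DRIVER_VERSION):
    """Return True if every reported driver version equals the expected one."""
    versions = [line.strip() for line in output.splitlines() if line.strip()]
    return bool(versions) and all(v == expected for v in versions)


# Sample output for a node with two GPUs, used here in place of a live query.
sample_output = "440.64\n440.64\n"
print(driver_matches(sample_output))
```

On a GPU node, driver_matches(query_driver_versions()) would perform the same check against the live driver.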