Distributed training

03/01/2024

When possible, Azure Databricks recommends that you train neural networks on a single machine, because distributed code for training and inference is more complex than single-machine code and is slower due to communication overhead. However, consider distributed training and inference if your model or your data is too large to fit in memory on a single machine. For these workloads, Databricks Runtime ML includes the TorchDistributor, Horovod, and spark-tensorflow-distributor packages.
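As a rough way to judge whether a model is likely to fit on one machine, the sketch below estimates the memory needed for weights, gradients, and optimizer state. It assumes 32-bit values and an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations and data batches, so treat the result as a lower bound. The function names and the 50% headroom default are illustrative choices, not part of any Databricks API.

```python
def estimate_training_memory_gib(num_parameters: int,
                                 bytes_per_value: int = 4,
                                 optimizer_states_per_parameter: int = 2) -> float:
    """Rough lower bound on training memory in GiB, ignoring activations.

    Counts one copy of the weights, one copy of the gradients, and the
    optimizer state (assumed: 2 extra values per parameter, as in Adam).
    """
    values_per_parameter = 1 + 1 + optimizer_states_per_parameter
    total_bytes = num_parameters * values_per_parameter * bytes_per_value
    return total_bytes / 2**30


def fits_on_single_device(num_parameters: int,
                          device_memory_gib: float,
                          headroom: float = 0.5) -> bool:
    """Hypothetical check that reserves `headroom` of memory for activations and data."""
    usable_gib = device_memory_gib * (1 - headroom)
    return estimate_training_memory_gib(num_parameters) <= usable_gib


if __name__ == "__main__":
    # 2**28 (about 268 million) parameters on a 16 GiB device.
    print(estimate_training_memory_gib(2**28))
    print(fits_on_single_device(2**28, device_memory_gib=16))
```

If the estimate is well under the memory of one GPU or one node, single-machine training is usually the simpler choice.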

Azure Databricks also offers distributed training for Spark ML models with the pyspark.ml.connect module. See Train Spark ML models on Databricks Connect with pyspark.ml.connect.

Note

Databricks does not recommend running multi-node distributed training using NC-series VMs due to low inter-node network performance. Instead, use one multi-GPU node, or use a different GPU VM size such as the NCasT4_v3-series, which supports accelerated networking.

DeepSpeed distributor

The DeepSpeed distributor is built on top of TorchDistributor and is recommended for models that require more compute power but are limited by memory constraints. DeepSpeed is an open-source library developed by Microsoft that offers optimized memory usage, reduced communication overhead, and advanced pipeline parallelism. Learn more about Distributed training with DeepSpeed distributor.

TorchDistributor

TorchDistributor is an open-source module in PySpark that lets you run distributed PyTorch training on Spark clusters by launching PyTorch training jobs as Spark jobs. Under the hood, it initializes the environment and the communication channels between the workers, and it uses the CLI command torch.distributed.run to run distributed training across the worker nodes. Learn more about Distributed training with TorchDistributor.
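The following is a minimal sketch of launching a training function with TorchDistributor, assuming a cluster with PySpark 3.4 or above and PyTorch installed. The tiny linear model, the random data, and the names train_fn and launch_training are placeholders for illustration. Imports are placed inside the functions so that they run on the workers that execute the code.

```python
def train_fn(learning_rate: float, use_gpu: bool) -> float:
    """Runs once in every worker process started by TorchDistributor."""
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # torch.distributed.run sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}" if use_gpu else "cpu")
    if use_gpu:
        torch.cuda.set_device(device)

    # Placeholder model; DDP averages gradients across processes.
    model = torch.nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_gpu else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=learning_rate)

    loss = torch.tensor(0.0)
    for _ in range(5):
        # Random data stands in for a real, sharded data loader.
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


def launch_training(num_processes: int = 2, use_gpu: bool = False) -> float:
    """Launches train_fn as a Spark job across the worker nodes."""
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(
        num_processes=num_processes,
        local_mode=False,  # False runs on the workers; True runs on the driver
        use_gpu=use_gpu,
    )
    # Positional arguments after the function are passed through to train_fn.
    return distributor.run(train_fn, 0.01, use_gpu)
```

Calling launch_training() from a notebook cell attached to a suitable cluster would start the job; setting local_mode=True can be useful for a quick test on the driver node.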

spark-tensorflow-distributor

spark-tensorflow-distributor is an open-source native package in TensorFlow for distributed training with TensorFlow on Spark clusters. Learn more about Distributed training with TensorFlow 2.
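As an illustration, the sketch below runs a small Keras training function with MirroredStrategyRunner from spark-tensorflow-distributor. It assumes TensorFlow 2 and the package are installed on the cluster. The model, the random data, and the names train_tf and launch_tf_training are placeholders. With the runner's default settings, the function is expected to run inside a distribution strategy that the runner creates, so the training code itself can be written as single-machine TensorFlow.

```python
def train_tf(epochs: int = 2) -> None:
    """Single-machine style Keras code; the runner supplies the strategy."""
    import numpy as np
    import tensorflow as tf

    # Random data stands in for a real dataset.
    features = np.random.rand(256, 10).astype("float32")
    targets = np.random.rand(256, 1).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((features, targets)).batch(32)

    # Shard by data rather than by file, because this dataset is in memory.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    dataset = dataset.with_options(options)

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")
    model.fit(dataset, epochs=epochs)


def launch_tf_training(num_slots: int = 2, use_gpu: bool = False) -> None:
    """Runs train_tf on `num_slots` Spark task slots."""
    from spark_tensorflow_distributor import MirroredStrategyRunner

    runner = MirroredStrategyRunner(num_slots=num_slots, use_gpu=use_gpu)
    # Keyword arguments are forwarded to the training function.
    runner.run(train_tf, epochs=2)
```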

Ray

Ray is an open-source framework that specializes in parallel compute processing for scaling ML workflows and AI applications. See Use Ray on Azure Databricks.

Horovod

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. Azure Databricks supports distributed deep learning training using HorovodRunner and the horovod.spark package. For Spark ML pipeline applications using Keras or PyTorch, you can use the horovod.spark estimator API.
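A minimal sketch of the HorovodRunner pattern with PyTorch follows, assuming Databricks Runtime ML, where the sparkdl and horovod packages are pre-installed. The model, the random data, and the names train_hvd and launch_horovod are placeholders for illustration.

```python
def train_hvd(learning_rate: float = 0.01) -> None:
    """Runs in every Horovod worker process."""
    import horovod.torch as hvd
    import torch

    hvd.init()
    use_gpu = torch.cuda.is_available()
    if use_gpu:
        # Pin each process to one GPU on its node.
        torch.cuda.set_device(hvd.local_rank())
    device = torch.device("cuda" if use_gpu else "cpu")

    model = torch.nn.Linear(10, 1).to(device)
    # Scaling the learning rate by the number of workers is a common heuristic.
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate * hvd.size())
    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Start every worker from the same initial state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    for _ in range(5):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()


def launch_horovod(num_workers: int = 2) -> None:
    """Starts train_hvd on `num_workers` processes through HorovodRunner."""
    from sparkdl import HorovodRunner

    runner = HorovodRunner(np=num_workers)
    # Keyword arguments are forwarded to the training function.
    runner.run(train_hvd, learning_rate=0.01)
```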

Requirements

Databricks Runtime ML.

Use Horovod

The following articles provide general information about distributed deep learning with Horovod and example notebooks illustrating how to use HorovodRunner and the horovod.spark package.

Install a different version of Horovod

To upgrade or downgrade Horovod from the pre-installed version in your ML cluster, you must recompile Horovod by following these steps:

Uninstall the current version of Horovod.

%pip uninstall -y horovod

If using a GPU-accelerated cluster, install CUDA development libraries required to compile Horovod. To ensure compatibility, leave the package versions unchanged.

%sh
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"

wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
dpkg -i ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

apt-get update
apt-get install --allow-downgrades --no-install-recommends -y \
  cuda-nvml-dev-11-0=11.0.167-1 \
  cuda-nvcc-11-0=11.0.221-1 \
  cuda-cudart-dev-11-0=11.0.221-1 \
  cuda-libraries-dev-11-0=11.0.3-1 \
  libnccl-dev=2.11.4-1+cuda11.5 \
  libcusparse-dev-11-0=11.1.1.245-1

Download the desired version of Horovod’s source code and compile with the appropriate flags. If you don’t need any of the extensions (such as HOROVOD_WITH_PYTORCH), you can remove those flags.

CPU

%sh
HOROVOD_VERSION=v0.21.3 # Change as necessary
git clone --recursive https://github.com/horovod/horovod.git --branch ${HOROVOD_VERSION}
cd horovod
rm -rf build/ dist/
# For Databricks Runtime 8.4 ML and below, replace the interpreter path with /databricks/conda/envs/databricks-ml/bin/python
HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 \
  sudo /databricks/python3/bin/python setup.py bdist_wheel
readlink -f dist/horovod-*.whl

GPU

%sh
HOROVOD_VERSION=v0.21.3 # Change as necessary
git clone --recursive https://github.com/horovod/horovod.git --branch ${HOROVOD_VERSION}
cd horovod
rm -rf build/ dist/
# For Databricks Runtime 8.4 ML and below, replace the interpreter path with /databricks/conda/envs/databricks-ml-gpu/bin/python
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_CUDA_HOME=/usr/local/cuda HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 \
  sudo /databricks/python3/bin/python setup.py bdist_wheel
readlink -f dist/horovod-*.whl

Use %pip to reinstall Horovod, specifying the Python wheel path from the previous command’s output. This example uses version 0.21.3.

%pip install --no-cache-dir /databricks/driver/horovod/dist/horovod-0.21.3-cp38-cp38-linux_x86_64.whl
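Because the wheel file name depends on the Horovod version and the Python version used for the build, it can help to look the file up rather than type it by hand. The sketch below is one way to do that from a Python cell. The helper names are hypothetical, and the directory in the usage note is taken from the example path above.

```python
from pathlib import Path
from typing import Optional


def find_horovod_wheel(dist_dir: str) -> Optional[Path]:
    """Returns the most recently modified Horovod wheel in dist_dir, or None."""
    wheels = sorted(
        Path(dist_dir).glob("horovod-*.whl"),
        key=lambda path: path.stat().st_mtime,
    )
    return wheels[-1] if wheels else None


def pip_install_command(wheel: Path) -> str:
    """Builds the notebook command that installs the given wheel."""
    return f"%pip install --no-cache-dir {wheel}"
```

For example, find_horovod_wheel("/databricks/driver/horovod/dist") would return the wheel built in the previous step if the build wrote it to that directory, and pip_install_command shows the command to paste into a new cell.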

Troubleshoot Horovod installation

Problem: Importing horovod.{torch|tensorflow} raises ImportError: Extension horovod.{torch|tensorflow} has not been built

Solution: Horovod comes pre-installed on Databricks Runtime ML, so this error typically occurs if updating an environment goes wrong. The error indicates that Horovod was installed before a required library (PyTorch or TensorFlow). Since Horovod is compiled during installation, horovod.{torch|tensorflow} will not get compiled if those packages aren’t present during the installation of Horovod. To fix the issue, follow these steps:

Verify that you are on a Databricks Runtime ML cluster.
Ensure that the PyTorch or TensorFlow package is already installed.
Uninstall Horovod (%pip uninstall -y horovod).
Install cmake (%pip install cmake).
Reinstall Horovod.
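The check in the second step can be sketched in Python as follows. This example only tests whether the framework packages can be found in the current environment. It does not verify that the Horovod extensions were built, and the function name missing_frameworks is illustrative.

```python
import importlib.util
from typing import Iterable, List


def missing_frameworks(required: Iterable[str] = ("torch", "tensorflow")) -> List[str]:
    """Returns the names of the required packages that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_frameworks()
    if missing:
        print("Install these packages before reinstalling Horovod:", missing)
    else:
        print("PyTorch and TensorFlow were found; Horovod can be reinstalled.")
```

If you only need one framework, pass just that name, for example missing_frameworks(("torch",)). After reinstalling, importing horovod.torch or horovod.tensorflow in a new cell is a direct way to confirm that the extension was built.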