Distributed Training

When possible, Azure Databricks recommends that you train neural networks on a single machine, because distributed code for training and inference is more complex than single-machine code and is slower due to communication overhead. However, you should consider distributed training and inference if your model or your data is too large to fit in memory on a single machine.

Note

Accelerated networking is not available on the GPU VMs supported by Azure Databricks. Therefore, we do not recommend running distributed deep learning (DL) training on a multi-node GPU cluster. You can use a single multi-GPU node or a multi-node CPU cluster for distributed DL training.

Horovod is a distributed training framework, developed by Uber, for TensorFlow, Keras, and PyTorch. The Horovod framework makes it easy to take a single-GPU program and train it on many GPUs.
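
As a rough picture of what data-parallel training of this kind does: each worker computes gradients on its own shard of the batch, the gradients are averaged across workers, and every worker applies the same update. The sketch below simulates that idea in plain Python with no Horovod dependency. The function names (`local_gradient`, `allreduce_average`, `data_parallel_step`) and the toy data are illustrative assumptions, not Horovod APIs.

```python
# Simulated data-parallel gradient averaging for a 1-D linear model y = w * x.
# Illustrative only: plain Python, no Horovod. All names here are hypothetical.

def local_gradient(w, shard):
    """Mean gradient of the loss 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_average(values):
    """Stand-in for an allreduce: every worker ends up with the same mean."""
    return sum(values) / len(values)

def data_parallel_step(w, shards, lr):
    """One SGD step where each shard plays the role of one worker."""
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * allreduce_average(grads)

# Toy data drawn from y = 2x, split into two equal-sized shards ("workers").
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * local_gradient(0.0, data)
```

With equal-sized shards, the averaged per-worker gradient equals the full-batch gradient, so `w_parallel` and `w_single` agree. What the simulation leaves out is the cost of exchanging gradients between workers, which is the communication overhead mentioned earlier.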

Azure Databricks supports distributed DL training via the HorovodRunner tool. HorovodRunner simplifies the process of migrating from single-machine TensorFlow, Keras, and PyTorch workloads to multi-GPU machines and multi-node clusters.
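
The following is a minimal sketch of what a HorovodRunner migration can look like for a PyTorch workload. It assumes a Databricks Runtime ML cluster where `sparkdl` and `horovod` are installed. The model, the synthetic data, the hyperparameters, and the helper names `scaled_learning_rate`, `train`, and `launch` are hypothetical. Scaling the learning rate by the number of workers is a common heuristic, not a requirement.

```python
# Sketch: wrap single-machine training code in a function, add the Horovod
# hooks, and hand the function to HorovodRunner. Assumes `sparkdl`, `horovod`,
# and `torch` are available on the cluster; model and data are placeholders.

def scaled_learning_rate(base_lr, num_workers):
    """Common heuristic (an assumption here): scale LR with the worker count."""
    return base_lr * num_workers

def train(base_lr=0.01, epochs=1):
    # Import inside the function so each worker process does its own imports.
    import torch
    import horovod.torch as hvd

    hvd.init()  # initialize Horovod in this worker process

    # Pin one GPU per process when GPUs are present; otherwise stay on CPU.
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
        device = torch.device("cuda", hvd.local_rank())
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(10, 1).to(device)  # placeholder model
    optimizer = torch.optim.SGD(
        model.parameters(), lr=scaled_learning_rate(base_lr, hvd.size())
    )
    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Start every worker from the same weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss = None
    for _ in range(epochs):
        x = torch.randn(32, 10, device=device)  # synthetic batch
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item() if hvd.rank() == 0 and loss is not None else None

def launch(np=-1):
    """Run `train` with HorovodRunner; `np` is the number of processes."""
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=np)
    return hr.run(train, base_lr=0.01, epochs=1)
```

In a notebook you would call `launch(np=2)` to use two worker processes. A negative `np` is documented as running the processes on the driver node, which is convenient for testing, but verify that behavior against the runtime version you are using.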

Note

HorovodEstimator has been deprecated as of Databricks Runtime 6.2 ML and is scheduled to be removed from Databricks Runtime 7.0 ML. HorovodEstimator is similar to HorovodRunner in providing Horovod support, but it constrains the user to TensorFlow Estimators and Spark ML Pipeline APIs.

These articles contain in-depth discussions of HorovodRunner and HorovodEstimator, and example notebooks demonstrating each approach: