HorovodEstimator: distributed deep learning with Horovod and Apache Spark MLlib

Important

HorovodEstimator is removed in Databricks Runtime 7.0 ML. Use HorovodRunner for distributed deep learning training instead.
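For readers migrating off HorovodEstimator, the replacement HorovodRunner API launches a training function on the cluster's workers. The sketch below is a minimal example; the model architecture, learning rate, and `np=2` (number of parallel processes) are placeholder choices, not values from this article.

```python
# Minimal HorovodRunner sketch; model and hyperparameters are placeholders.
from sparkdl import HorovodRunner

def train():
    # Imports happen inside the function because it runs on each worker.
    import horovod.tensorflow.keras as hvd
    import tensorflow as tf

    hvd.init()  # initialize Horovod on this worker process

    # For brevity, each worker loads the full dataset here; real jobs
    # typically shard the data by hvd.rank().
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

    # Broadcast initial variables from rank 0 so all workers start in sync.
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x_train, y_train, batch_size=64, epochs=1, callbacks=callbacks)

# np=2 launches two parallel training processes on the cluster.
hr = HorovodRunner(np=2)
hr.run(train)
```

Unlike HorovodEstimator, HorovodRunner does not manage data ingest or model export for you; the training function owns both.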

HorovodEstimator is an Apache Spark MLlib-style estimator API that leverages the Horovod framework developed by Uber. It facilitates distributed, multi-GPU training of deep neural networks on Spark DataFrames, simplifying the integration of ETL in Spark with model training in TensorFlow. Specifically, HorovodEstimator simplifies launching distributed training with Horovod by:

  • Distributing training code and data to each machine on your cluster
  • Enabling passwordless SSH between the driver and workers, and launching training via MPI
  • Writing custom data-ingest and model-export logic
  • Running model training and evaluation simultaneously

Requirements

Databricks Runtime ML.

You can run HorovodEstimator on clusters of two or more CPU or GPU-enabled machines; we recommend running on GPU instances if possible.

HorovodEstimator expects all GPUs on the cluster to be available; therefore, we do not recommend using this API on shared clusters.

If using GPUs, we recommend not opening any other TensorFlow sessions on the cluster you are using with HorovodEstimator. If you open a TensorFlow session, the Python REPL running your notebook will claim a GPU, preventing HorovodEstimator from running. In this case, you may need to detach and reattach your notebook, then rerun your HorovodEstimator code without running any TensorFlow code beforehand.

Distributed training

HorovodEstimator is a Spark MLlib Estimator and can be used with the Spark MLlib Pipelines API, although estimator persistence is not yet supported.

Fitting a HorovodEstimator returns an MLlib Transformer (a TFTransformer) that can be used for distributed inference on a DataFrame. It also stores model checkpoints (which can be used to resume training), event files (containing metrics logged during training), and a tf.SavedModel (which can be used to apply the model for inference outside Spark) in the specified model directory.
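Because it follows the MLlib Estimator pattern, the fit/transform flow looks roughly like the sketch below. The parameter names (`modelFn`, `featureMapping`, and so on), the import path, and the helper names `model_fn`, `train_df`, and `test_df` are illustrative assumptions, not a verified API; consult the example notebook for the exact signature.

```python
# Hypothetical sketch of the Estimator/Transformer flow described above.
# Parameter names and the import path are assumptions, not a verified API.
from sparkdl.estimator.horovod_estimator.estimator import HorovodEstimator

est = HorovodEstimator(
    modelFn=model_fn,                      # placeholder: a TensorFlow model function
    featureMapping={"image": "features"},  # placeholder: DataFrame column -> tensor name
    modelDir="/tmp/horovod_model",         # checkpoints, event files, and SavedModel land here
    labelCol="label",
    batchSize=64,
    maxSteps=5000,
)

# fit() launches distributed training and returns a TFTransformer.
transformer = est.fit(train_df)

# The TFTransformer performs distributed inference on a DataFrame.
predictions = transformer.transform(test_df)

# Rerunning fit() with the same modelDir resumes from the latest checkpoint.
```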

HorovodEstimator makes no fault-tolerance guarantees. If an error occurs during training, HorovodEstimator does not attempt to recover, although you can rerun fit() to resume training from the latest checkpoint.

Example

The following example notebook demonstrates how to use HorovodEstimator to train a deep neural network on the MNIST dataset, a large database of handwritten digits, shown in the following illustration.

MNIST dataset

Training a model to predict a digit is commonly used as the “Hello World” of machine learning.

HorovodEstimator notebook

Get notebook