Hyperopt with HorovodRunner and Apache Spark MLlib

Hyperopt is typically used to optimize objective functions that can be evaluated on a single machine. However, you can also use Hyperopt to optimize objective functions that require distributed training to evaluate. In this case, Hyperopt generates trials representing different hyperparameter settings on the driver node, and each trial is evaluated by a distributed training algorithm that can use the full cluster. This setup applies to any distributed machine learning algorithm or library, including:

  • Apache Spark MLlib: the Apache Spark scalable machine learning library consisting of common learning algorithms and utilities.
  • HorovodRunner: a general API to run distributed deep learning workloads on Azure Databricks using the Horovod framework. By integrating Horovod with Spark's barrier mode, Azure Databricks is able to provide higher stability for long-running deep learning training jobs on Spark.

The following section walks through an example of hyperparameter tuning for distributed training with Hyperopt and HorovodRunner.

How to use Hyperopt with HorovodRunner

At a high level, you launch HorovodRunner in distributed mode, that is, HorovodRunner(np>0), inside the objective function you pass to Hyperopt.

At a lower level, the flow is as follows: Hyperopt evaluates all trials sequentially on the Spark driver node. Within each trial, HorovodRunner is launched from the driver node and distributes the training job to the Spark worker nodes. HorovodRunner then collects the return value back to the driver node and passes it to Hyperopt.
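The following is a minimal sketch of this pattern, not the code from the notebook below: the function names, the np=2 setting, the search space, and the dummy loss computation are illustrative placeholders for your own training logic.

```python
from hyperopt import fmin, hp, tpe, STATUS_OK
from sparkdl import HorovodRunner


def train_hvd(learning_rate):
    # Runs on each Horovod worker. Import Horovod inside the function so the
    # import happens on the workers, not only on the driver.
    import horovod.tensorflow.keras as hvd

    hvd.init()
    # ... build and fit your model here, typically scaling the learning rate
    # by hvd.size() ...
    # Placeholder loss so the sketch runs end to end; return your real
    # validation loss instead.
    return (learning_rate - 0.01) ** 2


def objective(params):
    # Hyperopt calls this sequentially on the driver for each trial.
    # np > 0 launches HorovodRunner in distributed mode; np=2 is arbitrary.
    hr = HorovodRunner(np=2)
    # HorovodRunner.run returns the value returned by the rank 0 process,
    # which is then handed back to Hyperopt as the trial's loss.
    loss = hr.run(train_hvd, learning_rate=params["learning_rate"])
    return {"loss": loss, "status": STATUS_OK}


best = fmin(
    fn=objective,
    space={"learning_rate": hp.loguniform("learning_rate", -7, -2)},
    algo=tpe.suggest,
    max_evals=16,
)
```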

Note

Azure Databricks does not support automatic logging to MLflow with the Trials class, so when using Hyperopt with distributed training you must manually call MLflow to log trials.
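For example, under the same assumptions as the sketch above (reusing the hypothetical train_hvd function), you could log each trial manually inside the objective function; the metric name "loss" and the nested-run usage are illustrative choices, not requirements.

```python
import mlflow
from hyperopt import STATUS_OK
from sparkdl import HorovodRunner


def objective(params):
    # Log each Hyperopt trial manually, since automatic MLflow logging
    # is not available for these trials.
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        hr = HorovodRunner(np=2)
        loss = hr.run(train_hvd, learning_rate=params["learning_rate"])
        mlflow.log_metric("loss", loss)
    return {"loss": loss, "status": STATUS_OK}
```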

Hyperopt and HorovodRunner distributed training notebook

Get notebook