Third-party machine learning integrations

This section provides instructions and examples of how to install, configure, and run some of the most popular third-party ML tools in Azure Databricks. Azure Databricks provides these examples on a best-effort basis. Because they are external libraries, they may change in ways that are not easy to predict. If you need additional support for third-party tools, consult the documentation, mailing lists, forums, or other support options provided by the library vendor or maintainer.

H2O Sparkling Water

H2O is an open source project for distributed machine learning. This section describes how to integrate H2O using the Sparkling Water module.


The instruction "Open H2O Flow in browser https://<ipaddress>:54321 (CMD + click in Mac OSX)", which appears after you connect to the H2O server, requires SSH access to the cluster and is not supported on Azure Databricks.

Python notebook


Databricks Runtime for Machine Learning installs XGBoost, which conflicts with the XGBoost packaged in PySparkling. To use PySparkling on Databricks Runtime ML, you must remove XGBoost using an init script:


rm /databricks/jars/spark--maven-trees--ml--xgboost*
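
As a rough illustration, the init script could be generated from a notebook along the following lines. This is a minimal sketch, not an official procedure: the DBFS path and the helper name build_init_script are hypothetical, and the commented-out dbutils.fs.put call assumes it runs inside a Databricks notebook where dbutils is predefined.

```python
# Hypothetical DBFS location for the init script; choose your own path.
INIT_SCRIPT_PATH = "dbfs:/databricks/init-scripts/remove-xgboost.sh"


def build_init_script() -> str:
    """Return the text of a shell script that deletes the bundled XGBoost JARs."""
    return "\n".join([
        "#!/bin/bash",
        "# Remove the XGBoost JARs that ship with Databricks Runtime ML",
        "rm /databricks/jars/spark--maven-trees--ml--xgboost*",
        "",
    ])


script = build_init_script()
print(script)

# Inside a Databricks notebook, the script could then be written to DBFS
# (the third argument overwrites an existing file):
# dbutils.fs.put(INIT_SCRIPT_PATH, script, True)
```

After the script is stored, it still has to be registered as an init script in the cluster configuration, and the cluster must be restarted for it to take effect.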

H2O Sparkling Water Python notebook

Get notebook

Scala notebook

H2O Sparkling Water Scala notebook

Get notebook


scikit-learn

scikit-learn, a well-known Python machine learning library, is included in Databricks Runtime. See Databricks runtime release notes for the scikit-learn library version included with your cluster’s runtime.
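
The installed version can also be checked directly from a notebook cell. The following is a small sketch that assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is illustrative only.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the scikit-learn version available on the current cluster.
found = installed_version("scikit-learn")
print(f"scikit-learn: {found if found is not None else 'not installed'}")
```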

scikit-learn notebook

Get notebook


XGBoost

XGBoost is a popular machine learning library designed for training gradient-boosted decision trees and random forests. You can train XGBoost models on a single machine or in a distributed fashion. Read more in the XGBoost documentation.


If you use XGBoost 0.90 for training and the training job fails, the shared Spark context will be killed and the only way to recover is to restart the cluster. This is a bug in XGBoost.

Install XGBoost

Install XGBoost on Databricks Runtime ML

XGBoost is included in Databricks Runtime ML. You can use it in Databricks Runtime ML without installing any additional packages. See Databricks Runtime for Machine Learning.

For the version of XGBoost installed in the Databricks Runtime ML version you are using, see the release notes. To install a different version of the XGBoost Python package in Databricks Runtime ML, install XGBoost as a Databricks PyPI library. Specify the package as follows, replacing <xgboost version> with the desired version:

xgboost==<xgboost version>

Install XGBoost on Databricks Runtime

  • Python package: Use Databricks Library Utilities by executing the following command in a notebook cell, replacing <xgboost version> with the desired version:

    dbutils.library.installPyPI("xgboost", version="<xgboost version>")

  • Scala/Java packages: Install as a Databricks library with the Spark Package name xgboost-linux64.

Single node training in Python

The XGBoost Python package supports only single-node training.

XGBoost Python notebook

Get notebook

Distributed training in Scala

To perform distributed training, you must use XGBoost’s Scala/Java packages.

Integrate XGBoost with ML pipelines

XGBoost classification notebook

Get notebook

Integrate XGBoost with cross validation

XGBoost regression notebook

Get notebook