Deep Learning Pipelines

Deep Learning Pipelines is a high-level deep learning framework that supports common deep learning workflows through the Apache Spark MLlib Pipelines API and uses Spark to scale deep learning out to big data. It is an open source project released under the Apache 2.0 License. For details about the library, refer to the Deep Learning Pipelines GitHub page.

Deep Learning Pipelines calls into lower-level deep learning libraries. It currently supports TensorFlow and Keras with the TensorFlow backend.


The Deep Learning Pipelines library is included in Databricks Runtime ML, a machine learning runtime that provides a ready-to-go environment for machine learning and data science. Instead of installing Deep Learning Pipelines using the instructions in the “Cluster setup” section of the notebook below, you can simply create a cluster using Databricks Runtime ML. See Databricks Runtime for Machine Learning.

Migration guide for Databricks Runtime 6.2 ML and above


Parts of the Deep Learning Pipelines library, sparkdl, are deprecated. Specifically, the Transformers and Estimators used in Apache Spark ML pipelines are deprecated in Databricks Runtime 6.2 ML and are scheduled for removal in Databricks Runtime 7.0 ML. See the following sections for migration tips and workarounds.

Reading images

Deep Learning Pipelines includes an image reader sparkdl.image.imageIO, which is deprecated in Databricks Runtime 6.2 ML.

Instead, use the image data source or binary file data source from Apache Spark. Many of the example notebooks in the Deep Learning documentation show use cases of these two data sources.
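
As a minimal sketch of this migration, the two data sources can be wrapped in small helpers. The helper names, the directory path, and the `*.jpg` filter are illustrative assumptions, and `spark` is assumed to be an existing SparkSession. The image data source decodes each file into an `image` struct, while the binary file data source returns the raw bytes for you to decode yourself.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only
    from pyspark.sql import DataFrame, SparkSession


def read_decoded_images(spark: "SparkSession", path: str) -> "DataFrame":
    """Load images with Spark's image data source.

    Each row carries an `image` struct with the fields
    origin, height, width, nChannels, mode, and data.
    """
    return (
        spark.read.format("image")
        .option("dropInvalid", True)  # skip files that cannot be decoded
        .load(path)
    )


def read_raw_image_files(
    spark: "SparkSession", path: str, glob: str = "*.jpg"
) -> "DataFrame":
    """Load files with Spark's binary file data source.

    Each row has the columns path, modificationTime, length, and content
    (the undecoded file bytes).
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob)  # keep only files matching the pattern
        .load(path)
    )
```

For example, `read_raw_image_files(spark, "/path/to/images")` would return one row per JPEG file under a hypothetical `/path/to/images` directory. The binary file source is often the better fit when decoding and preprocessing happen inside the inference function.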

Distributed hyperparameter tuning

Deep Learning Pipelines includes a Spark ML Estimator sparkdl.KerasImageFileEstimator for tuning hyperparameters using Spark ML tuning utilities. KerasImageFileEstimator is deprecated in Databricks Runtime 6.2 ML.

Instead, use Hyperopt to distribute hyperparameter tuning for deep learning models.
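
The following sketch shows roughly what a Hyperopt-based search could look like. It is not a drop-in replacement for KerasImageFileEstimator: `train_and_evaluate` is a hypothetical stand-in that returns a synthetic loss where a real objective would build and fit a Keras model, and the hyperparameter names, ranges, and `parallelism` value are assumptions chosen for illustration.

```python
import math


def train_and_evaluate(learning_rate: float, hidden_units: int) -> float:
    """Hypothetical stand-in for training a model and returning validation loss.

    A real objective would build and fit a Keras/TensorFlow model here.
    This synthetic surface is minimized at learning_rate=1e-3, hidden_units=64.
    """
    return (math.log10(learning_rate) + 3.0) ** 2 + ((hidden_units - 64) / 64.0) ** 2


def objective(params: dict) -> float:
    """Called by Hyperopt once per trial with a sampled point; lower is better."""
    # hp.quniform yields floats, so cast to int before using it as a layer size.
    return train_and_evaluate(params["learning_rate"], int(params["hidden_units"]))


def run_search(max_evals: int = 32, parallelism: int = 4) -> dict:
    """Run the search with trials distributed across Spark workers."""
    # Imported lazily so the objective can be exercised without Hyperopt installed.
    from hyperopt import SparkTrials, fmin, hp, tpe

    search_space = {
        "learning_rate": hp.loguniform(
            "learning_rate", math.log(1e-5), math.log(1e-1)
        ),
        "hidden_units": hp.quniform("hidden_units", 16, 256, 16),
    }
    # SparkTrials sends each trial to a Spark worker; fmin returns the best point found.
    return fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=SparkTrials(parallelism=parallelism),
    )
```

On a cluster with Hyperopt available, calling `run_search()` would return a dictionary of the best sampled values. Each trial here trains one model on a single worker, so this pattern suits models that fit on one machine.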

Distributed inference

Deep Learning Pipelines includes several Spark ML Transformers for distributing inference, all of which are deprecated in Databricks Runtime 6.2 ML:

  • DeepImagePredictor
  • TFImageTransformer
  • KerasImageFileTransformer
  • TFTransformer
  • KerasTransformer

Instead, use pandas UDFs to run inference on Spark DataFrames, following the examples in the Model Inference articles.
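
A minimal sketch of the pandas UDF pattern is shown below. The "model" is a hypothetical linear function standing in for a trained Keras or TensorFlow model, and the column names are assumptions. The point is the shape of the code: a function that maps one pandas Series batch to a Series of predictions, wrapped with `pandas_udf` and applied to a DataFrame column.

```python
from typing import TYPE_CHECKING

import pandas as pd

if TYPE_CHECKING:  # used for type hints only
    from pyspark.sql import DataFrame

# Hypothetical stand-in for a trained model: y = 2x + 1.
WEIGHT, BIAS = 2.0, 1.0


def predict_batch(features: pd.Series) -> pd.Series:
    """Score one batch of rows; Spark calls this once per Arrow batch."""
    # A real implementation would call model.predict(...) on the batch here.
    return features * WEIGHT + BIAS


def with_predictions(df: "DataFrame", input_col: str = "feature") -> "DataFrame":
    """Append a `prediction` column computed by the pandas UDF."""
    # Imported lazily so predict_batch can be tested without PySpark installed.
    from pyspark.sql.functions import pandas_udf

    predict_udf = pandas_udf(predict_batch, returnType="double")
    return df.withColumn("prediction", predict_udf(df[input_col]))
```

Because `predict_batch` is plain pandas code, it can be checked locally on a small Series before it is run at scale. For large models, loading the model once per worker rather than once per batch usually matters more than the UDF wiring itself.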

Deploying models as SQL UDFs

Deep Learning Pipelines includes a utility sparkdl.udf.keras_image_model.registerKerasImageUDF for deploying a deep learning model as a UDF callable from Spark SQL. registerKerasImageUDF is deprecated in Databricks Runtime 6.2 ML.

Instead, use MLflow to export the model as a UDF, following the MLflow model inference example.
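
As a rough sketch, assuming the model has already been logged with MLflow, it could be exposed to Spark SQL as follows. The helper name, the UDF name, and the model URI are hypothetical. `mlflow.pyfunc.spark_udf` and `spark.udf.register` are the only library calls used.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only
    from pyspark.sql import SparkSession


def register_model_as_sql_udf(
    spark: "SparkSession", model_uri: str, udf_name: str = "predict"
) -> None:
    """Wrap an MLflow model as a Spark UDF and register it for use in SQL."""
    # Imported lazily so this helper can be defined where MLflow is not installed.
    from mlflow.pyfunc import spark_udf

    # Build a Spark UDF that loads the model and returns one double per row.
    model_udf = spark_udf(spark, model_uri=model_uri, result_type="double")
    # Make the UDF callable by name from Spark SQL.
    spark.udf.register(udf_name, model_udf)
```

After calling the helper with a model URI such as a hypothetical `runs:/<run_id>/model`, a query like `SELECT predict(feature1, feature2) FROM my_table` could score rows directly in SQL, with the table and column names here again being placeholders.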