How to Use Microsoft Machine Learning Library for Apache Spark


This article is deprecated. Support for earlier versions of this service will end incrementally. View the support timeline. Start using the latest version with this quickstart.


Microsoft Machine Learning Library for Apache Spark (MMLSpark) provides tools that let you easily create scalable machine learning models for large datasets. It includes integration of SparkML pipelines with the Microsoft Cognitive Toolkit and OpenCV, enabling you to:

  • Ingest and pre-process image data
  • Featurize images and text using pre-trained deep learning models
  • Train and score classification and regression models using implicit featurization
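The third capability is the one the walkthrough below exercises: MMLSpark's `TrainClassifier` featurizes mixed numeric and string columns automatically, so a tabular dataset can be trained on in a few lines. The following is a minimal sketch in the spirit of the "MMLSpark on Adult Census" sample; the dataset path and the `income` label column are illustrative assumptions, not taken from the sample itself.

```python
# Sketch of implicit featurization with MMLSpark (v0.9.9 API).
# Path and column names are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier, ComputeModelStatistics

spark = SparkSession.builder.appName("AdultCensusSketch").getOrCreate()

# Load a tabular dataset with mixed numeric and string columns.
data = spark.read.parquet("wasb:///example/AdultCensusIncome.parquet")
train, test = data.randomSplit([0.75, 0.25], seed=42)

# TrainClassifier names the label column directly and featurizes every
# other column implicitly -- no manual indexing or vector assembly.
model = TrainClassifier(model=LogisticRegression(), labelCol="income").fit(train)

# Score the held-out split and compute standard classification metrics.
scored = model.transform(test)
metrics = ComputeModelStatistics().transform(scored)
metrics.show()
```

Compare this with a plain SparkML pipeline, where the same task would require explicit `StringIndexer`, `OneHotEncoder`, and `VectorAssembler` stages before the estimator.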


To step through this how-to guide, you need an installation of Azure Machine Learning Workbench; the later steps also require an Azure HDInsight Spark cluster.

Run Your Experiment in a Docker Container

Azure Machine Learning Workbench is configured to use MMLSpark when you run experiments in a Docker container, either locally or in a remote VM. This capability lets you easily debug and experiment with your PySpark models before running them at scale on a cluster.

To get started with an example, create a new project and select the "MMLSpark on Adult Census" Gallery example. Select "Docker" as the compute context from the project dashboard, and click "Run." Azure Machine Learning Workbench builds the Docker container to run the PySpark program, and then executes it.
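The same run can also be launched from the Workbench CLI instead of the dashboard. The script name below is a placeholder for the sample's training script, not a name confirmed by this article.

```shell
# Submit the sample's training script to the local Docker compute
# context (script name is a placeholder for the sample's entry script).
az ml experiment submit -c docker train_mmlspark.py
```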

After the run has completed, you can view the results in the run history view of Azure Machine Learning Workbench.

Install MMLSpark on Azure HDInsight Spark Cluster

To complete this and the following steps, you first need to create an Azure HDInsight Spark cluster.

By default, Azure Machine Learning Workbench installs the MMLSpark package on your cluster when you run your experiment. You can control this behavior, and install other Spark packages, by editing a file named aml_config/spark_dependencies.yml in your project folder.

# Spark configuration properties.
configuration:
  "spark.app.name": "Azure ML Experiment"
  "spark.yarn.maxAppAttempts": 1

repositories:
  - "https://mmlspark.azureedge.net/maven"

packages:
  - group: "com.microsoft.ml.spark"
    artifact: "mmlspark_2.11"
    version: "0.9.9"

You can also install MMLSpark directly on your HDInsight Spark cluster using a Script Action.

Set up Azure HDInsight Spark Cluster as Compute Target

Open a CLI window from Azure Machine Learning Workbench by going to the "File" menu and clicking "Open Command Prompt."

In the CLI window, run the following commands:

az ml computetarget attach cluster --name <myhdi> --address <> --username <sshusername> --password <sshpwd> 
az ml experiment prepare -c <myhdi>

Now the cluster is available as a compute target for the project.

Run Experiment on Azure HDInsight Spark Cluster

Go back to the project dashboard of the "MMLSpark on Adult Census" example. Select your cluster as the compute target, and click "Run."

Azure Machine Learning Workbench submits the Spark job to the cluster. You can monitor the progress and view the results in the run history view.
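As with the Docker run, the cluster run can be submitted from the Workbench CLI by naming the attached compute target. The target name below matches the `<myhdi>` placeholder used in the attach command earlier; the script name is again a placeholder.

```shell
# Submit the same training script to the attached HDInsight cluster
# ("myhdi" is the name given in "az ml computetarget attach";
# the script name is a placeholder).
az ml experiment submit -c myhdi train_mmlspark.py
```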

Next steps

For more information about the MMLSpark library, and for examples, see the MMLSpark GitHub repository.

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.