Pipelines and Azure Machine Learning

In this article, learn about the machine learning pipelines you can build with the Azure Machine Learning SDK for Python and the advantages of using pipelines.

What are machine learning pipelines?

Using machine learning (ML) pipelines, data scientists, data engineers, and IT professionals can collaborate on the steps involved in:

  • Data preparation, such as normalizations and transformations
  • Model training
  • Model evaluation
  • Deployment

The following diagram shows an example pipeline:

[Figure: Machine learning pipelines in Azure Machine Learning service]

Why build pipelines with Azure Machine Learning?

The Azure Machine Learning SDK for Python can be used to create ML pipelines as well as to submit and track individual pipeline runs.

Pipelines make your workflow simpler, faster, more portable, and easier to reuse. When building pipelines with Azure Machine Learning, you can focus on what you know best, machine learning, rather than on infrastructure.

Using distinct steps makes it possible to rerun only the steps you need as you tweak and test your workflow. A step is a computational unit in the pipeline. As shown in the diagram above, the task of preparing data may involve many steps including, but not limited to, normalization, transformation, validation, and featurization. Data sources and intermediate data are reused across the pipeline, which saves compute time and resources.

Once the pipeline is designed, there is often more fine-tuning around the training loop of the pipeline. When you rerun a pipeline, the run jumps to the steps that need to be rerun, such as a step whose training script was updated, and skips the steps that haven't changed. The same applies to the scripts a step executes: if they are unchanged, the step is not rerun.
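The rerun behavior described above can be illustrated with a small, self-contained sketch. It is not the Azure Machine Learning SDK and makes no claim about how the service decides what to skip. It assumes a simple scheme in which each step is identified by a hash of its script text and of its upstream steps' hashes. The step names, script strings, and the fingerprint and run_pipeline helpers are all hypothetical.

```python
import hashlib
import json


def fingerprint(script_source, input_fingerprints):
    """Hash a step's script text together with the fingerprints of its inputs."""
    payload = json.dumps(
        {"script": script_source, "inputs": input_fingerprints}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(steps, cache):
    """Walk the steps in order, skipping any whose fingerprint is already cached.

    steps: list of (name, script_source, input_step_names), upstream steps first.
    cache: dict of fingerprints seen in earlier runs (kept between runs).
    Returns the names of the steps that would actually execute.
    """
    executed = []
    prints = {}
    for name, script_source, input_names in steps:
        # A step's identity depends on its own script and on everything upstream.
        fp = fingerprint(script_source, [prints[n] for n in input_names])
        prints[name] = fp
        if fp not in cache:
            executed.append(name)  # a real system would run the step here
            cache[fp] = True
    return executed


cache = {}
steps = [
    ("prep", "normalize(data)", []),
    ("train", "fit(model, lr=0.1)", ["prep"]),
    ("evaluate", "score(model)", ["train"]),
]
first = run_pipeline(steps, cache)

# Change only the training script; the unchanged prep step is skipped.
steps[1] = ("train", "fit(model, lr=0.01)", ["prep"])
second = run_pipeline(steps, cache)

print(first)   # ['prep', 'train', 'evaluate']
print(second)  # ['train', 'evaluate']
```

In this sketch the evaluate step reruns on the second pass even though its own script is unchanged, because the output of the step it consumes has changed.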

With Azure Machine Learning, you can use various toolkits and frameworks such as Microsoft Cognitive Toolkit or TensorFlow for each step in your pipeline. Azure coordinates between the various compute targets you use so that your intermediate data can be shared with the downstream compute targets easily.

You can track the metrics for your pipeline experiments directly in the Azure portal.

Key advantages

The key advantages of building pipelines for your machine learning workflows are:

  • Unattended runs: Schedule steps to run in parallel or in sequence in a reliable and unattended manner. Since data prep and modeling can last days or weeks, you can focus on other tasks while your pipeline is running.
  • Mixed and diverse compute: Use multiple pipelines that are reliably coordinated across heterogeneous and scalable compute and storage resources. Individual pipeline steps can be run on different compute targets, such as HDInsight, GPU Data Science VMs, and Databricks, to make efficient use of available compute options.
  • Reusability: Pipelines can be templatized for specific scenarios such as retraining and batch scoring. They can be triggered from external systems via simple REST calls.
  • Tracking and versioning: Instead of manually tracking data and result paths as you iterate, use the pipelines SDK to explicitly name and version your data sources, inputs, and outputs, and to manage scripts and data separately for increased productivity.

The Python SDK for pipelines

Use Python to create your ML pipelines. The Azure Machine Learning SDK offers imperative constructs for sequencing and parallelizing the steps in your pipelines when no data dependency is present. You can interact with it in Jupyter notebooks or in your preferred IDE.

By declaring the data dependencies between steps, you let the execution order follow from the data each step consumes and produces. The SDK includes a framework of pre-built modules for common tasks such as data transfer and model publishing. The framework can be extended to model your own conventions by implementing custom steps that are reusable across pipelines. Compute targets and storage resources can also be managed directly from the SDK.
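To make the idea of declared data dependencies concrete, here is a minimal sketch using only the Python standard library (graphlib, Python 3.9 or later). It does not use the Azure Machine Learning SDK. It shows how an execution order, and the groups of steps that could run in parallel, can be derived once each step declares which steps' outputs it consumes. The step names and the execution_waves helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps; each maps to the set of steps whose output it consumes.
dependencies = {
    "normalize": set(),
    "featurize": set(),
    "train": {"normalize", "featurize"},
    "evaluate": {"train"},
}


def execution_waves(deps):
    """Group steps into waves; steps in one wave share no data dependency."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all available
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
print(waves)  # [['featurize', 'normalize'], ['train'], ['evaluate']]
```

The two data preparation steps land in the same wave because neither consumes the other's output, so they could run in parallel. The train step waits for both.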

Pipelines can be saved as templates and can be deployed to a REST endpoint so you can schedule batch-scoring or retraining jobs.
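As a rough illustration of triggering a published pipeline from an external system, the sketch below builds an HTTP POST request with the Python standard library without sending it. The endpoint URL and token are placeholders. The JSON body with an "ExperimentName" field is an assumption about the payload, so check the SDK reference docs for the fields your endpoint expects. The build_trigger_request helper is hypothetical.

```python
import json
import urllib.request


def build_trigger_request(endpoint_url, bearer_token, experiment_name):
    """Build (but do not send) a POST request that would start a pipeline run."""
    # Assumed payload shape; verify the field names against the reference docs.
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        method="POST",
        headers={
            "Authorization": "Bearer " + bearer_token,
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_trigger_request(
    "https://example.invalid/pipelines/placeholder-id",
    "placeholder-token",
    "batch-scoring",
)
print(request.get_method(), request.full_url)
# To send it from a scheduler or another service: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it straightforward to inspect or log the call before wiring it into a scheduler.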

Check out the Python SDK reference docs for pipelines and the notebook in the next section to see how to build your own.

Example notebooks

The following notebook demonstrates pipelines with Azure Machine Learning: pipeline/pipeline-batch-scoring.ipynb.

Get this notebook:

  1. Import the sample notebooks into Azure Notebooks. (Your organization may require administrator consent before you can sign in.)

  2. See the README in the imported library for further instructions to run the notebooks.