What are Azure Machine Learning pipelines?
In this article, you learn how a machine learning pipeline helps you build, optimize, and manage your machine learning workflow.
Which Azure pipeline technology should I use?
The Azure cloud provides several types of pipeline, each with a different purpose. The following table lists the different pipelines and what they are used for:
|Scenario||Primary persona||Azure offering||OSS offering||Canonical pipe||Strengths|
|Model orchestration (Machine learning)||Data scientist||Azure Machine Learning Pipelines||Kubeflow Pipelines||Data -> Model||Distribution, caching, code-first, reuse|
|Data orchestration (Data prep)||Data engineer||Azure Data Factory pipelines||Apache Airflow||Data -> Data||Strongly typed movement, data-centric activities|
|Code & app orchestration (CI/CD)||App Developer / Ops||Azure Pipelines||Jenkins||Code + Model -> App/Service||Most open and flexible activity support, approval queues, phases with gating|
What can machine learning pipelines do?
An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. Subtasks are encapsulated as a series of steps within the pipeline. An Azure Machine Learning pipeline can be as simple as one that calls a Python script, so may do just about anything. Pipelines should focus on machine learning tasks such as:
- Data preparation including importing, validating and cleaning, munging and transformation, normalization, and staging
- Training configuration including parameterizing arguments, filepaths, and logging / reporting configurations
- Training and validating efficiently and repeatedly. Efficiency might come from specifying specific data subsets, different hardware compute resources, distributed processing, and progress monitoring
- Deployment, including versioning, scaling, provisioning, and access control
Independent steps allow multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources. Separate steps also make it easy to use different compute types/sizes for each step.
After the pipeline is designed, there is often more fine-tuning around the training loop of the pipeline. When you rerun a pipeline, the run jumps to the steps that need to be rerun, such as an updated training script. Steps that do not need to be rerun are skipped.
With pipelines, you may choose to use different hardware for different tasks. Azure coordinates the various compute targets you use, so your intermediate data seamlessly flows to downstream compute targets.
You can track the metrics for your pipeline experiments directly in Azure portal or your workspace landing page (preview). After a pipeline has been published, you can configure a REST endpoint, which allows you to rerun the pipeline from any platform or stack.
In short, all of the complex tasks of the machine learning lifecycle can be helped with pipelines. Other Azure pipeline technologies have their own strengths. Azure Data Factory pipelines excels at working with data and Azure Pipelines is the right tool for continuous integration and deployment. But if your focus is machine learning, Azure Machine Learning pipelines are likely to be the best choice for your workflow needs.
Many programming ecosystems have tools that orchestrate resource, library, or compilation dependencies. Generally, these tools use file timestamps to calculate dependencies. When a file is changed, only it and its dependents are updated (downloaded, recompiled, or packaged). Azure Machine Learning pipelines extend this concept. Like traditional build tools, pipelines calculate dependencies between steps and only perform the necessary recalculations.
The dependency analysis in Azure Machine Learning pipelines is more sophisticated than simple timestamps though. Every step may run in a different hardware and software environment. Data preparation might be a time-consuming process but not need to run on hardware with powerful GPUs, certain steps might require OS-specific software, you might want to use distributed training, and so forth.
Azure Machine Learning automatically orchestrates all of the dependencies between pipeline steps. This orchestration might include spinning up and down Docker images, attaching and detaching compute resources, and moving data between the steps in a consistent and automatic manner.
Coordinating the steps involved
When you create and run a
Pipeline object, the following high-level steps occur:
- For each step, the service calculates requirements for:
- Hardware compute resources
- OS resources (Docker image(s))
- Software resources (Conda / virtualenv dependencies)
- Data inputs
- The service determines the dependencies between steps, resulting in a dynamic execution graph
- When each node in the execution graph runs:
- The service configures the necessary hardware and software environment (perhaps reusing existing resources)
- The step runs, providing logging and monitoring information to its containing
- When the step completes, its outputs are prepared as inputs to the next step and/or written to storage
- Resources that are no longer needed are finalized and detached
Building pipelines with the Python SDK
In the Azure Machine Learning Python SDK, a pipeline is a Python object defined in the
azureml.pipeline.core module. A Pipeline object contains an ordered sequence of one or more PipelineStep objects. The
PipelineStep class is abstract and the actual steps will be of subclasses such as EstimatorStep, PythonScriptStep, or DataTransferStep. The ModuleStep class holds a reusable sequence of steps that can be shared among pipelines. A
Pipeline runs as part of an
An Azure machine learning pipeline is associated with an Azure Machine Learning workspace and a pipeline step is associated with a compute target available within that workspace. For more information, see Create and manage Azure Machine Learning workspaces in the Azure portal or What are compute targets in Azure Machine Learning?.
A simple Python Pipeline
This snippet shows the objects and calls needed to create and run a
ws = Workspace.from_config() blob_store = Datastore(ws, "workspaceblobstore") compute_target = ws.compute_targets["STANDARD_NC6"] experiment = Experiment(ws, 'MyExperiment') input_data = Dataset.File.from_files( DataPath(datastore, '20newsgroups/20news.pkl')) prepped_data_path = OutputFileDatasetConfig(name="output_path") dataprep_step = PythonScriptStep( name="prep_data", script_name="dataprep.py", source_directory="prep_src", compute_target=compute_target, arguments=["--prepped_data_path", prepped_data_path], inputs=[input_dataset.as_named_input('raw_data').as_mount() ] ) prepped_data = prepped_data_path.read_delimited_files() train_step = PythonScriptStep( name="train", script_name="train.py", compute_target=compute_target, arguments=["--prepped_data", prepped_data], source_directory="train_src" ) steps = [ dataprep_step, train_step ] pipeline = Pipeline(workspace=ws, steps=steps) pipeline_run = experiment.submit(pipeline) pipeline_run.wait_for_completion()
The snippet starts with common Azure Machine Learning objects, a
Datastore, a ComputeTarget, and an
Experiment. Then, the code creates the objects to hold
input_data is an instance of FileDataset and the
prepped_data_path is an instance of OutputFileDatasetConfig. For
OutputFileDatasetConfig the default behavior is to copy the output to the
workspaceblobstore datastore under the path
run-id is the Run's ID and
output-name is an autogenerated value if not specified by the developer.
The data preparation code (not shown), writes delimited files to
prepped_data_path. These outputs from the data preparation step are passed as
prepped_data to the training step.
steps holds the two
train_step. Azure Machine Learning will analyze the data dependency of
prepped_data and run
Then, the code instantiates the
Pipeline object itself, passing in the workspace and steps array. The call to
experiment.submit(pipeline) begins the Azure ML pipeline run. The call to
wait_for_completion() blocks until the pipeline is finished.
To learn more about connecting your pipeline to your data, see the articles Data access in Azure Machine Learning and Moving data into and between ML pipeline steps (Python).
Building pipelines with the designer
Developers who prefer a visual design surface can use the Azure Machine Learning designer to create pipelines. You can access this tool from the Designer selection on the homepage of your workspace. The designer allows you to drag and drop steps onto the design surface.
When you visually design pipelines, the inputs and outputs of a step are displayed visibly. You can drag and drop data connections, allowing you to quickly understand and modify the dataflow of your pipeline.
The key advantages of using pipelines for your machine learning workflows are:
|Unattended runs||Schedule steps to run in parallel or in sequence in a reliable and unattended manner. Data preparation and modeling can last days or weeks, and pipelines allow you to focus on other tasks while the process is running.|
|Heterogenous compute||Use multiple pipelines that are reliably coordinated across heterogeneous and scalable compute resources and storage locations. Make efficient use of available compute resources by running individual pipeline steps on different compute targets, such as HDInsight, GPU Data Science VMs, and Databricks.|
|Reusability||Create pipeline templates for specific scenarios, such as retraining and batch-scoring. Trigger published pipelines from external systems via simple REST calls.|
|Tracking and versioning||Instead of manually tracking data and result paths as you iterate, use the pipelines SDK to explicitly name and version your data sources, inputs, and outputs. You can also manage scripts and data separately for increased productivity.|
|Modularity||Separating areas of concerns and isolating changes allows software to evolve at a faster rate with higher quality.|
|Collaboration||Pipelines allow data scientists to collaborate across all areas of the machine learning design process, while being able to concurrently work on pipeline steps.|
Azure Machine Learning pipelines are a powerful facility that begins delivering value in the early development stages. The value increases as the team and project grows. This article has explained how pipelines are specified with the Azure Machine Learning Python SDK and orchestrated on Azure. You've seen some simple source code and been introduced to a few of the
PipelineStep classes that are available. You should have a sense of when to use Azure Machine Learning pipelines and how Azure runs them.