Define machine learning pipelines in YAML

Learn how to define your machine learning pipelines in YAML. When using the machine learning extension for the Azure CLI, many of the pipeline-related commands expect a YAML file that defines the pipeline.

The following table lists what is and is not currently supported when defining a pipeline in YAML:

Step type          Supported?
PythonScriptStep   Yes
AdlaStep           Yes
AzureBatchStep     Yes
DatabricksStep     Yes
DataTransferStep   Yes
AutoMLStep         No
HyperDriveStep     No
ModuleStep         No
MPIStep            No
EstimatorStep      No

Pipeline definition

A pipeline definition uses the following keys, which correspond to the Pipeline class:

YAML key          Description
name              The description of the pipeline.
parameters        Parameter(s) to the pipeline.
data_references   Defines how and where data should be made available in a run.
default_compute   Default compute target where all steps in the pipeline run.
steps             The steps used in the pipeline.
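
To show how these keys fit together, the following skeleton sketches the top-level layout of a pipeline definition. Only the keys come from the table above; the values (MyPipeline, my-compute, my_script.py, and so on) are hypothetical placeholders rather than names from a real workspace.

```yaml
pipeline:
    # Placeholder values throughout; only the keys are taken from the table above.
    name: MyPipeline
    parameters:
        MyParameter:
            type: int
            default: 1
    data_references:
        my_data:
            datastore: workspaceblobstore
            path_on_datastore: "my_folder/my_file.csv"
    default_compute: my-compute
    steps:
        Step1:
            type: "PythonScriptStep"
            name: "MyStep"
            script_name: "my_script.py"
            source_directory: "./scripts"
```

The sections that follow describe each of these keys in more detail.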

Parameters

The parameters section uses the following keys, which correspond to the PipelineParameter class:

YAML key   Description
type       The value type of the parameter. Valid types are string, int, float, bool, or datapath.
default    The default value.

Each parameter is named. For example, the following YAML snippet defines three parameters named NumIterationsParameter, DataPathParameter, and NodeCountParameter:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        NumIterationsParameter:
            type: int
            default: 40
        DataPathParameter:
            type: datapath
            default:
                datastore: workspaceblobstore
                path_on_datastore: sample2.txt
        NodeCountParameter:
            type: int
            default: 4
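
The step examples later in this article consume pipeline parameters through a source key. As a sketch based on that pattern, the fragment below binds a step-level parameter to NumIterationsParameter and sets a second one to a literal value. The step name and the step-level parameter names are hypothetical, and other step keys are omitted.

```yaml
pipeline:
    name: SamplePipelineFromYaml
    parameters:
        NumIterationsParameter:
            type: int
            default: 40
    steps:
        Step1:
            type: "PythonScriptStep"
            name: "MyStep"                          # hypothetical step name
            script_name: "train.py"
            parameters:
                NUM_ITERATIONS:                     # hypothetical step-level parameter
                    source: NumIterationsParameter  # takes its value from the pipeline parameter
                FIXED_VALUE: 7                      # literal value, not tied to a pipeline parameter
```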

Data reference

The data_references section uses the following keys, which correspond to the DataReference class:

YAML key            Description
datastore           The datastore to reference.
path_on_datastore   The relative path in the backing storage for the data reference.

Each data reference is contained in a key. For example, the following YAML snippet defines a data reference stored in the key named employee_data:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        employee_data:
            datastore: adftestadla
            path_on_datastore: "adla_sample/sample_input.csv"
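
The step examples in this article pass a data reference to a step by naming its key in the source of an input. A minimal fragment of that pattern, reusing employee_data from the snippet above, might look like the following. It would sit under the same pipeline key; the remaining step keys are omitted for brevity.

```yaml
    steps:
        Step1:
            type: "AdlaStep"
            name: "MyAdlaStep"
            script_name: "sample_script.usql"
            inputs:
                employee_data:                # name of the step input
                    source: employee_data     # key defined under data_references
```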

Steps

Steps define a computational environment, along with the files to run on the environment. To define the type of a step, use the type key:

Step type          Description
AdlaStep           Runs a U-SQL script with Azure Data Lake Analytics. Corresponds to the AdlaStep class.
AzureBatchStep     Runs jobs using Azure Batch. Corresponds to the AzureBatchStep class.
DatabricksStep     Adds a Databricks notebook, Python script, or JAR. Corresponds to the DatabricksStep class.
DataTransferStep   Transfers data between storage options. Corresponds to the DataTransferStep class.
PythonScriptStep   Runs a Python script. Corresponds to the PythonScriptStep class.

ADLA step

YAML key                Description
script_name             The name of the U-SQL script (relative to the source_directory).
compute_target          The Azure Data Lake compute target to use for this step.
parameters              Parameters to the pipeline.
inputs                  Inputs can be InputPortBinding, DataReference, PortDataReference, PipelineData, Dataset, DatasetDefinition, or PipelineDataset.
outputs                 Outputs can be either PipelineData or OutputPortBinding.
source_directory        Directory that contains the script, assemblies, etc.
priority                The priority value to use for the current job.
params                  Dictionary of name-value pairs.
degree_of_parallelism   The degree of parallelism to use for this job.
runtime_version         The runtime version of the Data Lake Analytics engine.
allow_reuse             Determines whether the step should reuse previous results when run again with the same settings.

The following example contains an ADLA Step definition:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        employee_data:
            datastore: adftestadla
            path_on_datastore: "adla_sample/sample_input.csv"
    default_compute: adlacomp
    steps:
        Step1:
            runconfig: "D:\\Yaml\\default_runconfig.yml"
            parameters:
                NUM_ITERATIONS_2:
                    source: PipelineParam1
                NUM_ITERATIONS_1: 7
            type: "AdlaStep"
            name: "MyAdlaStep"
            script_name: "sample_script.usql"
            source_directory: "D:\\scripts\\Adla"
            inputs:
                employee_data:
                    source: employee_data
            outputs:
                OutputData:
                    destination: Output4
                    datastore: adftestadla
                    bind_mode: mount
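
The example above leaves several optional keys from the table unset. As a hedged sketch, the fragment below adds some of them to the same step. The numeric values are illustrative assumptions rather than recommended settings, and the entries under params are hypothetical names.

```yaml
        Step1:
            type: "AdlaStep"
            name: "MyAdlaStep"
            script_name: "sample_script.usql"
            source_directory: "D:\\scripts\\Adla"
            compute_target: adlacomp          # compute target for this step
            priority: 100                     # illustrative value only
            degree_of_parallelism: 2          # illustrative value only
            params:                           # name-value pairs; names are hypothetical
                region: "west"
                year: "2019"
            allow_reuse: false
```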

Azure Batch step

YAML key                         Description
compute_target                   The Azure Batch compute target to use for this step.
inputs                           Inputs can be InputPortBinding, DataReference, PortDataReference, PipelineData, Dataset, DatasetDefinition, or PipelineDataset.
outputs                          Outputs can be either PipelineData or OutputPortBinding.
source_directory                 Directory that contains the module binaries, executable, assemblies, etc.
executable                       Name of the command/executable that will be run as part of this job.
create_pool                      Boolean flag to indicate whether to create the pool before running the job.
delete_batch_job_after_finish    Boolean flag to indicate whether to delete the job from the Batch account after it's finished.
delete_batch_pool_after_finish   Boolean flag to indicate whether to delete the pool after the job finishes.
is_positive_exit_code_failure    Boolean flag to indicate if the job fails if the task exits with a positive code.
vm_image_urn                     The VM image URN to use when create_pool is True and the VM uses VirtualMachineConfiguration.
pool_id                          The ID of the pool where the job will run.
allow_reuse                      Determines whether the step should reuse previous results when run again with the same settings.

The following example contains an Azure Batch step definition:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        input:
            datastore: workspaceblobstore
            path_on_datastore: "input.txt"
    default_compute: testbatch
    steps:
        Step1:
            runconfig: "D:\\Yaml\\default_runconfig.yml"
            parameters:
                NUM_ITERATIONS_2:
                    source: PipelineParam1
                NUM_ITERATIONS_1: 7
            type: "AzureBatchStep"
            name: "MyAzureBatchStep"
            pool_id: "MyPoolName"
            create_pool: true
            executable: "azurebatch.cmd"
            source_directory: "D:\\scripts\\AzureBatch"
            allow_reuse: false
            inputs:
                input:
                    source: input
            outputs:
                output:
                    destination: output
                    datastore: workspaceblobstore
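
The example above does not set the cleanup and exit-code flags listed in the table. The following fragment is a sketch of how they might be added to the same step. The chosen true/false values are assumptions for illustration, and keys not relevant to the flags are omitted.

```yaml
        Step1:
            type: "AzureBatchStep"
            name: "MyAzureBatchStep"
            pool_id: "MyPoolName"
            create_pool: true
            executable: "azurebatch.cmd"
            delete_batch_job_after_finish: true     # remove the job from the Batch account when done
            delete_batch_pool_after_finish: false   # keep the pool after the job finishes
            is_positive_exit_code_failure: true     # treat a positive exit code as a failed job
```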

Databricks step

YAML key           Description
compute_target     The Azure Databricks compute target to use for this step.
inputs             Inputs can be InputPortBinding, DataReference, PortDataReference, PipelineData, Dataset, DatasetDefinition, or PipelineDataset.
outputs            Outputs can be either PipelineData or OutputPortBinding.
run_name           The name in Databricks for this run.
source_directory   Directory that contains the script and other files.
num_workers        The static number of workers for the Databricks run cluster.
runconfig          The path to a .runconfig file. This file is a YAML representation of the RunConfiguration class. For more information on the structure of this file, see runconfigschema.json.
allow_reuse        Determines whether the step should reuse previous results when run again with the same settings.

The following example contains a Databricks step:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        adls_test_data:
            datastore: adftestadla
            path_on_datastore: "testdata"
        blob_test_data:
            datastore: workspaceblobstore
            path_on_datastore: "dbtest"
    default_compute: mydatabricks
    steps:
        Step1:
            runconfig: "D:\\Yaml\\default_runconfig.yml"
            parameters:
                NUM_ITERATIONS_2:
                    source: PipelineParam1
                NUM_ITERATIONS_1: 7
            type: "DatabricksStep"
            name: "MyDatabrickStep"
            run_name: "DatabricksRun"
            python_script_name: "train-db-local.py"
            source_directory: "D:\\scripts\\Databricks"
            num_workers: 1
            allow_reuse: true
            inputs:
                blob_test_data:
                    source: blob_test_data
            outputs:
                OutputData:
                    destination: Output4
                    datastore: workspaceblobstore
                    bind_mode: mount

Data transfer step

YAML key                     Description
compute_target               The Azure Data Factory compute target to use for this step.
source_data_reference        Input connection that serves as the source of data transfer operations. Supported values are InputPortBinding, DataReference, PortDataReference, PipelineData, Dataset, DatasetDefinition, or PipelineDataset.
destination_data_reference   Input connection that serves as the destination of data transfer operations. Supported values are PipelineData and OutputPortBinding.
allow_reuse                  Determines whether the step should reuse previous results when run again with the same settings.

The following example contains a data transfer step:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        adls_test_data:
            datastore: adftestadla
            path_on_datastore: "testdata"
        blob_test_data:
            datastore: workspaceblobstore
            path_on_datastore: "testdata"
    default_compute: adftest
    steps:
        Step1:
            runconfig: "D:\\Yaml\\default_runconfig.yml"
            parameters:
                NUM_ITERATIONS_2:
                    source: PipelineParam1
                NUM_ITERATIONS_1: 7
            type: "DataTransferStep"
            name: "MyDataTransferStep"
            adla_compute_name: adftest
            source_data_reference:
                adls_test_data:
                    source: adls_test_data
            destination_data_reference:
                blob_test_data:
                    source: blob_test_data

Python script step

YAML key           Description
compute_target     The compute target to use for this step. The compute target can be an Azure Machine Learning Compute, Virtual Machine (such as the Data Science VM), or HDInsight.
inputs             Inputs can be InputPortBinding, DataReference, PortDataReference, PipelineData, Dataset, DatasetDefinition, or PipelineDataset.
outputs            Outputs can be either PipelineData or OutputPortBinding.
script_name        The name of the Python script (relative to source_directory).
source_directory   Directory that contains the script, Conda environment, etc.
runconfig          The path to a .runconfig file. This file is a YAML representation of the RunConfiguration class. For more information on the structure of this file, see runconfigschema.json.
allow_reuse        Determines whether the step should reuse previous results when run again with the same settings.

The following example contains a Python script step:

pipeline:
    name: SamplePipelineFromYaml
    parameters:
        PipelineParam1:
            type: int
            default: 3
    data_references:
        DataReference1:
            datastore: workspaceblobstore
            path_on_datastore: testfolder/sample.txt
    default_compute: cpu-cluster
    steps:
        Step1:
            runconfig: "D:\\Yaml\\default_runconfig.yml"
            parameters:
                NUM_ITERATIONS_2:
                    source: PipelineParam1
                NUM_ITERATIONS_1: 7
            type: "PythonScriptStep"
            name: "MyPythonScriptStep"
            script_name: "train.py"
            allow_reuse: True
            source_directory: "D:\\scripts\\PythonScript"
            inputs:
                InputData:
                    source: DataReference1
            outputs:
                OutputData:
                    destination: Output4
                    datastore: workspaceblobstore
                    bind_mode: mount

Schedules

A pipeline schedule can be either datastore-triggered or recurring based on a time interval. Use the following keys to define a schedule:

YAML key                   Description
description                A description of the schedule.
recurrence                 Contains recurrence settings, if the schedule is recurring.
pipeline_parameters        Any parameters that are required by the pipeline.
wait_for_provisioning      Whether to wait for provisioning of the schedule to complete.
wait_timeout               The number of seconds to wait before timing out.
datastore_name             The datastore to monitor for modified/added blobs.
polling_interval           The interval, in minutes, between polls for modified/added blobs. Default value: 5 minutes. Only supported for datastore schedules.
data_path_parameter_name   The name of the data path pipeline parameter to set with the changed blob path. Only supported for datastore schedules.
continue_on_step_failure   Whether to continue execution of other steps in the submitted PipelineRun if a step fails. If provided, will override the continue_on_step_failure setting of the pipeline.
path_on_datastore          Optional. The path on the datastore to monitor for modified/added blobs. The path is under the container for the datastore, so the actual path the schedule monitors is container/path_on_datastore. If none, the datastore container is monitored. Additions/modifications made in a subfolder of the path_on_datastore are not monitored. Only supported for datastore schedules.

The following example contains the definition for a datastore-triggered schedule:

Schedule: 
      description: "Test create with datastore" 
      recurrence: ~ 
      pipeline_parameters: {} 
      wait_for_provisioning: True 
      wait_timeout: 3600 
      datastore_name: "workspaceblobstore" 
      polling_interval: 5 
      data_path_parameter_name: "input_data" 
      continue_on_step_failure: ~ 
      path_on_datastore: "file/path" 
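
The data_path_parameter_name value refers to a pipeline parameter of type datapath. Assuming the schedule above is attached to a pipeline that is also defined in YAML, that pipeline would need a parameter named input_data. The fragment below is a sketch of such a parameter; the default datastore and path are assumptions chosen to mirror the schedule.

```yaml
pipeline:
    name: SamplePipelineFromYaml
    parameters:
        input_data:                           # same name as data_path_parameter_name in the schedule
            type: datapath
            default:
                datastore: workspaceblobstore
                path_on_datastore: "file/path"
```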

When defining a recurring schedule, use the following keys under recurrence:

YAML key      Description
frequency     How often the schedule recurs. Valid values are "Minute", "Hour", "Day", "Week", or "Month".
interval      How often the schedule fires. The integer value is the number of time units to wait until the schedule fires again.
start_time    The start time for the schedule. The string format of the value is YYYY-MM-DDThh:mm:ss. If no start time is provided, the first workload is run instantly and future workloads are run based on the schedule. If the start time is in the past, the first workload is run at the next calculated run time.
time_zone     The time zone for the start time. If no time zone is provided, UTC is used.
hours         If frequency is "Day" or "Week", you can specify one or more integers from 0 to 23, separated by commas, as the hours of the day when the pipeline should run. Use either time_of_day or the combination of hours and minutes, not both.
minutes       If frequency is "Day" or "Week", you can specify one or more integers from 0 to 59, separated by commas, as the minutes of the hour when the pipeline should run. Use either time_of_day or the combination of hours and minutes, not both.
time_of_day   If frequency is "Day" or "Week", you can specify a time of day for the schedule to run. The string format of the value is hh:mm. Use either time_of_day or the combination of hours and minutes, not both.
week_days     If frequency is "Week", you can specify one or more days, separated by commas, when the schedule should run. Valid values are "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", and "Sunday".

The following example contains the definition for a recurring schedule:

Schedule: 
    description: "Test create with recurrence" 
    recurrence: 
        frequency: Week # Can be "Minute", "Hour", "Day", "Week", or "Month". 
        interval: 1 # Number of frequency units to wait before the schedule fires again. 
        start_time: 2019-06-07T10:50:00 
        time_zone: UTC 
        hours: 
        - 1 
        minutes: 
        - 0 
        time_of_day: null 
        week_days: 
        - Friday 
    pipeline_parameters: 
        'a': 1 
    wait_for_provisioning: True 
    wait_timeout: 3600 
    datastore_name: ~ 
    polling_interval: ~ 
    data_path_parameter_name: ~ 
    continue_on_step_failure: ~ 
    path_on_datastore: ~ 
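
The example above sets hours and minutes and leaves time_of_day empty. As a variation, the following sketch defines a daily schedule that uses time_of_day instead, leaving hours and minutes unset. The description and the 09:30 run time are hypothetical values chosen for illustration.

```yaml
Schedule:
    description: "Run every day at 09:30"   # hypothetical description
    recurrence:
        frequency: Day
        interval: 1                          # every 1 day
        start_time: 2019-06-07T10:50:00
        time_zone: UTC
        hours: ~                             # unset because time_of_day is used
        minutes: ~                           # unset because time_of_day is used
        time_of_day: "09:30"                 # hh:mm format
        week_days: ~                         # only applies when frequency is Week
    pipeline_parameters: {}
    wait_for_provisioning: True
    wait_timeout: 3600
    datastore_name: ~
    polling_interval: ~
    data_path_parameter_name: ~
    continue_on_step_failure: ~
    path_on_datastore: ~
```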

Next steps

Learn how to use the CLI extension for Azure Machine Learning.