CLI (v2) parallel job YAML schema

APPLIES TO: Azure CLI ml extension v2 (current)

Important

Parallel job can only be used as a single step inside an Azure Machine Learning pipeline job. Thus, there is no source JSON schema for parallel job at this time. This document lists the valid keys and their values when creating a parallel job in a pipeline.

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | const | Required. The type of job. | parallel | |
| inputs | object | Dictionary of inputs to the parallel job. The key is a name for the input within the context of the job and the value is the input value. <br><br> Inputs can be referenced in the program_arguments using the ${{ inputs.<input_name> }} expression. <br><br> Parallel job inputs can be referenced by pipeline inputs using the ${{ parent.inputs.<input_name> }} expression. For how to bind the inputs of a parallel step to the pipeline inputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job. | | |
| inputs.<input_name> | number, integer, boolean, string or object | One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification. | | |
| outputs | object | Dictionary of output configurations of the parallel job. The key is a name for the output within the context of the job and the value is the output configuration. <br><br> Parallel job outputs can be referenced by pipeline outputs using the ${{ parent.outputs.<output_name> }} expression. For how to bind the outputs of a parallel step to the pipeline outputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job. | | |
| outputs.<output_name> | object | You can leave the object empty, in which case by default the output will be of type uri_folder and Azure Machine Learning will system-generate an output location based on the following templatized path: {settings.datastore}/azureml/{job-name}/{output-name}/. File(s) to the output directory are written via read-write mount. If you want to specify a different mode for the output, provide an object containing the job output specification. | | |
| compute | string | Name of the compute target to execute the job on. The value can be either a reference to an existing compute in the workspace (using the azureml:<compute_name> syntax) or local to designate local execution. <br><br> When using a parallel job in a pipeline, you can leave this setting empty, in which case the compute is auto-selected by the default_compute of the pipeline. | | local |
| task | object | Required. The template for defining the distributed tasks for the parallel job. See Attributes of the task key. | | |
| input_data | object | Required. Defines which input data will be split into mini-batches to run the parallel job. Only applicable for referencing one of the parallel job inputs by using the ${{ inputs.<input_name> }} expression. | | |
| mini_batch_size | string | Defines the size of each mini-batch to split the input. <br><br> If the input_data is a folder or set of files, this number defines the file count for each mini-batch, for example, 10, 100. <br> If the input_data is tabular data from mltable, this number defines the approximate physical size of each mini-batch, for example, 100 kb, 100 mb. | | 1 |
| partition_keys | list | The keys used to partition the dataset into mini-batches. <br><br> If specified, data with the same key is partitioned into the same mini-batch. If both partition_keys and mini_batch_size are specified, the partition keys take effect. | | |
| mini_batch_error_threshold | integer | The number of failed mini-batches that can be ignored in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed. <br><br> A mini-batch is marked as failed if: <br> - the count of returns from run() is less than the mini-batch input count. <br> - an exception is caught in the custom run() code. <br><br> "-1" is the default, which means to ignore all failed mini-batches during the parallel job. | [-1, int.max] | -1 |
| logging_level | string | Defines which level of logs will be dumped to user log files. | INFO, WARNING, DEBUG | INFO |
| resources.instance_count | integer | The number of nodes to use for the job. | | 1 |
| max_concurrency_per_instance | integer | Defines the number of processes on each node of compute. <br><br> For a GPU compute, the default value is 1. <br> For a CPU compute, the default value is the number of cores. | | |
| retry_settings.max_retries | integer | The number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed and counted toward the mini_batch_error_threshold calculation. | | 2 |
| retry_settings.timeout | integer | The timeout in seconds for executing the custom run() function. If the execution time is higher than this threshold, the mini-batch is aborted and marked as failed in order to trigger a retry. | (0, 259200] | 60 |
| environment_variables | object | Dictionary of environment variable key-value pairs to set on the process where the command is executed. | | |
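
For orientation, the following minimal sketch shows how these keys fit together in a single parallel step of a pipeline job. All names here (my_parallel_step, raw_data, scored_data, pipeline_data, pipeline_output) are hypothetical placeholders, not values required by the schema:

jobs:
  my_parallel_step:
    type: parallel
    compute: azureml:cpu-cluster   # omit to fall back to the pipeline's default_compute
    inputs:
      raw_data: ${{parent.inputs.pipeline_data}}        # bound from a pipeline-level input
    outputs:
      scored_data: ${{parent.outputs.pipeline_output}}  # bound to a pipeline-level output
    input_data: ${{inputs.raw_data}}   # the input that gets split into mini-batches
    mini_batch_size: "10"              # 10 files per mini-batch for file-based input
    mini_batch_error_threshold: -1     # ignore all failed mini-batches
    retry_settings:
      max_retries: 2
      timeout: 60
    task:                              # see Attributes of the task key below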

Attributes of the task key

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | const | Required. The type of task. Only run_function is supported at this time. <br><br> In run_function mode, you're required to provide code, entry_script, and program_arguments to define a Python script with executable functions and arguments. Note: the parallel job only supports Python scripts in this mode. | run_function | run_function |
| code | string | Local path to the source code directory to be uploaded and used for the job. | | |
| entry_script | string | The Python file that contains the implementation of predefined parallel functions. For more information, see Prepare entry script to parallel job. | | |
| environment | string or object | Required. The environment to use for running the task. The value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. <br><br> To reference an existing environment, use the azureml:<environment_name>:<environment_version> syntax or azureml:<environment_name>@latest (to reference the latest version of an environment). <br><br> To define an inline environment, follow the Environment schema. Exclude the name and version properties as they aren't supported for inline environments. | | |
| program_arguments | string | The arguments to be passed to the entry script. May contain "--<arg_name> ${{inputs.<input_name>}}" references to inputs or outputs. <br><br> The parallel job provides a list of predefined arguments to set the configuration of the parallel run. For more information, see Predefined arguments for parallel job. | | |
| append_row_to | string | Aggregates all returns from each run of a mini-batch and outputs them into this file. May reference one of the outputs of the parallel job by using the expression ${{outputs.<output_name>}}. | | |
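
Putting these attributes together, a task block might look like the following sketch. The script and environment names (score.py, my-scoring-env) are hypothetical; the registered-environment reference shown here is one option, and the inline environment form used in the full example at the end of this article is the other:

task:
  type: run_function
  code: ./src                                   # local folder uploaded with the job
  entry_script: score.py                        # implements the predefined parallel functions such as run()
  environment: azureml:my-scoring-env@latest    # or an inline environment specification
  program_arguments: >-
    --scored_output ${{outputs.scored_data}}
  append_row_to: ${{outputs.scored_data}}       # aggregate run() returns into this output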

Job inputs

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | string | The type of job input. Specify mltable for input data that points to a location that has the mltable meta file, or uri_folder for input data that points to a folder source. | mltable, uri_folder | uri_folder |
| path | string | The path to the data to use as input. The value can be specified in a few ways: <br><br> - A local path to the data source file or folder, for example, path: ./iris.csv. The data gets uploaded during job submission. <br><br> - A URI of a cloud path to the file or folder to use as the input. Supported URI types are azureml, https, wasbs, abfss, adl. For more information on how to use the azureml:// URI format, see Core yaml syntax. <br><br> - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the azureml:<data_name>:<data_version> syntax or azureml:<data_name>@latest (to reference the latest version of that data asset), for example, path: azureml:cifar10-data:1 or path: azureml:cifar10-data@latest. | | |
| mode | string | Mode of how the data should be delivered to the compute target. <br><br> For read-only mount (ro_mount), the data is consumed as a mount path. A folder is mounted as a folder and a file is mounted as a file. Azure Machine Learning resolves the input to the mount path. <br><br> For download mode, the data is downloaded to the compute target. Azure Machine Learning resolves the input to the downloaded path. <br><br> If you only want the URL of the storage location of the data artifact(s) rather than mounting or downloading the data itself, you can use the direct mode. It passes in the URL of the storage location as the job input. In this case, you're fully responsible for handling credentials to access the storage. | ro_mount, download, direct | ro_mount |
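
As a sketch, the three path styles might appear in a parallel job's inputs block as follows (the input names and data asset names are hypothetical):

inputs:
  local_folder:
    type: uri_folder
    path: ./my-data                   # local path, uploaded during job submission
  registered_table:
    type: mltable
    path: azureml:my-table@latest     # registered data asset, latest version
    mode: direct                      # pass the storage URL instead of mounting
  cloud_folder:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/my-data/   # datastore URI
    mode: ro_mount                    # default: read-only mount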

Job outputs

| Key | Type | Description | Allowed values | Default value |
| --- | --- | --- | --- | --- |
| type | string | The type of job output. For the default uri_folder type, the output corresponds to a folder. | uri_folder | uri_folder |
| mode | string | Mode of how output file(s) are delivered to the destination storage. For read-write mount mode (rw_mount), the output directory is a mounted directory. For upload mode, the file(s) written are uploaded at the end of the job. | rw_mount, upload | rw_mount |
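
For example, to have the written file(s) uploaded at the end of the job instead of using the default read-write mount, a hypothetical output could be declared as:

outputs:
  predictions:
    type: uri_folder
    mode: upload    # upload at job completion instead of rw_mount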

Predefined arguments for parallel job

| Key | Description | Allowed values | Default value |
| --- | --- | --- | --- |
| --error_threshold | The threshold of failed items. Failed items are counted by the gap between the number of inputs and the number of returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. <br><br> Note: "-1" is the default, which means to ignore all failures during the parallel job. | [-1, int.max] | -1 |
| --allowed_failed_percent | Similar to mini_batch_error_threshold but uses the percent of failed mini-batches instead of the count. | [0, 100] | 100 |
| --task_overhead_timeout | The timeout in seconds for the initialization of each mini-batch, for example, loading mini-batch data and passing it to the run() function. | (0, 259200] | 30 |
| --progress_update_timeout | The timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | (0, 259200] | Dynamically calculated by other settings. |
| --first_task_creation_timeout | The timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | (0, 259200] | 600 |
| --copy_logs_to_parent | Boolean option of whether to copy the job progress, overview, and logs to the parent pipeline job. | True, False | False |
| --metrics_name_prefix | Provide a custom prefix for your metrics in this parallel job. | | |
| --push_metrics_to_parent | Boolean option of whether to push metrics to the parent pipeline job. | True, False | False |
| --resource_monitor_interval | The time interval in seconds to dump node resource usage (for example, CPU, memory) to the log folder under the "logs/sys/perf" path. <br><br> Note: frequent resource-usage dumps slightly slow down the execution speed of your mini-batches. Set this value to "0" to stop dumping resource usage. | [0, int.max] | 600 |
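
These predefined arguments are passed through the task's program_arguments, mixed in with your own script arguments, as in this sketch (--model is a hypothetical script argument; the others are predefined arguments from the table above):

program_arguments: >-
  --model ${{inputs.score_model}}
  --allowed_failed_percent 10
  --copy_logs_to_parent True
  --resource_monitor_interval 0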

Remarks

The az ml job commands can be used for managing Azure Machine Learning jobs.

Examples

Examples are available in the examples GitHub repository. One is shown below.

YAML: Using parallel job in pipeline

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
  tag: tagvalue
  owner: sdkteam

settings:
  default_compute: azureml:cpu-cluster

jobs:
  batch_prediction:
    type: parallel
    compute: azureml:cpu-cluster
    inputs:
      input_data: 
        type: mltable
        path: ./neural-iris-mltable
        mode: direct
      score_model: 
        type: uri_folder
        path: ./iris-model
        mode: download
    outputs:
      job_output_file:
        type: uri_file
        mode: rw_mount

    input_data: ${{inputs.input_data}}
    mini_batch_size: "10kb"
    resources:
      instance_count: 2
    max_concurrency_per_instance: 2

    logging_level: "DEBUG"
    mini_batch_error_threshold: 5
    retry_settings:
      max_retries: 2
      timeout: 60

    task:
      type: run_function
      code: "./script"
      entry_script: iris_prediction.py
      environment:
        name: "prs-env"
        version: 1
        image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
        conda_file: ./environment/environment_parallel.yml
      program_arguments: >-
        --model ${{inputs.score_model}}
        --error_threshold 5
        --allowed_failed_percent 30
        --task_overhead_timeout 1200
        --progress_update_timeout 600
        --first_task_creation_timeout 600
        --copy_logs_to_parent True
        --resource_monitor_interval 20
      append_row_to: ${{outputs.job_output_file}}

Next steps