Train models with the CLI (v2) (preview)

The Azure Machine Learning CLI (v2) is an Azure CLI extension enabling you to accelerate the model training process while scaling up and out on Azure compute, with the model lifecycle tracked and auditable.

Training a machine learning model is typically an iterative process. Modern tooling makes it easier than ever to train larger models on more data faster. Previously tedious manual processes like hyperparameter tuning and even algorithm selection are often automated. With the Azure Machine Learning CLI (v2), you can track your jobs (and models) in a workspace, run hyperparameter sweeps, scale up on high-performance Azure compute, and scale out with distributed training.

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

Tip

For a full-featured development environment with schema validation and autocompletion for job YAMLs, use Visual Studio Code and the Azure Machine Learning extension.

Clone examples repository

To run the training examples, first clone the examples repository and change into the cli directory:

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli

Using --depth 1 clones only the latest commit in the repository, which reduces the time needed to complete the operation.

Create compute

You can create an Azure Machine Learning compute cluster from the command line. For instance, the following commands will create one cluster named cpu-cluster and one named gpu-cluster.

az ml compute create -n cpu-cluster --type amlcompute --min-instances 0 --max-instances 8
az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12

You are not charged for compute at this point as cpu-cluster and gpu-cluster will remain at zero nodes until a job is submitted. Learn more about how to manage and optimize cost for AmlCompute.

The following example jobs in this article use one of cpu-cluster or gpu-cluster. Adjust these names in the example jobs throughout this article as needed to the name of your cluster(s). Use az ml compute create -h for more details on compute create options.

Hello world

For the Azure Machine Learning CLI (v2), jobs are authored in YAML format. A job aggregates:

  • What to run
  • How to run it
  • Where to run it

The "hello world" job has all three:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: python:latest
compute: azureml:cpu-cluster

Warning

Python must be installed in the environment used for jobs. Run apt-get update -y && apt-get install python3 -y in your Dockerfile to install if needed, or derive from a base image with Python installed already.

Tip

The $schema: key used throughout the examples enables schema validation and autocompletion when you author YAML files in Visual Studio Code with the Azure Machine Learning extension.

You can run this job:

az ml job create -f jobs/basics/hello-world.yml --web

Tip

The --web parameter will attempt to open your job in the Azure Machine Learning studio using your default web browser. The --stream parameter can be used to stream logs to the console and block further commands.

Overriding values on create or update

YAML job specification values can be overridden using --set when creating or updating a job. For instance:

az ml job create -f jobs/basics/hello-world.yml \
  --set environment.image="python:3.8" \
  --web

Job names

Most az ml job commands other than create and list require --name/-n, which is a job's name, shown as "Run ID" in the studio. You should not set a job's name property directly during creation, because it must be unique per workspace. If the name is not set, Azure Machine Learning generates a random GUID for it, which you can obtain from the output of job creation in the CLI or by copying the "Run ID" property shown in the studio and in the MLflow APIs.

To automate jobs in scripts and CI/CD flows, you can capture a job's name when it is created by adding --query name -o tsv, which selects the name from the output and strips the formatting. The specifics will vary by shell, but for Bash:

run_id=$(az ml job create -f jobs/basics/hello-world.yml --query name -o tsv)

Then use $run_id in subsequent commands like update, show, or stream:

az ml job show -n $run_id --web
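For automation written in Python rather than Bash, the same capture can be sketched with the standard subprocess module. This is a minimal sketch, not part of the examples repository: the helper names build_create_command and submit_job are hypothetical, and it assumes the az CLI with the ml extension is installed, on the PATH, and already logged in.

```python
import subprocess


def build_create_command(job_yaml):
    """Build the az command that creates a job and prints only its name."""
    return ["az", "ml", "job", "create", "-f", job_yaml, "--query", "name", "-o", "tsv"]


def submit_job(job_yaml):
    """Submit the job and return its name (the "Run ID" shown in the studio)."""
    result = subprocess.run(
        build_create_command(job_yaml),
        check=True,           # raise CalledProcessError if az exits with an error
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout as str rather than bytes
    )
    return result.stdout.strip()
```

Calling submit_job("jobs/basics/hello-world.yml") would then return the generated job name, which can be passed to later commands in the same way as $run_id.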

Organize jobs

To organize jobs, you can set a display name, experiment name, description, and tags. Descriptions support markdown syntax in the studio. These properties are mutable after a job is created. A full example:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: python:latest
compute: azureml:cpu-cluster
tags:
  hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
  # Azure Machine Learning "hello world" job

  This is a "hello world" job running in the cloud via Azure Machine Learning!

  ## Description

  Markdown is supported in the studio for job descriptions! You can edit the description there or via CLI.

You can run this job, and these properties will be immediately visible in the studio:

az ml job create -f jobs/basics/hello-world-org.yml --web

Using --set you can update the mutable values after the job is created:

az ml job update -n $run_id --set \
  display_name="updated display name" \
  experiment_name="updated experiment name" \
  description="updated description"  \
  tags.hello="updated tag"

Environment variables

You can set environment variables for use in your job:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
  image: python:latest
compute: azureml:cpu-cluster
environment_variables:
  hello_env_var: "hello world"

You can run this job:

az ml job create -f jobs/basics/hello-world-env-var.yml --web

Warning

You should use inputs for parameterizing arguments in the command. See inputs and outputs.

Track models and source code

Production machine learning models need to be auditable (if not reproducible). It is crucial to keep track of the source code for a given model. Azure Machine Learning takes a snapshot of your source code and keeps it with the job. Additionally, the source repository and commit are kept if you are running jobs from a Git repository.

Tip

If you're following along and running from the examples repository, you can see the source repository and commit in the studio on any of the jobs run so far.

You can specify the code.local_path key in a job with the value as the path to a source code directory. A snapshot of the directory is taken and uploaded with the job. The contents of the directory are directly available from the working directory of the job.

Warning

The source code should not include large data inputs for model training. Instead, use data inputs. You can use a .gitignore file in the source code directory to exclude files from the snapshot. The limits for snapshot size are 300 MB or 2000 files.
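Before submitting a job, you might want to check a source code directory against the limits quoted above. The following is a minimal sketch using only the standard library. The function names are hypothetical, treating 1 MB as 1024 * 1024 bytes is an assumption, and the sketch does not apply .gitignore exclusions, so it can overestimate the snapshot size.

```python
import os

# Limits quoted in the warning above; 1 MB = 1024 * 1024 bytes is an assumption.
MAX_SNAPSHOT_BYTES = 300 * 1024 * 1024
MAX_SNAPSHOT_FILES = 2000


def snapshot_stats(source_dir):
    """Return (file_count, total_bytes) for every file under source_dir."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    return file_count, total_bytes


def within_snapshot_limits(source_dir):
    """Return True if the directory stays within both limits."""
    file_count, total_bytes = snapshot_stats(source_dir)
    return file_count <= MAX_SNAPSHOT_FILES and total_bytes <= MAX_SNAPSHOT_BYTES
```

For example, within_snapshot_limits("src") reports whether the src directory used in this article would fit.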

Let's look at a job that specifies code:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  pip install mlflow azureml-mlflow
  &&
  python hello-mlflow.py
code:
  local_path: src
environment:
  image: python:3.8
compute: azureml:cpu-cluster

The Python script is in the local source code directory. The command then invokes python to run the script. The same pattern can be applied for other programming languages.

Warning

The "hello" family of jobs shown in this article are for demonstration purposes and do not necessarily follow recommended best practices. Using && or similar to run many commands in a sequence is not recommended. Instead, consider writing the commands to a script file in the source code directory and invoking the script in your command. Installing dependencies in the command, as shown above via pip install, is also not recommended. Instead, specify all job dependencies as part of your environment. See how to manage environments with the CLI (v2) for details.

Model tracking with MLflow

While iterating on models, data scientists need to be able to keep track of model parameters and training metrics. Azure Machine Learning integrates with MLflow tracking to enable the logging of models, artifacts, metrics, and parameters to a job. To use MLflow in your Python scripts, add import mlflow and call the mlflow.log_* or mlflow.autolog() APIs in your training code.

Warning

The mlflow and azureml-mlflow packages must be installed in your Python environment for MLflow tracking features.

Tip

The mlflow.autolog() call is supported for many popular frameworks and takes care of the majority of logging for you.

Let's take a look at the Python script invoked in the job above, which uses mlflow to log a parameter, a metric, and an artifact:

# imports
import os
import mlflow

from random import random

# define functions
def main():
    mlflow.log_param("hello_param", "world")
    mlflow.log_metric("hello_metric", random())
    os.system("echo 'hello world' > helloworld.txt")
    mlflow.log_artifact("helloworld.txt")


# run functions
if __name__ == "__main__":
    # run main function
    main()

You can run this job in the cloud via Azure Machine Learning, where it is tracked and auditable:

az ml job create -f jobs/basics/hello-mlflow.yml --web

Query metrics with MLflow

After running jobs, you might want to query the jobs' run results and their logged metrics. Python is better suited for this task than a CLI. You can query runs and their metrics via mlflow and load into familiar objects like Pandas dataframes for analysis.

First, retrieve the MLflow tracking URI for your Azure Machine Learning workspace:

az ml workspace show --query mlflow_tracking_uri -o tsv

Use the output of this command in mlflow.set_tracking_uri(<YOUR_TRACKING_URI>) from a Python environment with MLflow imported. MLflow calls will now correspond to jobs in your Azure Machine Learning workspace.
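As a rough illustration of what such a query can look like, the sketch below sets the tracking URI, looks up an experiment by name, and loads its runs into a Pandas dataframe. The helper names build_filter and query_runs are hypothetical, and the sketch assumes that the mlflow, azureml-mlflow, and pandas packages are installed. Whether the workspace's tracking server supports every MLflow filter expression is also an assumption.

```python
def build_filter(metric, minimum):
    """Build an MLflow search filter that keeps runs whose metric exceeds a minimum."""
    return f"metrics.{metric} > {minimum}"


def query_runs(tracking_uri, experiment_name, filter_string=""):
    """Return the runs of one experiment as a pandas DataFrame."""
    import mlflow  # imported here so build_filter works without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    experiment = mlflow.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"experiment '{experiment_name}' was not found")
    return mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string=filter_string,
    )
```

You could then call query_runs("<YOUR_TRACKING_URI>", "<EXPERIMENT_NAME>", build_filter("hello_metric", 0.5)). In the returned dataframe, logged metrics and parameters appear as columns named metrics.<name> and params.<name>.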

Inputs and outputs

Jobs typically have inputs and outputs. Inputs can be model parameters, which might be swept over for hyperparameter optimization, or cloud data inputs that are mounted or downloaded to the compute target. Outputs (ignoring metrics) are artifacts that can be written or copied to the default outputs or a named data output.

Literal inputs

Literal inputs are directly resolved in the command. You can modify our "hello world" job to use literal inputs:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo ${{inputs.hello_string}}
  &&
  echo ${{inputs.hello_number}}
environment:
  image: python:latest
inputs:
  hello_string: "hello world"
  hello_number: 42
compute: azureml:cpu-cluster

You can run this job:

az ml job create -f jobs/basics/hello-world-input.yml --web

You can use --set to override inputs:

az ml job create -f jobs/basics/hello-world-input.yml --set \
  inputs.hello_string="hello there" \
  inputs.hello_number=24 \
  --web

Literal inputs to jobs can be converted to search space inputs for hyperparameter sweeps on model training.
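To make "directly resolved in the command" concrete, the following is an illustrative sketch of placeholder substitution. It is not the service's implementation. The function name resolve_command and the regular expression are assumptions chosen for this example.

```python
import re

# Matches placeholders such as ${{inputs.hello_string}}
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def resolve_command(command, inputs):
    """Replace each ${{inputs.<name>}} placeholder with the value of that input."""
    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"input '{name}' is not defined")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, command)


command = "echo ${{inputs.hello_string}} && echo ${{inputs.hello_number}}"
inputs = {"hello_string": "hello world", "hello_number": 42}
print(resolve_command(command, inputs))  # prints: echo hello world && echo 42
```

Overriding a value, as --set does, corresponds to changing an entry in the inputs dictionary before the command is resolved.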

Search space inputs

For a sweep job, you can specify a search space for literal inputs to be chosen from. For the full range of options for search space inputs, see the sweep job YAML syntax reference.

Warning

Sweep jobs are not currently supported in pipeline jobs.

Let's demonstrate the concept with a simple Python script that takes in arguments and logs a random metric:

# imports
import os
import mlflow
import argparse

from random import random

# define functions
def main(args):
    # print inputs
    print(f"A: {args.A}")
    print(f"B: {args.B}")
    print(f"C: {args.C}")

    # log inputs as parameters
    mlflow.log_param("A", args.A)
    mlflow.log_param("B", args.B)
    mlflow.log_param("C", args.C)

    # log a random metric
    mlflow.log_metric("random_metric", random())


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--A", type=float, default=0.5)
    parser.add_argument("--B", type=str, default="hello world")
    parser.add_argument("--C", type=float, default=1.0)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

And create a corresponding sweep job:

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    pip install mlflow azureml-mlflow
    &&
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code:
    local_path: src
  environment:
    image: python:3.8
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.

And run it:

az ml job create -f jobs/basics/hello-sweep.yml --web
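Conceptually, random sampling draws one value per hyperparameter from its declared distribution for each trial. The sketch below illustrates this idea for the choice and uniform types used above. It is a simplified model, not how the service samples internally, and the function name sample is hypothetical.

```python
import random

# Mirrors the search_space of the sweep job above.
search_space = {
    "B": {"type": "choice", "values": ["hello", "world", "hello world"]},
    "C": {"type": "uniform", "min_value": 0.1, "max_value": 1.0},
}


def sample(space, rng):
    """Draw one value per hyperparameter from its distribution."""
    trial = {}
    for name, spec in space.items():
        if spec["type"] == "choice":
            trial[name] = rng.choice(spec["values"])
        elif spec["type"] == "uniform":
            trial[name] = rng.uniform(spec["min_value"], spec["max_value"])
        else:
            raise ValueError(f"unsupported distribution: {spec['type']}")
    return trial


rng = random.Random(0)  # seeded only so the sketch is repeatable
trials = [sample(search_space, rng) for _ in range(4)]  # max_total_trials: 4
for trial in trials:
    print(trial)
```

Each printed dictionary corresponds to the values that one trial would receive for ${{search_space.B}} and ${{search_space.C}}.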

Data inputs

Data inputs are resolved to a path on the job compute's local filesystem. Let's demonstrate with the classic Iris dataset, which is hosted publicly in a blob container at https://azuremlexamples.blob.core.windows.net/datasets/iris.csv.

You can author a Python script that takes the path to the Iris CSV file as an argument, reads it into a dataframe, prints the first 5 lines, and saves it to the outputs directory.

# imports
import os
import argparse

import pandas as pd

# define functions
def main(args):
    # read in data
    df = pd.read_csv(args.iris_csv)

    # print first 5 lines
    print(df.head())

    # ensure outputs directory exists
    os.makedirs("outputs", exist_ok=True)

    # save data to outputs
    df.to_csv("outputs/iris.csv", index=False)


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--iris-csv", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

Azure storage URI inputs can be specified, which will mount or download data to the local filesystem. You can specify a single file:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo "--iris-csv: ${{inputs.iris_csv}}"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.iris_csv}}
code:
  local_path: src
inputs:
  iris_csv: 
    file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment:
  image: python:latest
compute: azureml:cpu-cluster

And run:

az ml job create -f jobs/basics/hello-iris-file.yml --web

Or specify an entire folder:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  ls ${{inputs.data_dir}}
  &&
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.data_dir}}/iris.csv
code:
  local_path: src
inputs:
  data_dir: 
    folder: wasbs://datasets@azuremlexamples.blob.core.windows.net/
environment:
  image: python:latest
compute: azureml:cpu-cluster

And run:

az ml job create -f jobs/basics/hello-iris-folder.yml --web

Private data

For private data in Azure Blob Storage or Azure Data Lake Storage connected to Azure Machine Learning through a datastore, you can use Azure Machine Learning URIs of the format azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA> for input data. For instance, if you upload the Iris CSV to a directory named /example-data/ in the Blob container corresponding to the datastore named workspaceblobstore you can modify a previous job to use the file in the datastore:

Warning

Running these jobs will fail for you if you have not copied the Iris CSV to the same location in workspaceblobstore.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo "--iris-csv: ${{inputs.iris_csv}}"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.iris_csv}}
code:
  local_path: src
inputs:
  iris_csv: 
    file: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment:
  image: python:latest
compute: azureml:cpu-cluster

Or the entire directory:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  ls ${{inputs.data_dir}}
  &&
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.data_dir}}/iris.csv
code:
  local_path: src
inputs:
  data_dir: 
    folder: azureml://datastores/workspaceblobstore/paths/example-data/
    mode: rw_mount
environment:
  image: python:latest
compute: azureml:cpu-cluster
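If you generate job files programmatically, the datastore URI format described above can be built and checked with a small helper. This is a minimal sketch. The function names and the regular expression are hypothetical and only reflect the azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA> format given in this section.

```python
import re

URI_PATTERN = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def datastore_uri(datastore_name, path):
    """Build an Azure Machine Learning datastore URI for a file or folder path."""
    return f"azureml://datastores/{datastore_name}/paths/{path.lstrip('/')}"


def parse_datastore_uri(uri):
    """Split a datastore URI into (datastore_name, path)."""
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri}")
    return match.group("datastore"), match.group("path")


print(datastore_uri("workspaceblobstore", "/example-data/iris.csv"))
# prints: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
```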

Default outputs

The ./outputs and ./logs directories receive special treatment by Azure Machine Learning. If you write any files to these directories during your job, these files will get uploaded to the job so that you can still access them once it is complete. The ./outputs folder is uploaded at the end of the job, while the files written to ./logs are uploaded in real time. Use the latter if you want to stream logs during the job, such as TensorBoard logs.

You can modify the "hello world" job to output to a file in the default outputs directory instead of printing to stdout:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
  image: python:latest
compute: azureml:cpu-cluster

You can run this job:

az ml job create -f jobs/basics/hello-world-output.yml --web

And download the job's logs and outputs, where helloworld.txt will be present in the <RUN_ID>/outputs/ directory:

az ml job download -n $run_id
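A training script can follow the same convention for the two special directories. The following is a minimal sketch. The function name run and the file name progress.log are hypothetical, and the upload behavior described in the comments is taken from the description of ./logs and ./outputs in this section.

```python
import os


def run(work_dir="."):
    """Write progress lines to ./logs and a final result to ./outputs."""
    outputs_dir = os.path.join(work_dir, "outputs")
    logs_dir = os.path.join(work_dir, "logs")
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    # Progress lines go to ./logs, which is uploaded while the job runs.
    with open(os.path.join(logs_dir, "progress.log"), "a") as log:
        for step in range(3):
            log.write(f"step {step} done\n")
            log.flush()  # push each line to disk promptly

    # The final result goes to ./outputs, which is uploaded when the job ends.
    with open(os.path.join(outputs_dir, "helloworld.txt"), "w") as out:
        out.write("hello world\n")
```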

Data outputs

You can specify named data outputs. This will create a directory in the default datastore which will be read/write mounted by default.

You can modify the earlier "hello world" job to write to a named data output:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
  hello_output:
environment:
  image: python
compute: azureml:cpu-cluster

Hello pipelines

Pipeline jobs can run multiple jobs in parallel or in sequence. If there are input/output dependencies between steps in a pipeline, the dependent step will run after the other completes.

You can split a "hello world" job into two jobs:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  hello_job:
    command: echo "hello"
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
  world_job:
    command: echo "world"
    environment:
      image: python
    compute: azureml:cpu-cluster

And run it:

az ml job create -f jobs/basics/hello-pipeline.yml --web

The "hello" and "world" jobs respectively will run in parallel if the compute target has the available resources to do so.

To pass data between steps in a pipeline, define a data output in the "hello" job and a corresponding input in the "world" job that refers to the output of the "hello" job:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  hello_job:
    command: echo "hello" && echo "world" > ${{outputs.world_output}}/world.txt
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    outputs:
      world_output:
  world_job:
    command: cat ${{inputs.world_input}}/world.txt
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    inputs:
      world_input: ${{jobs.hello_job.outputs.world_output}}

And run it:

az ml job create -f jobs/basics/hello-pipeline-io.yml --web

This time, the "world" job will run after the "hello" job completes.
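The ordering follows from the input and output bindings alone. As an illustration, the sketch below extracts the ${{jobs.<name>.outputs.<output>}} references from each job's inputs and sorts the jobs topologically. It is a simplified model rather than the service's scheduler, the function name execution_order is hypothetical, and it assumes Python 3.9 or later for graphlib.

```python
import re
from graphlib import TopologicalSorter  # available in Python 3.9+

# Matches bindings such as ${{jobs.hello_job.outputs.world_output}}
BINDING = re.compile(r"\$\{\{\s*jobs\.([\w-]+)\.outputs\.[\w-]+\s*\}\}")


def execution_order(jobs):
    """Order job names so each job comes after the jobs whose outputs it reads."""
    graph = {}
    for name, spec in jobs.items():
        upstream = set()
        for value in spec.get("inputs", {}).values():
            upstream.update(BINDING.findall(str(value)))
        graph[name] = upstream  # node -> set of jobs it depends on
    return list(TopologicalSorter(graph).static_order())


jobs = {
    "world_job": {"inputs": {"world_input": "${{jobs.hello_job.outputs.world_output}}"}},
    "hello_job": {"outputs": {"world_output": None}},
}
print(execution_order(jobs))  # prints: ['hello_job', 'world_job']
```

Jobs with no bindings between them have no ordering constraint in this model, which corresponds to the parallel case described earlier.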

To avoid duplicating common settings across jobs in a pipeline, you can set them outside the jobs:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
compute: azureml:cpu-cluster
settings:
  environment:
    image: python:latest
jobs:
  hello_job:
    command: echo "hello"
  world_job:
    command: echo "world"

You can run this:

az ml job create -f jobs/basics/hello-pipeline-settings.yml --web

The corresponding setting on an individual job will override the common settings for a pipeline job. The concepts so far can be combined into a three-step pipeline job with jobs "A", "B", and "C". The "C" job has a data dependency on the "B" job, while the "A" job can run independently. The "A" job will also use an individually set environment and bind one of its inputs to a top-level pipeline job input:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
compute: azureml:cpu-cluster
settings:
  environment:
    image: python:latest
inputs:
  hello_string_top_level_input: "hello world"
jobs:
  A:
    command: echo hello ${{inputs.hello_string}}
    environment:
      image: python:3.9
    inputs:
      hello_string: ${{inputs.hello_string_top_level_input}}
  B:
    command: echo "world" >> ${{outputs.world_output}}/world.txt
    outputs:
      world_output:
  C:
    command: echo ${{inputs.world_input}}/world.txt
    inputs:
      world_input: ${{jobs.B.outputs.world_output}}

You can run this:

az ml job create -f jobs/basics/hello-pipeline-abc.yml --web
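The override rule can be pictured as a dictionary merge in which a job's own values win over the pipeline-level ones. The sketch below is only a mental model. The function name effective_settings is hypothetical, and replacing a whole key (rather than merging nested values) is an assumption made for this example.

```python
def effective_settings(pipeline, job_name):
    """Merge pipeline-level defaults with a job's own settings; the job's values win."""
    defaults = dict(pipeline.get("settings", {}))
    if "compute" in pipeline:
        defaults["compute"] = pipeline["compute"]
    job = pipeline["jobs"][job_name]
    overrides = {key: job[key] for key in ("environment", "compute") if key in job}
    return {**defaults, **overrides}


pipeline = {
    "compute": "azureml:cpu-cluster",
    "settings": {"environment": {"image": "python:latest"}},
    "jobs": {
        "A": {"command": "echo hello", "environment": {"image": "python:3.9"}},
        "B": {"command": "echo world"},
    },
}
print(effective_settings(pipeline, "A")["environment"])  # prints: {'image': 'python:3.9'}
print(effective_settings(pipeline, "B")["environment"])  # prints: {'image': 'python:latest'}
```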

Train a model

At this point, a model still hasn't been trained. Let's add some sklearn code into a Python script with MLflow tracking to train a model on the Iris CSV:

# imports
import os
import mlflow
import argparse

import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# define functions
def main(args):
    # enable auto logging
    mlflow.autolog()

    # setup parameters
    params = {
        "C": args.C,
        "kernel": args.kernel,
        "degree": args.degree,
        "gamma": args.gamma,
        "coef0": args.coef0,
        "shrinking": args.shrinking,
        "probability": args.probability,
        "tol": args.tol,
        "cache_size": args.cache_size,
        "class_weight": args.class_weight,
        "verbose": args.verbose,
        "max_iter": args.max_iter,
        "decision_function_shape": args.decision_function_shape,
        "break_ties": args.break_ties,
        "random_state": args.random_state,
    }

    # read in data
    df = pd.read_csv(args.iris_csv)

    # process data
    X_train, X_test, y_train, y_test = process_data(df, args.random_state)

    # train model
    model = train_model(params, X_train, X_test, y_train, y_test)


def process_data(df, random_state):
    # split dataframe into X and y
    X = df.drop(["species"], axis=1)
    y = df["species"]

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # return split data
    return X_train, X_test, y_train, y_test


def train_model(params, X_train, X_test, y_train, y_test):
    # train model
    model = SVC(**params)
    model = model.fit(X_train, y_train)

    # return model
    return model


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    # argparse's type=bool treats any non-empty string (including "False") as True,
    # so boolean options are parsed from text explicitly
    def to_bool(value):
        return str(value).lower() in ("true", "1", "yes")

    parser.add_argument("--iris-csv", type=str)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--kernel", type=str, default="rbf")
    parser.add_argument("--degree", type=int, default=3)
    parser.add_argument("--gamma", type=str, default="scale")
    parser.add_argument("--coef0", type=float, default=0)
    parser.add_argument("--shrinking", type=to_bool, default=False)
    parser.add_argument("--probability", type=to_bool, default=False)
    parser.add_argument("--tol", type=float, default=1e-3)
    parser.add_argument("--cache_size", type=float, default=1024)
    parser.add_argument("--class_weight", type=dict, default=None)
    parser.add_argument("--verbose", type=to_bool, default=False)
    parser.add_argument("--max_iter", type=int, default=-1)
    parser.add_argument("--decision_function_shape", type=str, default="ovr")
    parser.add_argument("--break_ties", type=to_bool, default=False)
    parser.add_argument("--random_state", type=int, default=42)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

The scikit-learn framework is supported by MLflow for autologging, so a single mlflow.autolog() call in the script will log all model parameters, training metrics, model artifacts, and some extra artifacts (in this case a confusion matrix image).

To run this in the cloud, specify it as a job:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: 
  local_path: src
command: >-
  python main.py 
  --iris-csv ${{inputs.iris_csv}}
  --C ${{inputs.C}}
  --kernel ${{inputs.kernel}}
  --coef0 ${{inputs.coef0}}
inputs:
  iris_csv: 
    file: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
  C: 0.8
  kernel: "rbf"
  coef0: 0.1
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.

And run it:

az ml job create -f jobs/single-step/scikit-learn/iris/job.yml --web

To register a model, you can download the outputs and create a model from the local directory:

az ml job download -n $run_id
az ml model create -n sklearn-iris-example -l $run_id/model/

Sweep hyperparameters

You can modify the previous job to sweep over hyperparameters:

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: 
    local_path: src
  command: >-
    python main.py 
    --iris-csv ${{inputs.iris_csv}}
    --C ${{search_space.C}}
    --kernel ${{search_space.kernel}}
    --coef0 ${{search_space.coef0}}
  environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9
inputs:
  iris_csv: 
    file: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  C:
    type: uniform
    min_value: 0.5
    max_value: 0.9
  kernel:
    type: choice
    values: ["rbf", "linear", "poly"]
  coef0:
    type: uniform
    min_value: 0.1
    max_value: 1
objective:
  goal: maximize
  primary_metric: training_f1_score
limits:
  max_total_trials: 20
  max_concurrent_trials: 10
  timeout: 7200
display_name: sklearn-iris-sweep-example
experiment_name: sklearn-iris-sweep-example
description: Sweep hyperparameters for training a scikit-learn SVM on the Iris dataset.

And run it:

az ml job create -f jobs/single-step/scikit-learn/iris/job-sweep.yml --web

Tip

Check the "Child runs" tab in the studio to monitor progress and view parameter charts.

For more sweep options, see the sweep job YAML syntax reference.

Distributed training

Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training. See the distributed section of the command job YAML syntax reference for details.

As an example, you can train a convolutional neural network (CNN) on the CIFAR-10 dataset using distributed PyTorch. The full script is available in the examples repository.

The CIFAR-10 dataset in torchvision expects as input a directory that contains the cifar-10-batches-py directory. You can download the compressed archive and extract it into a local directory:

mkdir data
wget "https://azuremlexamples.blob.core.windows.net/datasets/cifar-10-python.tar.gz"
tar -xvzf cifar-10-python.tar.gz -C data

Then create an Azure Machine Learning dataset from the local directory, which will be uploaded to the default datastore:

az ml dataset create --name cifar-10-example --version 1 --set local_path=data

Optionally, remove the local file and directory:

rm cifar-10-python.tar.gz
rm -r data

Datasets (File only) can be referred to in a job using the dataset key of a data input. The format is azureml:<DATASET_NAME>:<DATASET_VERSION>, so for the CIFAR-10 dataset just created, it is azureml:cifar-10-example:1.

With the dataset in place, you can author a distributed PyTorch job to train our model:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: 
  local_path: src
command: >-
  python train.py 
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
    dataset: azureml:cifar-10-example:1
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
compute: azureml:gpu-cluster
distribution:
  type: pytorch 
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

And run it:

az ml job create -f jobs/single-step/pytorch/cifar-distributed/job.yml --web
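With instance_count: 2 and process_count_per_instance: 2, the job above launches four training processes in total. The sketch below shows how a training script might read its position among them. It assumes the conventional PyTorch environment variables RANK, LOCAL_RANK, and WORLD_SIZE are set for each process. Check the distributed training documentation to confirm this for your job type. The function names are hypothetical.

```python
import os


def expected_world_size(instance_count, process_count_per_instance):
    """Total number of processes across all nodes."""
    return instance_count * process_count_per_instance


def read_rank_info(environ=os.environ):
    """Read rank information, falling back to single-process defaults."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index within the node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total process count
    }


print(expected_world_size(2, 2))  # prints: 4
```

A script would typically use local_rank to select a GPU on its node and rank to decide which process writes logs or checkpoints.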

Build a training pipeline

The CIFAR-10 example above translates well to a pipeline job. The previous job can be split into three jobs for orchestration in a pipeline:

  • "get-data" to run a Bash script to download and extract cifar-10-batches-py
  • "train-model" to take the data and train a model with distributed PyTorch
  • "eval-model" to take the data and the trained model and evaluate accuracy

Both "train-model" and "eval-model" will have a dependency on the "get-data" job's output. Additionally, "eval-model" will have a dependency on the "train-model" job's output. Thus the three jobs will run sequentially.

You can orchestrate these three jobs within a pipeline job:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: cifar-10-pipeline-example
experiment_name: cifar-10-pipeline-example
jobs:
  get-data:
    command: bash main.sh ${{outputs.cifar}}
    code:
      local_path: src/get-data
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    outputs:
      cifar:
  train-model:
    command: >-
      python main.py
      --data-dir ${{inputs.cifar}}
      --epochs ${{inputs.epochs}}
      --model-dir ${{outputs.model_dir}}
    code:
      local_path: src/train-model
    inputs:
      epochs: 1
      cifar: ${{jobs.get-data.outputs.cifar}}
    outputs:
      model_dir:
    environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
    compute: azureml:gpu-cluster
    distribution:
      type: pytorch
      process_count_per_instance: 2
    resources:
      instance_count: 2
  eval-model:
    command: >-
      python main.py
      --data-dir ${{inputs.cifar}}
      --model-dir ${{inputs.model_dir}}/model
    code:
      local_path: src/eval-model
    environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
    compute: azureml:gpu-cluster
    distribution:
      type: pytorch
      process_count_per_instance: 2
    resources:
      instance_count: 1
    inputs:
      cifar: ${{jobs.get-data.outputs.cifar}}
      model_dir: ${{jobs.train-model.outputs.model_dir}}

And run:

az ml job create -f jobs/pipelines/cifar-10/job.yml --web

Pipelines can also be written using reusable components. For more, see Create and run components-based machine learning pipelines with the Azure Machine Learning CLI (Preview).

Next steps