Train models with the CLI (v2)

APPLIES TO: Azure CLI ml extension v2 (current)

The Azure Machine Learning CLI (v2) is an Azure CLI extension that enables you to accelerate the model training process while scaling up and out on Azure compute, with the model lifecycle tracked and auditable.

Training a machine learning model is typically an iterative process. Modern tooling makes it easier than ever to train larger models on more data faster. Previously tedious manual processes like hyperparameter tuning and even algorithm selection are often automated. With the Azure Machine Learning CLI (v2), you can track your jobs (and models) in a workspace with hyperparameter sweeps, scale up on high-performance Azure compute, and scale out using distributed training.

Prerequisites

  • An Azure subscription. If you don't have one, create a free account before you begin.
  • An Azure Machine Learning workspace.
  • The Azure CLI installed, with the ml extension (v2).

Tip

For a full-featured development environment with schema validation and autocompletion for job YAMLs, use Visual Studio Code and the Azure Machine Learning extension.

Clone examples repository

To run the training examples, first clone the examples repository and change into the cli directory:

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli

Using --depth 1 clones only the latest commit of the repository, which reduces the time to complete the operation.

Create compute

You can create an Azure Machine Learning compute cluster from the command line. For instance, the following commands create one cluster named cpu-cluster and another named gpu-cluster.


az ml compute create -n cpu-cluster --type amlcompute --min-instances 0 --max-instances 8

az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12

You're not charged for compute at this point: cpu-cluster and gpu-cluster remain at zero nodes until a job is submitted. Learn more about how to manage and optimize cost for AmlCompute.

The example jobs in this article use one of cpu-cluster or gpu-cluster. If your clusters have different names, adjust the names in the example jobs throughout this article accordingly. Use az ml compute create -h for more details on compute creation options.

Hello world

For the Azure Machine Learning CLI (v2), jobs are authored in YAML format. A job aggregates:

  • What to run
  • How to run it
  • Where to run it

The "hello world" job has all three:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
compute: azureml:cpu-cluster

Warning

Python must be installed in the environment used for jobs. Run apt-get update -y && apt-get install python3 -y in your Dockerfile to install Python if needed, or derive from a base image with Python already installed.

Tip

The $schema: field included throughout the examples enables schema validation and autocompletion when authoring YAML files in Visual Studio Code with the Azure Machine Learning extension.

You can run this job:

az ml job create -f jobs/basics/hello-world.yml --web

Tip

The --web parameter will attempt to open your job in the Azure Machine Learning studio using your default web browser. The --stream parameter can be used to stream logs to the console and block further commands.

Overriding values on create or update

YAML job specification values can be overridden using --set when creating or updating a job. For instance:

az ml job create -f jobs/basics/hello-world.yml \
  --set environment.image="python:3.8" \
  --web

Job names

Most az ml job commands other than create and list require --name/-n, which is a job's name or "Run ID" in the studio. You typically shouldn't set a job's name property directly during creation, as it must be unique per workspace. If you don't set one, Azure Machine Learning generates a random GUID for the job name, which you can obtain from the CLI output at creation time or by copying the "Run ID" property in the studio and MLflow APIs.

To automate jobs in scripts and CI/CD flows, you can capture a job's name when it's created by adding --query name -o tsv, which strips the output down to just the name. The specifics vary by shell, but for Bash:

run_id=$(az ml job create -f jobs/basics/hello-world.yml --query name -o tsv)

Then use $run_id in subsequent commands like update, show, or stream:

az ml job show -n $run_id --web

Organize jobs

To organize jobs, you can set a display name, experiment name, description, and tags. Descriptions support markdown syntax in the studio. These properties are mutable after a job is created. A full example:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
tags:
  hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
  # Azure Machine Learning "hello world" job

  This is a "hello world" job running in the cloud via Azure Machine Learning!

  ## Description

  Markdown is supported in the studio for job descriptions! You can edit the description there or via CLI.

You can run this job, and these properties will be immediately visible in the studio:

az ml job create -f jobs/basics/hello-world-org.yml --web

Using --set you can update the mutable values after the job is created:

az ml job update -n $run_id --set \
  display_name="updated display name" \
  experiment_name="updated experiment name" \
  description="updated description"  \
  tags.hello="updated tag"

Environment variables

You can set environment variables for use in your job:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
environment_variables:
  hello_env_var: "hello world"

You can run this job:

az ml job create -f jobs/basics/hello-world-env-var.yml --web

Warning

You should use inputs to parameterize arguments in the command. See inputs and outputs.

Track models and source code

Production machine learning models need to be auditable (if not reproducible). It is crucial to keep track of the source code for a given model. Azure Machine Learning takes a snapshot of your source code and keeps it with the job. Additionally, the source repository and commit are tracked if you are running jobs from a Git repository.

Tip

If you're following along and running from the examples repository, you can see the source repository and commit in the studio on any of the jobs run so far.

You can specify the code field in a job with the value as the path to a source code directory. A snapshot of the directory is taken and uploaded with the job. The contents of the directory are directly available from the working directory of the job.

Warning

The source code should not include large data inputs for model training. Instead, use data inputs. You can use a .gitignore file in the source code directory to exclude files from the snapshot. The limits for snapshot size are 300 MB or 2000 files.

Let's look at a job that specifies code:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: python hello-mlflow.py
code: src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

The Python script is in the local source code directory. The command then invokes python to run the script. The same pattern can be applied to other programming languages.

Warning

The "hello" family of jobs shown in this article are for demonstration purposes and do not necessarily follow recommended best practices. Using && or similar to run many commands in a sequence is not recommended -- instead, consider writing the commands to a script file in the source code directory and invoking the script in your command. Installing dependencies in the command, as shown above via pip install, is not recommended -- instead, all job dependencies should be specified as part of your environment. See how to manage environments with the CLI (v2) for details.

Model tracking with MLflow

While iterating on models, data scientists need to keep track of model parameters and training metrics. Azure Machine Learning integrates with MLflow tracking to enable the logging of models, artifacts, metrics, and parameters to a job. To use MLflow in your Python scripts, add import mlflow and call the mlflow.log_* or mlflow.autolog() APIs in your training code.

Warning

The mlflow and azureml-mlflow packages must be installed in your Python environment for MLflow tracking features.

Tip

The mlflow.autolog() call is supported for many popular frameworks and takes care of the majority of logging for you.

Let's take a look at the Python script invoked in the job above, which uses mlflow to log a parameter, a metric, and an artifact:

# imports
import os
import mlflow

from random import random

# define functions
def main():
    mlflow.log_param("hello_param", "world")
    mlflow.log_metric("hello_metric", random())
    os.system(f"echo 'hello world' > helloworld.txt")
    mlflow.log_artifact("helloworld.txt")


# run functions
if __name__ == "__main__":
    # run main function
    main()

You can run this job in the cloud via Azure Machine Learning, where it is tracked and auditable:

az ml job create -f jobs/basics/hello-mlflow.yml --web

Query metrics with MLflow

After running jobs, you might want to query their results and logged metrics. Python is better suited to this task than the CLI. You can query runs and their metrics via mlflow and load them into familiar objects like Pandas dataframes for analysis.

First, retrieve the MLflow tracking URI for your Azure Machine Learning workspace:

az ml workspace show --query mlflow_tracking_uri -o tsv

Use the output of this command in mlflow.set_tracking_uri(<YOUR_TRACKING_URI>) from a Python environment with MLflow imported. MLflow calls will now correspond to jobs in your Azure Machine Learning workspace.
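
As a minimal sketch, assuming the mlflow and azureml-mlflow packages are installed and an experiment named hello-world-example (the name used in the earlier examples) exists in your workspace, you can load an experiment's runs into a Pandas dataframe like this:

import mlflow

# Point MLflow at your workspace using the URI returned by the command above.
mlflow.set_tracking_uri("<YOUR_TRACKING_URI>")

# Look up the experiment by name and query its runs.
experiment = mlflow.get_experiment_by_name("hello-world-example")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# runs is a Pandas dataframe; logged values appear in "metrics.*" and "params.*" columns.
print(runs.head())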

Inputs and outputs

Jobs typically have inputs and outputs. Inputs can be model parameters, which you might sweep over for hyperparameter optimization, or cloud data inputs that are mounted or downloaded to the compute target. Outputs (ignoring metrics) are artifacts that can be written or copied to the default outputs or a named data output.

Literal inputs

Literal inputs are directly resolved in the command. You can modify our "hello world" job to use literal inputs:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo ${{inputs.hello_string}}
  echo ${{inputs.hello_number}}
environment:
  image: library/python:latest
inputs:
  hello_string: "hello world"
  hello_number: 42
compute: azureml:cpu-cluster

You can run this job:

az ml job create -f jobs/basics/hello-world-input.yml --web

You can use --set to override inputs:

az ml job create -f jobs/basics/hello-world-input.yml --set \
  inputs.hello_string="hello there" \
  inputs.hello_number=24 \
  --web

Literal inputs to jobs can be converted to search space inputs for hyperparameter sweeps on model training.

Search space inputs

For a sweep job, you can specify a search space for literal inputs to be chosen from. For the full range of options for search space inputs, see the sweep job YAML syntax reference.

Let's demonstrate the concept with a simple Python script that takes in arguments and logs a random metric:

# imports
import os
import mlflow
import argparse

from random import random

# define functions
def main(args):
    # print inputs
    print(f"A: {args.A}")
    print(f"B: {args.B}")
    print(f"C: {args.C}")

    # log inputs as parameters
    mlflow.log_param("A", args.A)
    mlflow.log_param("B", args.B)
    mlflow.log_param("C", args.C)

    # log a random metric
    mlflow.log_metric("random_metric", random())


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--A", type=float, default=0.5)
    parser.add_argument("--B", type=str, default="hello world")
    parser.add_argument("--C", type=float, default=1.0)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

And create a corresponding sweep job:

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.

And run it:

az ml job create -f jobs/basics/hello-sweep.yml --web

Data inputs

Data inputs are resolved to a path on the job compute's local filesystem. Let's demonstrate with the classic Iris dataset, which is hosted publicly in a blob container at https://azuremlexamples.blob.core.windows.net/datasets/iris.csv.

You can author a Python script that takes the path to the Iris CSV file as an argument, reads it into a dataframe, prints the first 5 lines, and saves it to the outputs directory.

# imports
import os
import argparse

import pandas as pd

# define functions
def main(args):
    # read in data
    df = pd.read_csv(args.iris_csv)

    # print first 5 lines
    print(df.head())

    # ensure outputs directory exists
    os.makedirs("outputs", exist_ok=True)

    # save data to outputs
    df.to_csv("outputs/iris.csv", index=False)


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--iris-csv", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

Azure storage URI inputs can be specified, which mount or download data to the local filesystem. You can specify a single file:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo "--iris-csv: ${{inputs.iris_csv}}"
  python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
  iris_csv:
    type: uri_file 
    path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

And run:

az ml job create -f jobs/basics/hello-iris-file.yml --web

Or specify an entire folder:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.data_dir}}
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
  data_dir:
    type: uri_folder 
    path: wasbs://datasets@azuremlexamples.blob.core.windows.net/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

And run:

az ml job create -f jobs/basics/hello-iris-folder.yml --web

Make sure you set the input type field to type: uri_file or type: uri_folder according to whether the data is a single file or a folder. If the type field is omitted, the default is uri_folder.

Private data

For private data in Azure Blob Storage or Azure Data Lake Storage connected to Azure Machine Learning through a datastore, you can use Azure Machine Learning URIs of the format azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA> for input data. For instance, if you upload the Iris CSV to a directory named /example-data/ in the blob container corresponding to the datastore named workspaceblobstore, you can modify a previous job to use the file in the datastore:

Warning

These jobs will fail if you haven't copied the Iris CSV to the same location in workspaceblobstore.

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  echo "--iris-csv: ${{inputs.iris_csv}}"
  python hello-iris.py --iris-csv ${{inputs.iris_csv}}
code: src
inputs:
  iris_csv:
    type: uri_file 
    path: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

Or the entire directory:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: |
  ls ${{inputs.data_dir}}
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  python hello-iris.py --iris-csv ${{inputs.data_dir}}/iris.csv
code: src
inputs:
  data_dir:
    type: uri_folder 
    path: azureml://datastores/workspaceblobstore/paths/example-data/
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:cpu-cluster

Default outputs

The ./outputs and ./logs directories receive special treatment by Azure Machine Learning. If you write any files to these directories during your job, they're uploaded to the job so that you can still access them after the job completes. The ./outputs folder is uploaded at the end of the job, while files written to ./logs are uploaded in real time. Use the latter if you want to stream logs during the job, such as TensorBoard logs.

In addition, any files logged from MLflow via autologging or mlflow.log_* artifact logging are automatically persisted as well. Together with the ./outputs and ./logs directories, these files are persisted to a directory that corresponds to the job's default artifact location.
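
As a minimal Python sketch of the distinction (the file names here are illustrative), a script can stream progress to ./logs while writing its final artifact to ./outputs:

import os

# Files in ./logs upload in real time; files in ./outputs upload when the job completes.
os.makedirs("logs", exist_ok=True)
os.makedirs("outputs", exist_ok=True)

with open("logs/progress.log", "w") as log:
    for epoch in range(3):
        log.write(f"epoch {epoch} done\n")
        log.flush()  # flush so the line is streamed while the job runs

with open("outputs/result.txt", "w") as f:
    f.write("final result\n")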

You can modify the "hello world" job to output to a file in the default outputs directory instead of printing to stdout:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
  image: library/python:latest
compute: azureml:cpu-cluster

You can run this job:

az ml job create -f jobs/basics/hello-world-output.yml --web

And download the logs, where helloworld.txt will be present in the <RUN_ID>/outputs/ directory:

az ml job download -n $run_id

Data outputs

You can specify named data outputs. This creates a directory in the default datastore, which is read/write mounted by default.

You can modify the earlier "hello world" job to write to a named data output:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
  hello_output:
environment:
  image: library/python:latest
compute: azureml:cpu-cluster
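
You can run this job (the file name follows the examples repository's naming convention):

az ml job create -f jobs/basics/hello-world-output-data.yml --web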

Hello pipelines

Pipeline jobs can run multiple jobs in parallel or in sequence. If there are input/output dependencies between steps in a pipeline, the dependent step will run after the other completes.

You can split a "hello world" job into two jobs:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline
jobs:
  hello_job:
    command: echo "hello"
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster
  world_job:
    command: echo "world"
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster

And run it:

az ml job create -f jobs/basics/hello-pipeline.yml --web

The "hello" and "world" jobs respectively will run in parallel if the compute target has the available resources to do so.

To pass data between steps in a pipeline, define a data output in the "hello" job and a corresponding input in the "world" job that refers to the former's output:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_io
jobs:
  hello_job:
    command: echo "hello" && echo "world" > ${{outputs.world_output}}/world.txt
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster
    outputs:
      world_output:
  world_job:
    command: cat ${{inputs.world_input}}/world.txt
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    compute: azureml:cpu-cluster
    inputs:
      world_input: ${{parent.jobs.hello_job.outputs.world_output}}

And run it:

az ml job create -f jobs/basics/hello-pipeline-io.yml --web

This time, the "world" job will run after the "hello" job completes.

To avoid duplicating common settings across jobs in a pipeline, you can set them outside the jobs:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_settings

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
jobs:
  hello_job:
    command: echo 202204190 & echo "hello"
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:23
  world_job:
    command: echo 202204190 & echo "world"
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:23

You can run this:

az ml job create -f jobs/basics/hello-pipeline-settings.yml --web

The corresponding setting on an individual job will override the common settings for a pipeline job.

The concepts so far can be combined into a three-step pipeline job with jobs "A", "B", and "C". The "C" job has a data dependency on the "B" job, while the "A" job can run independently. The "A" job will also use an individually set environment and bind one of its inputs to a top-level pipeline job input:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: hello_pipeline_abc
compute: azureml:cpu-cluster
  
inputs:
  hello_string_top_level_input: "hello world"
jobs:
  a:
    command: echo hello ${{inputs.hello_string}}
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    inputs:
      hello_string: ${{parent.inputs.hello_string_top_level_input}}
  b:
    command: echo "world" >> ${{outputs.world_output}}/world.txt
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    outputs:
      world_output:
  c:
    command: echo ${{inputs.world_input}}/world.txt
    environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
    inputs:
      world_input: ${{parent.jobs.b.outputs.world_output}}

You can run this:

az ml job create -f jobs/basics/hello-pipeline-abc.yml --web

Train a model

In Azure Machine Learning, there are two main ways to train a model:

  1. Use automated ML to train models on your data and let the service find the best model for you. This approach maximizes productivity by automating the iterative process of tuning hyperparameters and trying out different algorithms.
  2. Train a model with your own custom training script. This approach offers the most control and lets you customize your training.

Train a model with automated ML

Automated ML is the easiest way to train a model because you don't need in-depth knowledge of how training algorithms work. You provide your training/validation/test datasets and some basic configuration parameters such as the ML task, target column, primary metric, and timeout, and the service trains multiple models for you, trying out various algorithms and hyperparameter combinations.

When you train with automated ML via the CLI (v2), you create a YAML file with an AutoML configuration and provide it to the CLI, which creates and submits the training job.

The following example shows an AutoML configuration file for training a classification model where:

  • The primary metric is accuracy.
  • The training has a timeout of 180 minutes.
  • The training data is in the folder "./training-mltable-folder". Automated ML jobs only accept data in the form of an MLTable.

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl

experiment_name: dpv2-cli-automl-classifier-experiment
# name: dpv2-cli-classifier-train-job-basic-01
description: A Classification job using bank marketing

compute: azureml:cpu-cluster

task: classification
primary_metric: accuracy

target_column_name: "y"
training_data:
  path: "./training-mltable-folder"
  type: mltable

limits:
  timeout_minutes: 180
  max_trials: 40
  enable_early_termination: true

featurization:
  mode: auto

The MLTable definition mentioned above is what points to the training data file, in this case a local .csv file that will be uploaded automatically:

paths:
  - file: ./bank_marketing_train_data.csv
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'

Finally, you can run it (create the AutoML job) with this CLI command:

az ml job create --file ./hello-automl-job-basic.yml

Or like the following, if you want to provide the workspace details explicitly instead of using the default workspace:

az ml job create --file ./hello-automl-job-basic.yml --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]

To investigate additional AutoML model training examples using other ML tasks such as regression, time-series forecasting, image classification, object detection, and NLP text classification, see the complete list of AutoML CLI examples.

Train a model with a custom script

To train by using your own custom script, the first thing you need is the Python script (.py) itself. Let's put some scikit-learn code with MLflow tracking into a Python script that trains a model on the Iris CSV:

# imports
import os
import mlflow
import argparse

import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# define functions
def main(args):
    # enable auto logging
    mlflow.autolog()

    # setup parameters
    params = {
        "C": args.C,
        "kernel": args.kernel,
        "degree": args.degree,
        "gamma": args.gamma,
        "coef0": args.coef0,
        "shrinking": args.shrinking,
        "probability": args.probability,
        "tol": args.tol,
        "cache_size": args.cache_size,
        "class_weight": args.class_weight,
        "verbose": args.verbose,
        "max_iter": args.max_iter,
        "decision_function_shape": args.decision_function_shape,
        "break_ties": args.break_ties,
        "random_state": args.random_state,
    }

    # read in data
    df = pd.read_csv(args.iris_csv)

    # process data
    X_train, X_test, y_train, y_test = process_data(df, args.random_state)

    # train model
    model = train_model(params, X_train, X_test, y_train, y_test)


def process_data(df, random_state):
    # split dataframe into X and y
    X = df.drop(["species"], axis=1)
    y = df["species"]

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # return split data
    return X_train, X_test, y_train, y_test


def train_model(params, X_train, X_test, y_train, y_test):
    # train model
    model = SVC(**params)
    model = model.fit(X_train, y_train)

    # return model
    return model


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--iris-csv", type=str)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--kernel", type=str, default="rbf")
    parser.add_argument("--degree", type=int, default=3)
    parser.add_argument("--gamma", type=str, default="scale")
    parser.add_argument("--coef0", type=float, default=0)
    parser.add_argument("--shrinking", type=bool, default=False)
    parser.add_argument("--probability", type=bool, default=False)
    parser.add_argument("--tol", type=float, default=1e-3)
    parser.add_argument("--cache_size", type=float, default=1024)
    parser.add_argument("--class_weight", type=dict, default=None)
    parser.add_argument("--verbose", type=bool, default=False)
    parser.add_argument("--max_iter", type=int, default=-1)
    parser.add_argument("--decision_function_shape", type=str, default="ovr")
    parser.add_argument("--break_ties", type=bool, default=False)
    parser.add_argument("--random_state", type=int, default=42)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)

The scikit-learn framework is supported by MLflow for autologging, so a single mlflow.autolog() call in the script will log all model parameters, training metrics, model artifacts, and some extra artifacts (in this case a confusion matrix image).

To run this in the cloud, specify it as a job:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python main.py 
  --iris-csv ${{inputs.iris_csv}}
  --C ${{inputs.C}}
  --kernel ${{inputs.kernel}}
  --coef0 ${{inputs.coef0}}
inputs:
  iris_csv: 
    type: uri_file
    path: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
  C: 0.8
  kernel: "rbf"
  coef0: 0.1
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.

And run it:

az ml job create -f jobs/single-step/scikit-learn/iris/job.yml --web

To register a model, you can upload the model files from the run to the model registry:

az ml model create -n sklearn-iris-example -v 1 -p runs:/$run_id/model --type mlflow_model

For the full set of configurable options for running command jobs, see the command job YAML schema reference.

Sweep hyperparameters

You can modify the previous job to sweep over hyperparameters:

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code: src
  command: >-
    python main.py 
    --iris-csv ${{inputs.iris_csv}}
    --C ${{search_space.C}}
    --kernel ${{search_space.kernel}}
    --coef0 ${{search_space.coef0}}
  environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
inputs:
  iris_csv: 
    type: uri_file
    path: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  C:
    type: uniform
    min_value: 0.5
    max_value: 0.9
  kernel:
    type: choice
    values: ["rbf", "linear", "poly"]
  coef0:
    type: uniform
    min_value: 0.1
    max_value: 1
objective:
  goal: minimize
  primary_metric: training_f1_score
limits:
  max_total_trials: 20
  max_concurrent_trials: 10
  timeout: 7200
display_name: sklearn-iris-sweep-example
experiment_name: sklearn-iris-sweep-example
description: Sweep hyperparameters for training a scikit-learn SVM on the Iris dataset.

And run it:

az ml job create -f jobs/single-step/scikit-learn/iris/job-sweep.yml --web

Tip

Check the "Child runs" tab in the studio to monitor progress and view parameter charts..

For the full set of configurable options for sweep jobs, see the sweep job YAML schema reference.

Distributed training

Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training. See the distributed section of the command job YAML syntax reference for details.

As an example, you can train a convolutional neural network (CNN) on the CIFAR-10 dataset using distributed PyTorch. The full script is available in the examples repository.

The CIFAR-10 dataset in torchvision expects as input a directory that contains the cifar-10-batches-py directory. You can download the zipped source and extract it into a local directory:


mkdir data

wget "https://azuremlexamples.blob.core.windows.net/datasets/cifar-10-python.tar.gz"

tar -xvzf cifar-10-python.tar.gz -C data

Then create an Azure Machine Learning data asset from the local directory, which will be uploaded to the default datastore:


az ml data create --name cifar-10-example --version 1 --set path=data

Optionally, remove the local file and directory:


rm cifar-10-python.tar.gz

rm -r data

Registered data assets can be used as inputs to jobs via the path field for a job input. The format is azureml:<data_name>:<data_version>, so for the CIFAR-10 data asset just created, it's azureml:cifar-10-example:1. You can optionally use the azureml:<data_name>@latest syntax instead if you want to reference the latest version of the data asset; Azure Machine Learning resolves that reference to the explicit version.

With the data asset in place, you can author a distributed PyTorch job to train our model:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: src
command: >-
  python train.py 
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar: 
     type: uri_folder
     path: azureml:cifar-10-example:1
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu@latest
compute: azureml:gpu-cluster
distribution:
  type: pytorch 
  process_count_per_instance: 1
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

And run it:

az ml job create -f jobs/single-step/pytorch/cifar-distributed/job.yml --web

Build a training pipeline

The CIFAR-10 example above translates well to a pipeline job. The previous job can be split into three jobs for orchestration in a pipeline:

  • "get-data" to run a Bash script to download and extract cifar-10-batches-py
  • "train-model" to take the data and train a model with distributed PyTorch
  • "eval-model" to take the data and the trained model and evaluate accuracy

Both "train-model" and "eval-model" will have a dependency on the "get-data" job's output. Additionally, "eval-model" will have a dependency on the "train-model" job's output. Thus the three jobs will run sequentially.

Pipelines can also be written using reusable components. For more information, see Create and run component-based machine learning pipelines with the Azure Machine Learning CLI (Preview).

Next steps