Train models (create jobs) with the CLI (v2)

The Azure Machine Learning CLI (v2) is an Azure CLI extension enabling you to accelerate the model training process while scaling up and out on Azure compute, with the model lifecycle tracked and auditable.

Training a machine learning model is typically an iterative process. Modern tooling makes it easier than ever to train larger models on more data faster. Previously tedious manual processes like hyperparameter tuning and even algorithm selection are often automated. With the Azure Machine Learning CLI you can track your jobs (and models) in a workspace, run hyperparameter sweeps, scale up on high-performance Azure compute, and scale out with distributed training.

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Prerequisites

To run the commands in this article, you need an Azure subscription, an Azure Machine Learning workspace, and the Azure CLI with the ml extension (v2) installed.

Tip

For a full-featured development environment, use Visual Studio Code and the Azure Machine Learning extension to manage Azure Machine Learning resources and train machine learning models.

Clone examples repository

To run the training examples, first clone the examples repository and change into the cli directory:

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli

Note that --depth 1 clones only the latest commit of the repository, which reduces the time it takes to complete the operation.

Create compute

You can create an Azure Machine Learning compute cluster from the command line. For instance, the following commands will create one cluster named cpu-cluster and one named gpu-cluster.

az ml compute create -n cpu-cluster --type amlcompute --min-instances 0 --max-instances 10 
az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12

Note that you are not charged for compute at this point, because cpu-cluster and gpu-cluster will remain at zero nodes until a job is submitted. Learn more about how to manage and optimize cost for AmlCompute.

The example jobs in this article use cpu-cluster or gpu-cluster. Adjust the names as needed to match your cluster(s).

Use az ml compute create -h for more details on compute create options.

Note

To create and attach a compute target for training on an Azure Arc-enabled Kubernetes cluster, see Configure Azure Arc-enabled Machine Learning.

Introducing jobs

For the Azure Machine Learning CLI (v2), jobs are authored in YAML format. A job aggregates:

  • What to run
  • How to run it
  • Where to run it

The "hello world" job has all three:

command: python -c "print('hello world')"
environment:
  docker:
    image: python
compute:
  target: azureml:cpu-cluster

You can run it with:

az ml job create -f jobs/misc/hello-world.yml --web

However, this is just an example job that doesn't output anything other than a line in the log file. Typically you want to generate additional artifacts, such as model binaries and accompanying metadata, in addition to the system-generated logs.

Azure Machine Learning captures the following artifacts automatically:

  • The ./outputs and ./logs directories receive special treatment by Azure Machine Learning. If you write any files to these directories during your job, these files will get uploaded to the job's run history so that you can still access them once the job is complete. The ./outputs folder is uploaded at the end of the job, while the files written to ./logs are uploaded in real time. Use the latter if you want to stream logs during the job, such as TensorBoard logs.
  • Azure Machine Learning integrates with MLflow's tracking functionality. You can use mlflow.autolog() for several common ML frameworks to log model parameters, performance metrics, model artifacts, and even feature importance graphs. You can also use the mlflow.log_*() methods to explicitly log parameters, metrics, and artifacts. All MLflow-logged metrics and artifacts will be saved in the job's run history.
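
For example, a training script might combine both mechanisms: write plain files to ./outputs and log metrics through MLflow. A minimal, hypothetical sketch (the metric name, file name, and values are placeholders, not taken from the example project):

import os
import mlflow

os.makedirs("outputs", exist_ok=True)    # files written here are uploaded to the run history at job end

mlflow.autolog()                         # auto-log parameters, metrics, and models for supported frameworks

# ... train a model here ...

mlflow.log_metric("accuracy", 0.97)      # explicit metric logging (placeholder value)
with open("outputs/summary.txt", "w") as f:
    f.write("training complete\n")       # plain file artifact, available after the job completes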

Often, a job involves running some source code that is edited and controlled locally. You can specify a source code directory to include in the job, from which the command will be run.

For instance, look at the jobs/train/lightgbm/iris project directory in the examples repository:

.
├── job-sweep.yml
├── job.yml
└── src
    └── main.py

This directory contains two job files and a source code subdirectory src. While this example only has a single file under src, the entire subdirectory is recursively uploaded and available for use in the job.

The command job is configured via the job.yml:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: 
  local_path: src
command: >-
  python main.py 
  --iris-csv {inputs.iris}
inputs:
  iris:
    data:
      path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
    mode: mount
environment: azureml:AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu:8
compute:
  target: azureml:cpu-cluster
experiment_name: lightgbm-iris-example
description: Train a LightGBM model on the Iris dataset.

You can run it with:

az ml job create -f jobs/train/lightgbm/iris/job.yml --web

Basic Python training job

Let's review the job YAML file in detail:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: 
  local_path: src
command: >-
  python main.py 
  --iris-csv {inputs.iris}
inputs:
  iris:
    data:
      path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
    mode: mount
environment: azureml:AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu:8
compute:
  target: azureml:cpu-cluster
experiment_name: lightgbm-iris-example
description: Train a LightGBM model on the Iris dataset.
Each key is described below:

  • $schema - [Optional] The YAML schema. You can view the schema in the above example in a browser to see all available options for a command job YAML file. If you use the Azure Machine Learning VS Code extension to author the YAML file, including this $schema property at the top of your file enables you to invoke schema and resource completions.
  • code.local_path - [Optional] The local path to the source directory, relative to the YAML file, to be uploaded and used with the job. Consider using src in the same directory as the job file(s) for consistency.
  • command - The command to execute. The > convention allows for authoring readable multiline commands by folding newlines to spaces. Command-line arguments can be explicitly written into the command or inferred from other sections, specifically inputs or search_space, using curly-brace notation.
  • inputs - [Optional] A dictionary of the input data bindings, where the key is a name that you specify for the input binding. The value for each element is the input binding, which consists of data and mode fields. data can either be 1) a reference to an existing versioned Azure Machine Learning data asset by using the azureml: prefix (e.g. azureml:iris-url:1 to point to version 1 of a data asset named "iris-url") or 2) an inline definition of the data. Use data.path to specify a cloud location. Use data.local_path to specify data from the local filesystem, which will be uploaded to the default datastore. mode indicates how you want the data made available on the compute for the job. "mount" and "download" are the two supported options. An input can be referred to in the command by its name, such as {inputs.my_input_name}. Azure Machine Learning will then resolve that parameterized notation in the command to the location of that data on the compute target during runtime. For example, if the data is configured to be mounted, {inputs.my_input_name} will resolve to the mount point.
  • environment - The environment to execute the command on the compute target with. You can define the environment inline by specifying the Docker image to use or the Dockerfile for building the image. You can also refer to an existing versioned environment in the workspace, or one of Azure ML's curated environments, using the azureml: prefix. For instance, azureml:AzureML-TensorFlow2.4-Cuda11-OpenMpi4.1.0-py36:1 would refer to version 1 of a curated environment for TensorFlow with GPU support. Python must be installed in the environment used for training. Run apt-get update -y && apt-get install python3 -y in your Dockerfile to install it if needed.
  • compute.target - The compute target. Specify local for local execution, or use the azureml: prefix to reference an existing compute resource in your workspace. For instance, azureml:cpu-cluster would point to a compute target named "cpu-cluster".
  • experiment_name - [Optional] Tags the job for better organization in the Azure Machine Learning studio. Each job's run record will be organized under the corresponding experiment in the studio's "Experiments" tab. If omitted, it will default to the name of the working directory where the job was created.
  • name - [Optional] The name of the job, which must be unique across all jobs in a workspace. Unless a name is specified either in the YAML file via the name field or on the command line via --name/-n, a GUID/UUID is automatically generated and used for the name. The job name corresponds to the Run ID in the studio UI of the job's run record.
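
To see how these pieces connect, it helps to look at the training script side: the resolved {inputs.iris} path arrives as an ordinary command-line argument, and MLflow captures parameters, metrics, and the model. The following is a hypothetical sketch of what a script like src/main.py might contain; the actual script in the examples repository may differ, and the column names are assumptions about iris.csv:

# hypothetical sketch of a LightGBM training script driven by job.yml
import argparse

import lightgbm as lgb
import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--iris-csv", type=str)        # receives the resolved {inputs.iris} location
args = parser.parse_args()

mlflow.autolog()                                   # logs LightGBM parameters, metrics, and the model

df = pd.read_csv(args.iris_csv)
X = df.drop("species", axis=1)                     # "species" column name is an assumption
y = df["species"].astype("category").cat.codes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"objective": "multiclass", "num_class": 3, "metric": "multi_logloss"}
lgb.train(params,
          lgb.Dataset(X_train, label=y_train),
          valid_sets=[lgb.Dataset(X_test, label=y_test)],
          valid_names=["test"])                    # yields metrics such as test-multi_logloss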

Creating this job uploads any specified local assets, like the source code directory, validates the YAML file, and submits the run. If needed, the environment is built, then the compute is scaled up and configured for running the job.

To run the lightgbm/iris training job:

az ml job create -f jobs/train/lightgbm/iris/job.yml --web

Once the job is complete, you can download the outputs:

az ml job download -n $run_id --outputs

Important

Replace $run_id with your run ID, which can be found in the console output or in the studio's run details page.

This will download the logs and any captured artifacts locally in a directory named $run_id. For this example, the MLflow-logged model subdirectory will be downloaded.
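
If the model was logged through MLflow, you can load the downloaded copy locally for a quick sanity check. A sketch, assuming the model artifacts end up in a subdirectory named model under the download directory (check the actual folder layout, which may differ):

import mlflow.pyfunc

# the relative path below is an assumption; adjust to where the model artifacts actually land
model = mlflow.pyfunc.load_model("<run-id>/model")
# predictions = model.predict(some_input_dataframe)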

Sweep hyperparameters

Azure Machine Learning also enables you to more efficiently tune the hyperparameters for your machine learning models. You can configure a hyperparameter tuning job, called a sweep job, and submit it via the CLI.

You can modify the job.yml into job-sweep.yml to sweep over hyperparameters:

$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep_job
algorithm: random
trial:
  code: 
    local_path: src 
  command: >-
    python main.py 
    --iris-csv {inputs.iris}
    --learning-rate {search_space.learning_rate}
    --boosting {search_space.boosting}
  inputs:
    iris:
      data:
        path: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
      mode: mount
  environment: azureml:AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu:8
  compute:
    target: azureml:cpu-cluster
search_space:
  learning_rate:
    type: uniform
    min_value: 0.01
    max_value: 0.9
  boosting:
    type: choice
    values: ["gbdt", "dart"]
objective:
  primary_metric: test-multi_logloss
  goal: minimize
max_total_trials: 20
max_concurrent_trials: 10
timeout_minutes: 120
experiment_name: lightgbm-iris-sweep-example
description: Run a hyperparameter sweep job for LightGBM on Iris dataset.
Each key is described below:

  • $schema - [Optional] The YAML schema, which has changed and now points to the sweep job schema.
  • type - The job type.
  • algorithm - The sampling algorithm. "random" is often a good choice. See the schema for the enumeration of options.
  • trial - The command job configuration for each trial to be run. The command (trial.command) has been modified from the previous example to use the {search_space.<hyperparameter_name>} notation to reference the hyperparameters defined in the search_space. Azure Machine Learning will then resolve each parameterized notation to the value of the corresponding hyperparameter that it generates for each trial.
  • search_space - A dictionary of the hyperparameters to sweep over. The key is a name for the hyperparameter, for example, search_space.learning_rate. Note that the name doesn't have to match the training script's argument itself; it just has to match the search space reference in the curly-brace notation in the command, e.g. {search_space.learning_rate}. The value is the hyperparameter distribution. See the schema for the enumeration of options.
  • objective.primary_metric - The optimization metric, which must match the name of a metric logged from the training code. objective.goal specifies the direction ("minimize"/"maximize"). See the schema for the full enumeration of options.
  • max_total_trials - The maximum number of individual trials to run.
  • max_concurrent_trials - [Optional] The maximum number of trials to run concurrently on your compute cluster.
  • timeout_minutes - [Optional] The maximum number of minutes to run the sweep job for.
  • experiment_name - [Optional] The experiment to track the sweep job under. If omitted, it will default to the name of the working directory when the job is created.
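
From the trial script's perspective, the swept values arrive as plain command-line arguments; nothing sweep-specific is required beyond logging the primary metric. A hypothetical sketch of the extra argument handling (argument and parameter names are illustrative):

# hypothetical: extra arguments the trial script accepts for the sweep
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--iris-csv", type=str)
parser.add_argument("--learning-rate", type=float, default=0.1)  # filled from {search_space.learning_rate}
parser.add_argument("--boosting", type=str, default="gbdt")      # filled from {search_space.boosting}
args = parser.parse_args()

# the values are then passed straight into the LightGBM parameters
params = {
    "objective": "multiclass",
    "num_class": 3,
    "metric": "multi_logloss",
    "learning_rate": args.learning_rate,
    "boosting": args.boosting,
}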

Create the job and open in the studio:

az ml job create -f jobs/train/lightgbm/iris/job-sweep.yml --web

Tip

Hyperparameter sweeps can be used with distributed command jobs.

Distributed training

You can specify the distribution section in a command job. Azure ML supports distributed training for PyTorch, TensorFlow, and MPI-compatible frameworks. The pytorch and tensorflow distribution types enable native distributed training for the respective frameworks, such as the tf.distribute.Strategy APIs for TensorFlow.

Be sure to set the compute.instance_count, which defaults to 1, to the desired number of nodes for the job.

PyTorch

An example YAML file for distributed PyTorch training on the CIFAR-10 dataset:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: 
  local_path: src
command: >-
  python train.py 
  --epochs 1
  --data-dir {inputs.cifar}
inputs:
  cifar:
    data:
      local_path: data
    mode: mount
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:3
compute:
  target: azureml:gpu-cluster
  instance_count: 2
distribution:
  type: pytorch 
  process_count: 2
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.

Notice that this job refers to local data, which is not present in the cloned examples repository. First download and extract the CIFAR-10 dataset, and place it in the expected location in the project directory:

wget "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
tar -xvzf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz
mkdir jobs/train/pytorch/cifar-distributed/data
mv cifar-10-batches-py jobs/train/pytorch/cifar-distributed/data

Create the job and open in the studio:

az ml job create -f jobs/train/pytorch/cifar-distributed/job.yml --web
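
With distribution.type: pytorch, Azure ML launches the configured number of processes and sets the standard torch.distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK) for each one. A minimal sketch of the setup a script like train.py might perform, assuming the NCCL backend on GPU and env:// initialization; the model is a placeholder:

# sketch: process-group setup inside a distributed PyTorch training script
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl", init_method="env://")   # reads RANK/WORLD_SIZE/MASTER_ADDR
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda(local_rank)                 # placeholder model
model = DDP(model, device_ids=[local_rank])
# ... training loop; use torch.utils.data.distributed.DistributedSampler so each process sees a shard ...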

TensorFlow

An example YAML file for distributed TensorFlow training on the MNIST dataset:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code:
  local_path: src
command: >-
  python train.py 
  --epochs 1
  --model-dir outputs/keras-model
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:11
compute:
  target: azureml:gpu-cluster
  instance_count: 2
distribution:
  type: tensorflow
  worker_count: 2
experiment_name: tensorflow-mnist-distributed-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.
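
With distribution.type: tensorflow, Azure ML sets the TF_CONFIG environment variable for each worker, which is what the native tf.distribute multi-worker strategies consume. A minimal sketch of how a script like train.py might use it (the layer sizes are placeholders; the save path mirrors the --model-dir value in the command above):

# sketch: multi-worker Keras training relying on TF_CONFIG set by the launcher
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():                    # variables created here are mirrored across workers
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=1)
model.save("outputs/keras-model")         # matches --model-dir in the job YAML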

Create the job and open in the studio:

az ml job create -f jobs/train/tensorflow/mnist-distributed/job.yml --web

MPI

Azure ML supports launching an MPI job across multiple nodes and multiple processes per node. It launches the job via mpirun. If your training code uses the Horovod framework for distributed training, for example, you can leverage this job type to train on Azure ML.

To launch an MPI job, specify mpi as the type and the number of processes per node to launch (process_count_per_instance) in the distribution section. If this field is not specified, Azure ML will default to launching one process per node.
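
For example, a Horovod-based TensorFlow script typically initializes Horovod, pins each process to a single GPU, and wraps the optimizer. A minimal, hypothetical sketch (not the actual train.py from the examples repository):

# sketch: typical Horovod setup inside an MPI-launched TensorFlow training script
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                    # one Horovod process per mpirun rank

gpus = tf.config.experimental.list_physical_devices("GPU")    # pin each process to one GPU
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])   # placeholder model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))   # scale LR by worker count
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
# model.fit(x, y, callbacks=callbacks)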

An example YAML specification, which runs a TensorFlow job on MNIST using Horovod:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code:
  local_path: src
command: >-
  python train.py
  --epochs 1
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu:11
compute:
  target: azureml:gpu-cluster
  instance_count: 2
distribution:
  type: mpi
  process_count_per_instance: 2
experiment_name: tensorflow-mnist-distributed-horovod-example
description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via Horovod.

Create the job and open in the studio:

az ml job create -f jobs/train/tensorflow/mnist-distributed-horovod/job.yml --web

Next steps