Train models with the CLI (v2) (preview)
The Azure Machine Learning CLI (v2) is an Azure CLI extension enabling you to accelerate the model training process while scaling up and out on Azure compute, with the model lifecycle tracked and auditable.
Training a machine learning model is typically an iterative process. Modern tooling makes it easier than ever to train larger models on more data faster. Previously tedious manual processes like hyperparameter tuning and even algorithm selection are often automated. With the Azure Machine Learning CLI (v2), you can track your jobs (and models) in a workspace, run hyperparameter sweeps, scale up on high-performance Azure compute, and scale out with distributed training.
Important
This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Prerequisites
- To use the CLI (v2), you must have an Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.
- Install and set up CLI (v2).
Tip
For a full-featured development environment with schema validation and autocompletion for job YAMLs, use Visual Studio Code and the Azure Machine Learning extension.
Clone examples repository
To run the training examples, first clone the examples repository and change into the cli directory:
git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli
Using --depth 1 clones only the latest commit to the repository, which reduces time to complete the operation.
Create compute
You can create an Azure Machine Learning compute cluster from the command line. For instance, the following commands will create one cluster named cpu-cluster and one named gpu-cluster.
az ml compute create -n cpu-cluster --type amlcompute --min-instances 0 --max-instances 8
az ml compute create -n gpu-cluster --type amlcompute --min-instances 0 --max-instances 4 --size Standard_NC12
You are not charged for compute at this point as cpu-cluster and gpu-cluster will remain at zero nodes until a job is submitted. Learn more about how to manage and optimize cost for AmlCompute.
The following example jobs in this article use one of cpu-cluster or gpu-cluster. Adjust these names in the example jobs throughout this article as needed to the name of your cluster(s). Use az ml compute create -h for more details on compute create options.
Hello world
For the Azure Machine Learning CLI (v2), jobs are authored in YAML format. A job aggregates:
- What to run
- How to run it
- Where to run it
The "hello world" job has all three:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: python:latest
compute: azureml:cpu-cluster
Warning
Python must be installed in the environment used for jobs. Run apt-get update -y && apt-get install python3 -y in your Dockerfile to install if needed, or derive from a base image with Python installed already.
Tip
The $schema: throughout examples allows for schema validation and autocompletion if authoring YAML files in VSCode with the Azure Machine Learning extension.
You can run this job:
az ml job create -f jobs/basics/hello-world.yml --web
Tip
The --web parameter will attempt to open your job in the Azure Machine Learning studio using your default web browser. The --stream parameter can be used to stream logs to the console and block further commands.
Overriding values on create or update
YAML job specification values can be overridden using --set when creating or updating a job. For instance:
az ml job create -f jobs/basics/hello-world.yml \
--set environment.image="python:3.8" \
--web
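To make the dotted-path form of --set concrete, the following minimal Python sketch applies a key.path=value override to a nested dictionary that mirrors the "hello world" job. It only illustrates the idea: the helper apply_override is hypothetical and is not how the CLI is implemented, and it treats every value as a string.

```python
import copy


def apply_override(spec: dict, assignment: str) -> dict:
    """Return a copy of spec with one 'a.b=value' override applied.

    Hypothetical helper for illustration only; not part of the CLI.
    """
    path, _, value = assignment.partition("=")
    keys = path.split(".")
    result = copy.deepcopy(spec)  # leave the original specification untouched
    node = result
    for key in keys[:-1]:
        # walk down the nested mappings, creating them if they are missing
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


# dictionary form of the "hello world" job specification
job = {
    "command": 'echo "hello world"',
    "environment": {"image": "python:latest"},
    "compute": "azureml:cpu-cluster",
}

updated = apply_override(job, "environment.image=python:3.8")
print(updated["environment"]["image"])  # python:3.8
```

Only the addressed key changes; every other value in the specification is kept as written in the YAML file.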
Job names
Most az ml job commands other than create and list require --name/-n, which is the job's name, shown as the "Run ID" in the studio. You should not set a job's name property directly during creation, because it must be unique per workspace. If the name is not set, Azure Machine Learning generates a random GUID for it. You can obtain the name from the output of job creation in the CLI, or by copying the "Run ID" property in the studio or from the MLflow APIs.
To automate jobs in scripts and CI/CD flows, you can capture a job's name when it is created by querying and stripping the output by adding --query name -o tsv. The specifics will vary by shell, but for Bash:
run_id=$(az ml job create -f jobs/basics/hello-world.yml --query name -o tsv)
Then use $run_id in subsequent commands like update, show, or stream:
az ml job show -n $run_id --web
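If your automation is written in Python rather than Bash, the same capture can be sketched with the standard library. This is a minimal sketch, assuming the az executable is on the PATH and that the CLI (v2) extension is installed; the helper names build_create_command and create_job are hypothetical.

```python
import subprocess


def build_create_command(job_file: str) -> list:
    """Build the same command line as the Bash example above."""
    return ["az", "ml", "job", "create", "-f", job_file, "--query", "name", "-o", "tsv"]


def create_job(job_file: str) -> str:
    """Submit the job and return its name (the "Run ID").

    Assumes `az` is on the PATH; raises CalledProcessError if the CLI fails.
    """
    result = subprocess.run(
        build_create_command(job_file),
        capture_output=True,
        text=True,
        check=True,
    )
    # -o tsv prints the bare value followed by a newline
    return result.stdout.strip()


# show the command that would be executed, without submitting anything
print(" ".join(build_create_command("jobs/basics/hello-world.yml")))
```

Calling create_job("jobs/basics/hello-world.yml") submits the job and returns the name, which you can pass to later commands in place of $run_id.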
Organize jobs
To organize jobs, you can set a display name, experiment name, description, and tags. Descriptions support markdown syntax in the studio. These properties are mutable after a job is created. A full example:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: python:latest
compute: azureml:cpu-cluster
tags:
  hello: world
display_name: hello-world-example
experiment_name: hello-world-example
description: |
  # Azure Machine Learning "hello world" job
  This is a "hello world" job running in the cloud via Azure Machine Learning!
  ## Description
  Markdown is supported in the studio for job descriptions! You can edit the description there or via CLI.
You can run this job, where these properties will be immediately visible in the studio:
az ml job create -f jobs/basics/hello-world-org.yml --web
Using --set you can update the mutable values after the job is created:
az ml job update -n $run_id --set \
display_name="updated display name" \
experiment_name="updated experiment name" \
description="updated description" \
tags.hello="updated tag"
Environment variables
You can set environment variables for use in your job:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo $hello_env_var
environment:
  image: python:latest
compute: azureml:cpu-cluster
environment_variables:
  hello_env_var: "hello world"
You can run this job:
az ml job create -f jobs/basics/hello-world-env-var.yml --web
Warning
You should use inputs for parameterizing arguments in the command. See inputs and outputs.
Track models and source code
Production machine learning models need to be auditable (if not reproducible). It is crucial to keep track of the source code for a given model. Azure Machine Learning takes a snapshot of your source code and keeps it with the job. Additionally, the source repository and commit are kept if you are running jobs from a Git repository.
Tip
If you're following along and running from the examples repository, you can see the source repository and commit in the studio on any of the jobs run so far.
You can specify the code.local_path key in a job with the value as the path to a source code directory. A snapshot of the directory is taken and uploaded with the job. The contents of the directory are directly available from the working directory of the job.
Warning
The source code should not include large data inputs for model training. Instead, use data inputs. You can use a .gitignore file in the source code directory to exclude files from the snapshot. The limits for snapshot size are 300 MB or 2000 files.
Let's look at a job that specifies code:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  pip install mlflow azureml-mlflow
  &&
  python hello-mlflow.py
code:
  local_path: src
environment:
  image: python:3.8
compute: azureml:cpu-cluster
The Python script is in the local source code directory. The command then invokes python to run the script. The same pattern can be applied for other programming languages.
Warning
The "hello" family of jobs shown in this article are for demonstration purposes and do not necessarily follow recommended best practices. Using && or similar to run many commands in a sequence is not recommended -- instead, consider writing the commands to a script file in the source code directory and invoking the script in your command. Installing dependencies in the command, as shown above via pip install, is not recommended -- instead, all job dependencies should be specified as part of your environment. See how to manage environments with the CLI (v2) for details.
Model tracking with MLflow
While iterating on models, data scientists need to be able to keep track of model parameters and training metrics. Azure Machine Learning integrates with MLflow tracking to enable the logging of models, artifacts, metrics, and parameters to a job. To use MLflow in your Python scripts add import mlflow and call mlflow.log_* or mlflow.autolog() APIs in your training code.
Warning
The mlflow and azureml-mlflow packages must be installed in your Python environment for MLflow tracking features.
Tip
The mlflow.autolog() call is supported for many popular frameworks and takes care of the majority of logging for you.
Let's take a look at the Python script invoked in the job above, which uses mlflow to log a parameter, a metric, and an artifact:
# imports
import os
import mlflow
from random import random


# define functions
def main():
    mlflow.log_param("hello_param", "world")
    mlflow.log_metric("hello_metric", random())
    os.system(f"echo 'hello world' > helloworld.txt")
    mlflow.log_artifact("helloworld.txt")


# run functions
if __name__ == "__main__":
    # run main function
    main()
You can run this job in the cloud via Azure Machine Learning, where it is tracked and auditable:
az ml job create -f jobs/basics/hello-mlflow.yml --web
Query metrics with MLflow
After running jobs, you might want to query the jobs' run results and their logged metrics. Python is better suited for this task than a CLI. You can query runs and their metrics via mlflow and load into familiar objects like Pandas dataframes for analysis.
First, retrieve the MLflow tracking URI for your Azure Machine Learning workspace:
az ml workspace show --query mlflow_tracking_uri -o tsv
Use the output of this command in mlflow.set_tracking_uri(<YOUR_TRACKING_URI>) from a Python environment with MLflow imported. MLflow calls will now correspond to jobs in your Azure Machine Learning workspace.
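As a minimal sketch of the query step, the following Python loads the runs of one experiment into a Pandas dataframe. It assumes the mlflow and azureml-mlflow packages are installed and that you are authenticated to the workspace; the helper names load_runs and metric_columns are hypothetical, and hello-sweep-example is only an example experiment name.

```python
def metric_columns(columns):
    """Pick the metric columns out of MLflow's flattened run table.

    mlflow.search_runs names its columns "metrics.<name>", "params.<name>",
    and "tags.<name>", alongside fields such as "run_id".
    """
    return [column for column in columns if column.startswith("metrics.")]


def load_runs(tracking_uri: str, experiment_name: str):
    """Return the runs of one experiment as a Pandas dataframe."""
    # imported lazily so this module can be imported without mlflow installed
    import mlflow

    mlflow.set_tracking_uri(tracking_uri)
    experiment = mlflow.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"experiment not found: {experiment_name}")
    return mlflow.search_runs(experiment_ids=[experiment.experiment_id])


# Example usage (requires a real tracking URI, so it is left commented out):
# runs = load_runs("<YOUR_TRACKING_URI>", "hello-sweep-example")
# print(runs[["run_id"] + metric_columns(runs.columns)])
```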
Inputs and outputs
Jobs typically have inputs and outputs. Inputs can be model parameters, which might be swept over for hyperparameter optimization, or cloud data inputs that are mounted or downloaded to the compute target. Outputs (ignoring metrics) are artifacts that can be written or copied to the default outputs or a named data output.
Literal inputs
Literal inputs are directly resolved in the command. You can modify our "hello world" job to use literal inputs:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo ${{inputs.hello_string}}
  &&
  echo ${{inputs.hello_number}}
environment:
  image: python:latest
inputs:
  hello_string: "hello world"
  hello_number: 42
compute: azureml:cpu-cluster
You can run this job:
az ml job create -f jobs/basics/hello-world-input.yml --web
You can use --set to override inputs:
az ml job create -f jobs/basics/hello-world-input.yml --set \
inputs.hello_string="hello there" \
inputs.hello_number=24 \
--web
Literal inputs to jobs can be converted to search space inputs for hyperparameter sweeps on model training.
Search space inputs
For a sweep job, you can specify a search space for literal inputs to be chosen from. For the full range of options for search space inputs, see the sweep job YAML syntax reference.
Warning
Sweep jobs are not currently supported in pipeline jobs.
Let's demonstrate the concept with a simple Python script that takes in arguments and logs a random metric:
# imports
import os
import mlflow
import argparse
from random import random


# define functions
def main(args):
    # print inputs
    print(f"A: {args.A}")
    print(f"B: {args.B}")
    print(f"C: {args.C}")

    # log inputs as parameters
    mlflow.log_param("A", args.A)
    mlflow.log_param("B", args.B)
    mlflow.log_param("C", args.C)

    # log a random metric
    mlflow.log_metric("random_metric", random())


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--A", type=float, default=0.5)
    parser.add_argument("--B", type=str, default="hello world")
    parser.add_argument("--C", type=float, default=1.0)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)
And create a corresponding sweep job:
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    pip install mlflow azureml-mlflow
    &&
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code:
    local_path: src
  environment:
    image: python:3.8
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
And run it:
az ml job create -f jobs/basics/hello-sweep.yml --web
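To build intuition for what a random sampling algorithm does with the choice and uniform search space types used above, here is a minimal Python sketch that draws four trials locally. It is an illustration under stated assumptions, not the service's sampler: the sample function is hypothetical and handles only these two types.

```python
import random


def sample(search_space: dict, rng: random.Random) -> dict:
    """Draw one value per search space entry (hypothetical helper)."""
    trial = {}
    for name, spec in search_space.items():
        if spec["type"] == "choice":
            # pick one of the listed values with equal probability
            trial[name] = rng.choice(spec["values"])
        elif spec["type"] == "uniform":
            # pick a float between min_value and max_value
            trial[name] = rng.uniform(spec["min_value"], spec["max_value"])
        else:
            raise ValueError(f"unsupported search space type: {spec['type']}")
    return trial


# dictionary form of the search_space section of the sweep job above
search_space = {
    "B": {"type": "choice", "values": ["hello", "world", "hello world"]},
    "C": {"type": "uniform", "min_value": 0.1, "max_value": 1.0},
}

rng = random.Random(0)  # seeded so the sketch is repeatable
trials = [sample(search_space, rng) for _ in range(4)]  # max_total_trials: 4
for trial in trials:
    print(trial)
```

Each printed dictionary corresponds to the values that one trial would receive for ${{search_space.B}} and ${{search_space.C}}, while ${{inputs.A}} stays fixed at 0.5.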
Data inputs
Data inputs are resolved to a path on the job compute's local filesystem. Let's demonstrate with the classic Iris dataset, which is hosted publicly in a blob container at https://azuremlexamples.blob.core.windows.net/datasets/iris.csv.
You can author a Python script that takes the path to the Iris CSV file as an argument, reads it into a dataframe, prints the first 5 lines, and saves it to the outputs directory.
# imports
import os
import argparse
import pandas as pd


# define functions
def main(args):
    # read in data
    df = pd.read_csv(args.iris_csv)

    # print first 5 lines
    print(df.head())

    # ensure outputs directory exists
    os.makedirs("outputs", exist_ok=True)

    # save data to outputs
    df.to_csv("outputs/iris.csv", index=False)


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--iris-csv", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)
Azure storage URI inputs can be specified, which will mount or download data to the local filesystem. You can specify a single file:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo "--iris-csv: ${{inputs.iris_csv}}"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.iris_csv}}
code:
  local_path: src
inputs:
  iris_csv:
    file: https://azuremlexamples.blob.core.windows.net/datasets/iris.csv
environment:
  image: python:latest
compute: azureml:cpu-cluster
And run:
az ml job create -f jobs/basics/hello-iris-file.yml --web
Or specify an entire folder:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  ls ${{inputs.data_dir}}
  &&
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.data_dir}}/iris.csv
code:
  local_path: src
inputs:
  data_dir:
    folder: wasbs://datasets@azuremlexamples.blob.core.windows.net/
environment:
  image: python:latest
compute: azureml:cpu-cluster
And run:
az ml job create -f jobs/basics/hello-iris-folder.yml --web
Private data
For private data in Azure Blob Storage or Azure Data Lake Storage connected to Azure Machine Learning through a datastore, you can use Azure Machine Learning URIs of the format azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA> for input data. For instance, if you upload the Iris CSV to a directory named /example-data/ in the Blob container corresponding to the datastore named workspaceblobstore you can modify a previous job to use the file in the datastore:
Warning
Running these jobs will fail for you if you have not copied the Iris CSV to the same location in workspaceblobstore.
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  echo "--iris-csv: ${{inputs.iris_csv}}"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.iris_csv}}
code:
  local_path: src
inputs:
  iris_csv:
    file: azureml://datastores/workspaceblobstore/paths/example-data/iris.csv
environment:
  image: python:latest
compute: azureml:cpu-cluster
Or the entire directory:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >-
  ls ${{inputs.data_dir}}
  &&
  echo "--iris-csv: ${{inputs.data_dir}}/iris.csv"
  &&
  pip install pandas
  &&
  python hello-iris.py
  --iris-csv ${{inputs.data_dir}}/iris.csv
code:
  local_path: src
inputs:
  data_dir:
    folder: azureml://datastores/workspaceblobstore/paths/example-data/
    mode: rw_mount
environment:
  image: python:latest
compute: azureml:cpu-cluster
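The datastore URI format used in these jobs, azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA>, can be composed and checked with a few lines of Python before you submit a job. This is a minimal sketch; datastore_uri and parse_datastore_uri are hypothetical helpers, and the pattern only encodes the format stated in this article.

```python
import re

# pattern for azureml://datastores/<DATASTORE_NAME>/paths/<PATH_TO_DATA>
DATASTORE_URI = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def datastore_uri(datastore: str, path: str) -> str:
    """Compose a datastore URI, dropping any leading slash from the path."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


def parse_datastore_uri(uri: str) -> tuple:
    """Split a datastore URI into (datastore name, path to data)."""
    match = DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri}")
    return match["datastore"], match["path"]


uri = datastore_uri("workspaceblobstore", "/example-data/iris.csv")
print(uri)
print(parse_datastore_uri(uri))
```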
Default outputs
The ./outputs and ./logs directories receive special treatment by Azure Machine Learning. If you write any files to these directories during your job, these files will get uploaded to the job so that you can still access them once it is complete. The ./outputs folder is uploaded at the end of the job, while the files written to ./logs are uploaded in real time. Use the latter if you want to stream logs during the job, such as TensorBoard logs.
You can modify the "hello world" job to output to a file in the default outputs directory instead of printing to stdout:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ./outputs/helloworld.txt
environment:
  image: python:latest
compute: azureml:cpu-cluster
You can run this job:
az ml job create -f jobs/basics/hello-world-output.yml --web
And download the logs, where helloworld.txt will be present in the <RUN_ID>/outputs/ directory:
az ml job download -n $run_id
Data outputs
You can specify named data outputs. This will create a directory in the default datastore which will be read/write mounted by default.
You can modify the earlier "hello world" job to write to a named data output:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world" > ${{outputs.hello_output}}/helloworld.txt
outputs:
  hello_output:
environment:
  image: python
compute: azureml:cpu-cluster
Hello pipelines
Pipeline jobs can run multiple jobs in parallel or in sequence. If there are input/output dependencies between steps in a pipeline, the dependent step will run after the other completes.
You can split a "hello world" job into two jobs:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  hello_job:
    command: echo "hello"
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
  world_job:
    command: echo "world"
    environment:
      image: python
    compute: azureml:cpu-cluster
And run it:
az ml job create -f jobs/basics/hello-pipeline.yml --web
The "hello" and "world" jobs will run in parallel if the compute target has the available resources to do so.
To pass data between steps in a pipeline, define a data output in the "hello" job and a corresponding input in the "world" job that refers to the output of the "hello" job:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
jobs:
  hello_job:
    command: echo "hello" && echo "world" > ${{outputs.world_output}}/world.txt
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    outputs:
      world_output:
  world_job:
    command: cat ${{inputs.world_input}}/world.txt
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    inputs:
      world_input: ${{jobs.hello_job.outputs.world_output}}
And run it:
az ml job create -f jobs/basics/hello-pipeline-io.yml --web
This time, the "world" job will run after the "hello" job completes.
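The ordering follows from the ${{jobs.<JOB>.outputs.<OUTPUT>}} references alone. As a rough illustration, the following Python sketch reads those references from a dictionary form of the pipeline above and derives a valid run order with a topological sort. It is a sketch of the concept, not the service's scheduler: run_order is a hypothetical helper, and graphlib requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# matches references such as ${{jobs.hello_job.outputs.world_output}}
JOB_REF = re.compile(r"\$\{\{jobs\.([\w-]+)\.outputs\.[\w-]+\}\}")


def run_order(jobs: dict) -> list:
    """Order jobs so that each one runs after the jobs whose outputs it reads."""
    graph = {}
    for name, job in jobs.items():
        inputs = job.get("inputs") or {}
        # every job referenced by an input is a predecessor of this job
        graph[name] = {
            upstream
            for value in inputs.values()
            for upstream in JOB_REF.findall(str(value))
        }
    return list(TopologicalSorter(graph).static_order())


# dictionary form of the jobs section of the pipeline above
jobs = {
    "world_job": {
        "command": "cat ${{inputs.world_input}}/world.txt",
        "inputs": {"world_input": "${{jobs.hello_job.outputs.world_output}}"},
    },
    "hello_job": {
        "command": 'echo "hello" && echo "world" > ${{outputs.world_output}}/world.txt',
        "outputs": {"world_output": None},
    },
}

print(run_order(jobs))  # ['hello_job', 'world_job']
```

Jobs with no references between them have no ordering constraint, which is why the first pipeline in this section could run its two jobs in parallel.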
To avoid duplicating common settings across jobs in a pipeline, you can set them outside the jobs:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
compute: azureml:cpu-cluster
settings:
  environment:
    image: python:latest
jobs:
  hello_job:
    command: echo "hello"
  world_job:
    command: echo "world"
You can run this:
az ml job create -f jobs/basics/hello-pipeline-settings.yml --web
The corresponding setting on an individual job will override the common settings for a pipeline job. The concepts so far can be combined into a three-step pipeline job with jobs "A", "B", and "C". The "C" job has a data dependency on the "B" job, while the "A" job can run independently. The "A" job will also use an individually set environment and bind one of its inputs to a top-level pipeline job input:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
compute: azureml:cpu-cluster
settings:
  environment:
    image: python:latest
inputs:
  hello_string_top_level_input: "hello world"
jobs:
  A:
    command: echo hello ${{inputs.hello_string}}
    environment:
      image: python:3.9
    inputs:
      hello_string: ${{inputs.hello_string_top_level_input}}
  B:
    command: echo "world" >> ${{outputs.world_output}}/world.txt
    outputs:
      world_output:
  C:
    command: echo ${{inputs.world_input}}/world.txt
    inputs:
      world_input: ${{jobs.B.outputs.world_output}}
You can run this:
az ml job create -f jobs/basics/hello-pipeline-abc.yml --web
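The override rule described above, where a setting on an individual job takes precedence over the common pipeline setting, behaves like a dictionary merge in which the job's keys win. The following Python sketch shows that idea for jobs "A" and "B". It is an assumption-laden illustration: effective_settings is a hypothetical helper, and the shallow merge is a simplification rather than a description of how the service resolves settings.

```python
def effective_settings(pipeline_defaults: dict, job: dict) -> dict:
    """Combine common pipeline settings with one job's own settings.

    Keys set on the job replace the pipeline-level value (shallow merge,
    assumed here for illustration).
    """
    return {**pipeline_defaults, **job}


# common settings taken from the top of the pipeline job above
defaults = {
    "compute": "azureml:cpu-cluster",
    "environment": {"image": "python:latest"},
}

# job A sets its own environment, job B relies on the common settings
jobs = {
    "A": {
        "command": "echo hello ${{inputs.hello_string}}",
        "environment": {"image": "python:3.9"},
    },
    "B": {"command": 'echo "world" >> ${{outputs.world_output}}/world.txt'},
}

resolved = {name: effective_settings(defaults, job) for name, job in jobs.items()}
for name, settings in resolved.items():
    print(name, settings["environment"]["image"], settings["compute"])
```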
Train a model
At this point, a model still hasn't been trained. Let's add some sklearn code into a Python script with MLflow tracking to train a model on the Iris CSV:
# imports
import os
import mlflow
import argparse

import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


# define functions
def main(args):
    # enable auto logging
    mlflow.autolog()

    # setup parameters
    params = {
        "C": args.C,
        "kernel": args.kernel,
        "degree": args.degree,
        "gamma": args.gamma,
        "coef0": args.coef0,
        "shrinking": args.shrinking,
        "probability": args.probability,
        "tol": args.tol,
        "cache_size": args.cache_size,
        "class_weight": args.class_weight,
        "verbose": args.verbose,
        "max_iter": args.max_iter,
        "decision_function_shape": args.decision_function_shape,
        "break_ties": args.break_ties,
        "random_state": args.random_state,
    }

    # read in data
    df = pd.read_csv(args.iris_csv)

    # process data
    X_train, X_test, y_train, y_test = process_data(df, args.random_state)

    # train model
    model = train_model(params, X_train, X_test, y_train, y_test)


def process_data(df, random_state):
    # split dataframe into X and y
    X = df.drop(["species"], axis=1)
    y = df["species"]

    # train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # return split data
    return X_train, X_test, y_train, y_test


def train_model(params, X_train, X_test, y_train, y_test):
    # train model
    model = SVC(**params)
    model = model.fit(X_train, y_train)

    # return model
    return model


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--iris-csv", type=str)
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--kernel", type=str, default="rbf")
    parser.add_argument("--degree", type=int, default=3)
    parser.add_argument("--gamma", type=str, default="scale")
    parser.add_argument("--coef0", type=float, default=0)
    parser.add_argument("--shrinking", type=bool, default=False)
    parser.add_argument("--probability", type=bool, default=False)
    parser.add_argument("--tol", type=float, default=1e-3)
    parser.add_argument("--cache_size", type=float, default=1024)
    parser.add_argument("--class_weight", type=dict, default=None)
    parser.add_argument("--verbose", type=bool, default=False)
    parser.add_argument("--max_iter", type=int, default=-1)
    parser.add_argument("--decision_function_shape", type=str, default="ovr")
    parser.add_argument("--break_ties", type=bool, default=False)
    parser.add_argument("--random_state", type=int, default=42)

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)
The scikit-learn framework is supported by MLflow for autologging, so a single mlflow.autolog() call in the script will log all model parameters, training metrics, model artifacts, and some extra artifacts (in this case a confusion matrix image).
To run this in the cloud, specify it as a job:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code:
  local_path: src
command: >-
  python main.py
  --iris-csv ${{inputs.iris_csv}}
  --C ${{inputs.C}}
  --kernel ${{inputs.kernel}}
  --coef0 ${{inputs.coef0}}
inputs:
  iris_csv:
    file: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
  C: 0.8
  kernel: "rbf"
  coef0: 0.1
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9
compute: azureml:cpu-cluster
display_name: sklearn-iris-example
experiment_name: sklearn-iris-example
description: Train a scikit-learn SVM on the Iris dataset.
And run it:
az ml job create -f jobs/single-step/scikit-learn/iris/job.yml --web
To register a model, you can download the outputs and create a model from the local directory:
az ml job download -n $run_id
az ml model create -n sklearn-iris-example -l $run_id/model/
Sweep hyperparameters
You can modify the previous job to sweep over hyperparameters:
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  code:
    local_path: src
  command: >-
    python main.py
    --iris-csv ${{inputs.iris_csv}}
    --C ${{search_space.C}}
    --kernel ${{search_space.kernel}}
    --coef0 ${{search_space.coef0}}
  environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu:9
inputs:
  iris_csv:
    file: wasbs://datasets@azuremlexamples.blob.core.windows.net/iris.csv
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  C:
    type: uniform
    min_value: 0.5
    max_value: 0.9
  kernel:
    type: choice
    values: ["rbf", "linear", "poly"]
  coef0:
    type: uniform
    min_value: 0.1
    max_value: 1
objective:
  goal: maximize
  primary_metric: training_f1_score
limits:
  max_total_trials: 20
  max_concurrent_trials: 10
  timeout: 7200
display_name: sklearn-iris-sweep-example
experiment_name: sklearn-iris-sweep-example
description: Sweep hyperparameters for training a scikit-learn SVM on the Iris dataset.
And run it:
az ml job create -f jobs/single-step/scikit-learn/iris/job-sweep.yml --web
Tip
Check the "Child runs" tab in the studio to monitor progress and view parameter charts.
For more sweep options, see the sweep job YAML syntax reference.
Distributed training
Azure Machine Learning supports PyTorch, TensorFlow, and MPI-based distributed training. See the distributed section of the command job YAML syntax reference for details.
As an example, you can train a convolutional neural network (CNN) on the CIFAR-10 dataset using distributed PyTorch. The full script is available in the examples repository.
The CIFAR-10 dataset in torchvision expects as input a directory that contains the cifar-10-batches-py directory. You can download the compressed archive and extract it into a local directory:
mkdir data
wget "https://azuremlexamples.blob.core.windows.net/datasets/cifar-10-python.tar.gz"
tar -xvzf cifar-10-python.tar.gz -C data
Then create an Azure Machine Learning dataset from the local directory, which will be uploaded to the default datastore:
az ml dataset create --name cifar-10-example --version 1 --set local_path=data
Optionally, remove the local file and directory:
rm cifar-10-python.tar.gz
rm -r data
Datasets (File only) can be referred to in a job using the dataset key of a data input. The format is azureml:<DATASET_NAME>:<DATASET_VERSION>, so for the CIFAR-10 dataset just created, it is azureml:cifar-10-example:1.
With the dataset in place, you can author a distributed PyTorch job to train our model:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code:
  local_path: src
command: >-
  python train.py
  --epochs ${{inputs.epochs}}
  --learning-rate ${{inputs.learning_rate}}
  --data-dir ${{inputs.cifar}}
inputs:
  epochs: 1
  learning_rate: 0.2
  cifar:
    dataset: azureml:cifar-10-example:1
environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
compute: azureml:gpu-cluster
distribution:
  type: pytorch
  process_count_per_instance: 2
resources:
  instance_count: 2
display_name: pytorch-cifar-distributed-example
experiment_name: pytorch-cifar-distributed-example
description: Train a basic convolutional neural network (CNN) with PyTorch on the CIFAR-10 dataset, distributed via PyTorch.
And run it:
az ml job create -f jobs/single-step/pytorch/cifar-distributed/job.yml --web
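In the job above, instance_count: 2 and process_count_per_instance: 2 together give four training processes. The following Python sketch shows that arithmetic and how a training script might read its position in the group. It assumes the launcher provides the RANK, LOCAL_RANK, and WORLD_SIZE environment variables, which are the names used by PyTorch's environment-variable initialization; the helper names are hypothetical and this is not taken from the example's train.py.

```python
import os


def world_size(instance_count: int, process_count_per_instance: int) -> int:
    """Total number of training processes across all nodes."""
    return instance_count * process_count_per_instance


def read_process_info(env=os.environ) -> dict:
    """Read this process's position from the environment.

    Assumes RANK, LOCAL_RANK, and WORLD_SIZE are set by the launcher;
    falls back to single-process values when they are missing.
    """
    return {
        "rank": int(env.get("RANK", 0)),              # index across all processes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# values from the distribution and resources sections of the job above
print(world_size(instance_count=2, process_count_per_instance=2))  # 4
print(read_process_info())
```

When run outside a distributed launcher, read_process_info falls back to a single process, which keeps the same script usable for local debugging.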
Build a training pipeline
The CIFAR-10 example above translates well to a pipeline job. The previous job can be split into three jobs for orchestration in a pipeline:
- "get-data" to run a Bash script to download and extract cifar-10-batches-py
- "train-model" to take the data and train a model with distributed PyTorch
- "eval-model" to take the data and the trained model and evaluate accuracy
Both "train-model" and "eval-model" will have a dependency on the "get-data" job's output. Additionally, "eval-model" will have a dependency on the "train-model" job's output. Thus the three jobs will run sequentially.
You can orchestrate these three jobs within a pipeline job:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: cifar-10-pipeline-example
experiment_name: cifar-10-pipeline-example
jobs:
  get-data:
    command: bash main.sh ${{outputs.cifar}}
    code:
      local_path: src/get-data
    environment:
      image: python:latest
    compute: azureml:cpu-cluster
    outputs:
      cifar:
  train-model:
    command: >-
      python main.py
      --data-dir ${{inputs.cifar}}
      --epochs ${{inputs.epochs}}
      --model-dir ${{outputs.model_dir}}
    code:
      local_path: src/train-model
    inputs:
      epochs: 1
      cifar: ${{jobs.get-data.outputs.cifar}}
    outputs:
      model_dir:
    environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
    compute: azureml:gpu-cluster
    distribution:
      type: pytorch
      process_count_per_instance: 2
    resources:
      instance_count: 2
  eval-model:
    command: >-
      python main.py
      --data-dir ${{inputs.cifar}}
      --model-dir ${{inputs.model_dir}}/model
    code:
      local_path: src/eval-model
    environment: azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6
    compute: azureml:gpu-cluster
    distribution:
      type: pytorch
      process_count_per_instance: 2
    resources:
      instance_count: 1
    inputs:
      cifar: ${{jobs.get-data.outputs.cifar}}
      model_dir: ${{jobs.train-model.outputs.model_dir}}
And run:
az ml job create -f jobs/pipelines/cifar-10/job.yml --web
Pipelines can also be written using reusable components. For more, see Create and run components-based machine learning pipelines with the Azure Machine Learning CLI (Preview).