Start, monitor, and cancel training runs in Python

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

The Azure Machine Learning SDK for Python, Machine Learning CLI, and Azure Machine Learning studio provide various methods to monitor, organize, and manage your runs for training and experimentation.

This article shows examples of the following tasks:

  • Monitor run performance.
  • Cancel or fail runs.
  • Create child runs.
  • Tag and find runs.

Prerequisites

You'll need the following items:

Start a run and its logging process

Using the SDK

Set up your experiment by importing the Workspace, Experiment, Run, and ScriptRunConfig classes from the azureml.core package.

import azureml.core
from azureml.core import Workspace, Experiment, Run
from azureml.core import ScriptRunConfig

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="explore-runs")

Start a run and its logging process with the start_logging() method.

notebook_run = exp.start_logging()
notebook_run.log(name="message", value="Hello from run!")

Using the CLI

To start a run of your experiment, use the following steps:

  1. From a shell or command prompt, use the Azure CLI to authenticate to your Azure subscription:

    az login
    

    After logging in, you see a list of subscriptions associated with your Azure account. The subscription information with isDefault: true is the currently activated subscription for Azure CLI commands. This subscription must be the same one that contains your Azure Machine Learning workspace. You can find the subscription ID from the Azure portal by visiting the overview page for your workspace. You can also use the SDK to get the subscription ID from the workspace object. For example, Workspace.from_config().subscription_id.

    To select another subscription, use the az account set command with the subscription ID to switch to. For more information about subscription selection, see Use multiple Azure Subscriptions.

  2. Attach a workspace configuration to the folder that contains your training script. Replace myworkspace with your Azure Machine Learning workspace. Replace myresourcegroup with the Azure resource group that contains your workspace:

    az ml folder attach -w myworkspace -g myresourcegroup
    

    This command creates a .azureml subdirectory that contains example runconfig and conda environment files. It also contains a config.json file that is used to communicate with your Azure Machine Learning workspace.

    For more information, see az ml folder attach.

  3. To start the run, use the following command. Specify the name of the runconfig file (the file name without the .runconfig extension) with the -c parameter.

    az ml run submit-script -c sklearn -e testexperiment train.py
    

    Tip

    The az ml folder attach command created a .azureml subdirectory, which contains two example runconfig files.

    If you have a Python script that creates a run configuration object programmatically, you can use RunConfig.save() to save it as a runconfig file (a sketch follows at the end of this tip).

    For more example runconfig files, see https://github.com/MicrosoftDocs/pipelines-azureml/tree/master/.azureml.

    For more information, see az ml run submit-script.
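    For illustration, here is a minimal sketch of creating a run configuration in code and saving it as a runconfig file. The scikit-learn dependency and the sklearn file name are assumptions chosen to match the earlier submit command:

    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import RunConfiguration

    # build a run configuration programmatically
    run_config = RunConfiguration()
    run_config.environment.python.conda_dependencies = CondaDependencies.create(
        conda_packages=['scikit-learn'])

    # save it as a runconfig file (expected to land in .azureml/sklearn.runconfig),
    # so it can be referenced as `az ml run submit-script -c sklearn ...`
    run_config.save(name='sklearn')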

Using Azure Machine Learning studio

To submit a pipeline run in the designer (preview), use the following steps:

  1. Set a default compute target for your pipeline.

  2. Select Run at the top of the pipeline canvas.

  3. Select an Experiment to group your pipeline runs.

Monitor the status of a run

Using the SDK

Get the status of a run with the get_status() method.

print(notebook_run.get_status())

To get the run ID, execution time, and additional details about the run, use the get_details() method.

print(notebook_run.get_details())

When your run finishes successfully, use the complete() method to mark it as completed.

notebook_run.complete()
print(notebook_run.get_status())

If you use Python's with...as pattern, the run automatically marks itself as completed when it goes out of scope. You don't need to mark the run as completed manually.

with exp.start_logging() as notebook_run:
    notebook_run.log(name="message", value="Hello from run!")
    print(notebook_run.get_status())

print(notebook_run.get_status())

Using the CLI

  1. To view a list of runs for your experiment, use the following command. Replace experiment with the name of your experiment:

    az ml run list --experiment-name experiment
    

    This command returns a JSON document that lists information about runs for this experiment.

    For more information, see az ml run list.

  2. To view information on a specific run, use the following command. Replace runid with the ID of the run:

    az ml run show -r runid
    

    This command returns a JSON document that lists information about the run.

    For more information, see az ml run show.

Using Azure Machine Learning studio

To view the number of active runs for your experiment in the studio, use the following steps:

  1. Navigate to the Experiments section.

  2. Select an experiment.

    On the experiment page, you can see the number of active compute targets and the duration for each run.

  3. Select a specific run number.

  4. In the Logs tab, you can find diagnostic and error logs for your pipeline run.

Cancel or fail runs

If you notice a mistake or if your run is taking too long to finish, you can cancel the run.

Using the SDK

To cancel a run using the SDK, use the cancel() method:

run_config = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')
local_script_run = exp.submit(run_config)
print(local_script_run.get_status())

local_script_run.cancel()
print(local_script_run.get_status())
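
The hello_with_delay.py script isn't listed in this article. A hypothetical stand-in that just needs to run long enough to be canceled might look like this:

# hello_with_delay.py (hypothetical stand-in for the script referenced above)
import time

from azureml.core import Run

run = Run.get_context()                         # handle to the run this script executes in
run.log(name="message", value="Hello with delay!")
time.sleep(600)                                 # pause so there is time to cancel the run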

If your run finishes, but it contains an error (for example, the incorrect training script was used), you can use the fail() method to mark it as failed.

local_script_run = exp.submit(run_config)
local_script_run.fail()
print(local_script_run.get_status())

Using the CLI

To cancel a run using the CLI, use the following command. Replace runid with the ID of the run, workspace_name with your workspace name, and experiment_name with your experiment name:

az ml run cancel -r runid -w workspace_name -e experiment_name

For more information, see az ml run cancel.

Using Azure Machine Learning studio

To cancel a run in the studio, use the following steps:

  1. Go to the running pipeline in either the Experiments or Pipelines section.

  2. Select the pipeline run number you want to cancel.

  3. In the toolbar, select Cancel.

Create child runs

Create child runs to group together related runs, such as for different hyperparameter-tuning iterations.

Note

Child runs can only be created using the SDK.

This code example uses the hello_with_children.py script to create a batch of five child runs from within a submitted run by using the child_run() method:

!more hello_with_children.py
run_config = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_script_run = exp.submit(run_config)
local_script_run.wait_for_completion(show_output=True)
print(local_script_run.get_status())

with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)
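
The hello_with_children.py script itself isn't listed here. A hypothetical version that creates the child runs from inside the submitted run could look like this:

# hello_with_children.py (hypothetical sketch of the script referenced above)
from azureml.core import Run

run = Run.get_context()                  # the submitted parent run
for c in range(5):
    with run.child_run() as child:       # each child run completes when its block exits
        child.log(name="Hello from child run", value=c)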

Note

As they move out of scope, child runs are automatically marked as completed.

To create many child runs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.
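
The following is a minimal sketch of batching child-run creation with create_children(); the explicit complete() calls are an assumption, since runs created this way aren't wrapped in a with block:

with exp.start_logging() as parent_run:
    # create five child runs with a single network call
    children = parent_run.create_children(count=5)
    for i, child in enumerate(children):
        child.log(name="Hello from child run", value=i)
        child.complete()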

Submit child runs

Child runs can also be submitted from a parent run. This allows you to create hierarchies of parent and child runs, each running on different compute targets, connected by common parent run ID.

Use the submit_child() method to submit a child run from within a parent run. In the parent run script, get the run context and call submit_child() on the context instance.

## In parent run script
parent_run = Run.get_context()
child_run_config = ScriptRunConfig(source_directory='.', script='child_script.py')
parent_run.submit_child(child_run_config)

Within a child run, you can view the parent run ID:

## In child run script
child_run = Run.get_context()
child_run.parent.id

Query child runs

To query the child runs of a specific parent, use the get_children() method. The recursive = True argument allows you to query a nested tree of children and grandchildren.

print(parent_run.get_children())
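
For example, a short sketch that walks the full tree of descendants might look like this:

# recursive=True also returns grandchildren and deeper descendants
for child in parent_run.get_children(recursive=True):
    print(child.id, child.get_status())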

Tag and find runs

In Azure Machine Learning, you can use properties and tags to help organize and query your runs for important information.

Add properties and tags

Using the SDK

To add searchable metadata to your runs, use the add_properties() method. For example, the following code adds the "author" property to the run:

local_script_run.add_properties({"author":"azureml-user"})
print(local_script_run.get_properties())

Properties are immutable, so they create a permanent record for auditing purposes. The following code example results in an error, because we already added "azureml-user" as the "author" property value in the preceding code:

try:
    local_script_run.add_properties({"author":"different-user"})
except Exception as e:
    print(e)

Unlike properties, tags are mutable. To add searchable and meaningful information for consumers of your experiment, use the tag() method.

local_script_run.tag("quality", "great run")
print(local_script_run.get_tags())

local_script_run.tag("quality", "fantastic run")
print(local_script_run.get_tags())

You can also add simple string tags. When these tags appear in the tag dictionary as keys, they have a value of None.

local_script_run.tag("worth another look")
print(local_script_run.get_tags())

Using the CLI

Note

Using the CLI, you can only add or update tags.

To add or update a tag, use the following command:

az ml run update -r runid --add-tag quality='fantastic run'

For more information, see az ml run update.

Query properties and tags

You can query runs within an experiment to return a list of runs that match specific properties and tags.

Using the SDK

# runs with the 'author' property set to 'azureml-user' and the 'quality' tag set to 'fantastic run'
list(exp.get_runs(properties={"author":"azureml-user"}, tags={"quality":"fantastic run"}))
# runs with the 'author' property set to 'azureml-user' and the simple string tag 'worth another look'
list(exp.get_runs(properties={"author":"azureml-user"}, tags="worth another look"))

Using the CLI

The Azure CLI supports JMESPath queries, which can be used to filter runs based on properties and tags. To use a JMESPath query with the Azure CLI, specify it with the --query parameter. The following examples show basic queries using properties and tags:

# list runs where the author property = 'azureml-user'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user']"
# list runs where the tags contain a key that starts with 'worth another look'
az ml run list --experiment-name experiment --query "[?tags.keys(@)[?starts_with(@, 'worth another look')]]"
# list runs where the author property = 'azureml-user' and the 'quality' tag = 'fantastic run'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user' && tags.quality=='fantastic run']"

For more information on querying Azure CLI results, see Query Azure CLI command output.

Using Azure Machine Learning studio

  1. Navigate to the Pipelines section.

  2. Use the search bar to filter pipelines using tags, descriptions, experiment names, and submitter name.

Example notebooks

The following notebooks demonstrate the concepts in this article:

Next steps