Start, monitor, and cancel training runs in Python

The Azure Machine Learning SDK for Python, Machine Learning CLI, and Azure Machine Learning studio provide various methods to monitor, organize, and manage your runs for training and experimentation.

This article shows examples of the following tasks:

  • Monitor run performance.
  • Cancel or fail runs.
  • Create child runs.
  • Tag and find runs.

Prerequisites

You'll need the following items:

  • An Azure Machine Learning workspace.
  • The Azure Machine Learning SDK for Python installed.

Monitor run performance

  • Start a run and its logging process

    1. Set up your experiment by importing the Workspace, Experiment, Run, and ScriptRunConfig classes from the azureml.core package.

      import azureml.core
      from azureml.core import Workspace, Experiment, Run
      from azureml.core import ScriptRunConfig
      
      ws = Workspace.from_config()
      exp = Experiment(workspace=ws, name="explore-runs")
      
    2. Start a run and its logging process with the start_logging() method.

      notebook_run = exp.start_logging()
      notebook_run.log(name="message", value="Hello from run!")
      
  • Monitor the status of a run

    • Get the status of a run with the get_status() method.

      print(notebook_run.get_status())
      
    • To get the run ID, execution time, and additional details about the run, use the get_details() method.

      print(notebook_run.get_details())
      
    • When your run finishes successfully, use the complete() method to mark it as completed.

      notebook_run.complete()
      print(notebook_run.get_status())
      
    • If you use Python's with...as pattern, the run automatically marks itself as completed when the with block exits. You don't need to mark the run as completed manually.

      with exp.start_logging() as notebook_run:
          notebook_run.log(name="message", value="Hello from run!")
          print(notebook_run.get_status())
      
      print(notebook_run.get_status())
      

Cancel or fail runs

If you notice a mistake or if your run is taking too long to finish, you can cancel the run.

To cancel a run using the SDK, use the cancel() method:

src = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')
local_run = exp.submit(src)
print(local_run.get_status())

local_run.cancel()
print(local_run.get_status())

If your run finishes, but it contains an error (for example, the incorrect training script was used), you can use the fail() method to mark it as failed.

local_run = exp.submit(src)
local_run.fail()
print(local_run.get_status())

Create child runs

Create child runs to group together related runs, such as for different hyperparameter-tuning iterations.

Note

Child runs can only be created using the SDK.

This code example uses the hello_with_children.py script to create a batch of five child runs from within a submitted run by using the child_run() method:

# Show the script's contents (Jupyter shell magic)
!more hello_with_children.py
src = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_run = exp.submit(src)
local_run.wait_for_completion(show_output=True)
print(local_run.get_status())

You can also create child runs interactively, for example from a notebook session:

with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)

Note

As they move out of scope, child runs are automatically marked as completed.

Submit child runs

Child runs can also be submitted from a parent run. This allows you to create hierarchies of parent and child runs.

You may want your child runs to use a different run configuration than the parent run. For instance, you might use a less-powerful, CPU-based configuration for the parent while using GPU-based configurations for the children. Another common need is to pass each child different arguments and data. To customize a child run, create a ScriptRunConfig object for it. The following code:

  • Retrieves a compute resource named "gpu-cluster" from the workspace ws
  • Iterates over different argument values to pass to the children's ScriptRunConfig objects
  • Creates and submits a new child run that uses the custom compute resource and argument
  • Blocks until all of the child runs complete

# parent.py
# This script controls the launching of child scripts
from azureml.core import Run, ScriptRunConfig

compute_target = ws.compute_targets["gpu-cluster"]

run = Run.get_context()

child_args = ['Apple', 'Banana', 'Orange']
for arg in child_args: 
    run.log('Status', f'Launching {arg}')
    child_config = ScriptRunConfig(source_directory=".", script='child.py', arguments=['--fruit', arg], compute_target=compute_target)
    # Starts the run asynchronously
    run.submit_child(child_config)

# Experiment will "complete" successfully at this point. 
# Instead of returning immediately, block until child runs complete

for child in run.get_children():
    child.wait_for_completion()
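
The child script itself isn't shown in this article. As a minimal sketch, assuming only that child.py parses the --fruit argument shown above and logs it, it might look like this:

# child.py -- hypothetical sketch of the script launched above
import argparse

from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--fruit', type=str)
args = parser.parse_args()

# Get this child run's context and log the value the parent passed in
run = Run.get_context()
run.log('fruit', args.fruit)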

To create many child runs with identical configurations, arguments, and inputs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.
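
As a minimal sketch of batched creation, assuming parent_run is an active run (for example, one returned by exp.start_logging()):

# Create five child runs in a single service call
children = parent_run.create_children(count=5)

for i, child in enumerate(children):
    child.log(name="index", value=i)
    child.complete()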

Within a child run, you can view the parent run ID:

# In the child run's script, get the run context and read the parent run ID
child_run = Run.get_context()
print(child_run.parent.id)

Query child runs

To query the child runs of a specific parent, use the get_children() method. The recursive=True argument lets you query a nested tree of children and grandchildren.

print(list(parent_run.get_children()))
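
For example, assuming the parent has nested descendants, the following walks the full tree:

# Include grandchildren and deeper descendants in the query
all_descendants = list(parent_run.get_children(recursive=True))
print(len(all_descendants))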

Tag and find runs

In Azure Machine Learning, you can use properties and tags to help organize and query your runs for important information.

  • Add properties and tags

    To add searchable metadata to your runs, use the add_properties() method. For example, the following code adds the "author" property to the run:

    local_run.add_properties({"author":"azureml-user"})
    print(local_run.get_properties())
    

    Properties are immutable, so they create a permanent record for auditing purposes. The following code example results in an error, because we already added "azureml-user" as the "author" property value in the preceding code:

    try:
        local_run.add_properties({"author":"different-user"})
    except Exception as e:
        print(e)
    

    Unlike properties, tags are mutable. To add searchable and meaningful information for consumers of your experiment, use the tag() method.

    local_run.tag("quality", "great run")
    print(local_run.get_tags())
    
    local_run.tag("quality", "fantastic run")
    print(local_run.get_tags())
    

    You can also add simple string tags. When these tags appear in the tag dictionary as keys, they have a value of None.

    local_run.tag("worth another look")
    print(local_run.get_tags())
    
  • Query properties and tags

    You can query runs within an experiment to return a list of runs that match specific properties and tags.

    list(exp.get_runs(properties={"author": "azureml-user"}, tags={"quality": "fantastic run"}))
    list(exp.get_runs(properties={"author": "azureml-user"}, tags="worth another look"))
    

Example notebooks

The following notebooks demonstrate the concepts in this article:

Next steps