Configure and submit training runs

In this article, you learn how to configure and submit Azure Machine Learning runs to train your models.

When training, it is common to start on your local computer, and then later scale out to a cloud-based cluster. With Azure Machine Learning, you can run your script on various compute targets without having to change your training script.

All you need to do is define the environment for each compute target within a script run configuration. Then, when you want to run your training experiment on a different compute target, specify the run configuration for that compute.

Prerequisites

What's a script run configuration?

A ScriptRunConfig is used to configure the information necessary for submitting a training run as part of an experiment.

You submit your training experiment with a ScriptRunConfig object. This object includes the following:

  • source_directory: The source directory that contains your training script
  • script: The training script to run
  • compute_target: The compute target to run on
  • environment: The environment to use when running the script
  • Additional configurable options (see the reference documentation for more information)

Train your model

The code pattern to submit a training run is the same for all types of compute targets:

  1. Create an experiment to run
  2. Create an environment where the script will run
  3. Create a ScriptRunConfig, which specifies the compute target and environment
  4. Submit the run
  5. Wait for the run to complete

Create an experiment

Create an experiment in your workspace.

from azureml.core import Experiment

experiment_name = 'my_experiment'
experiment = Experiment(workspace=ws, name=experiment_name)

Select a compute target

Select the compute target where your training script will run. If no compute target is specified in the ScriptRunConfig, or if compute_target='local', Azure ML executes your script locally.

The example code in this article assumes that you have already created a compute target named my_compute_target, as described in the "Prerequisites" section.

Create an environment

Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, Docker image, environment variables, and software settings around your training and scoring scripts. They also specify runtimes (Python, Spark, or Docker).

You can either define your own environment or use an Azure ML curated environment. Curated environments are predefined environments that are available in your workspace by default. They are backed by cached Docker images, which reduces the run preparation cost. See Azure Machine Learning Curated Environments for the full list of available curated environments.

For a remote compute target, you can start with a popular curated environment such as AzureML-Minimal:

from azureml.core import Workspace, Environment

ws = Workspace.from_config()
myenv = Environment.get(workspace=ws, name="AzureML-Minimal")

For more information about environments, see Create & use software environments in Azure Machine Learning.
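If you would rather define your own environment than use a curated one, a minimal sketch might look like the following. The helper name, the environment name, and the package choices are illustrative assumptions, not recommendations from this article.

```python
def make_custom_environment(name="my-custom-env"):
    """Define an environment with explicitly listed packages.

    Hypothetical helper for illustration; requires the azureml-core package.
    """
    # Imported here so the function can be defined without the SDK installed.
    from azureml.core import Environment
    from azureml.core.conda_dependencies import CondaDependencies

    env = Environment(name=name)
    # The packages below are placeholders; list whatever your script imports.
    env.python.conda_dependencies = CondaDependencies.create(
        conda_packages=["scikit-learn"],
        pip_packages=["azureml-defaults"],
    )
    return env
```

The returned object can be passed as the environment of a ScriptRunConfig in the same way as a curated environment.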

Local compute target

If your compute target is your local machine, you are responsible for ensuring that all the necessary packages are available in the Python environment where the script runs. Use python.user_managed_dependencies to use your current Python environment (or the Python on the path you specify).

from azureml.core import Environment

myenv = Environment("user-managed-env")
myenv.python.user_managed_dependencies = True

# You can choose a specific Python environment by pointing to a Python path 
# myenv.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

Create the script run configuration

Now that you have a compute target (my_compute_target) and environment (myenv), create a script run configuration that runs your training script (train.py) located in your project_folder directory:

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=my_compute_target,
                      environment=myenv)

# Alternatively, set the compute target after creating the configuration.
# Skip this if you are running on your local computer.
src.run_config.target = my_compute_target

If you do not specify an environment, a default environment will be created for you.

If you have command-line arguments you want to pass to your training script, you can specify them via the arguments parameter of the ScriptRunConfig constructor, e.g. arguments=['--arg1', arg1_val, '--arg2', arg2_val].
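On the script side, these values arrive as ordinary command-line arguments. The following is a minimal sketch of how train.py might read them with argparse. It reuses the placeholder names --arg1 and --arg2 from the example above; the types and default values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that were passed through the ScriptRunConfig arguments list."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # --arg1 and --arg2 are the placeholder names used above; the types are assumed.
    parser.add_argument("--arg1", type=float, default=0.01)
    parser.add_argument("--arg2", type=int, default=10)
    # With argv=None, argparse reads the arguments from sys.argv.
    return parser.parse_args(argv)
```

Inside train.py, calling parse_args() with no parameters reads the values supplied at submission time.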

If you want to override the default maximum time allowed for the run, you can do so via the max_run_duration_seconds parameter. The system will attempt to automatically cancel the run if it takes longer than this value.
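As a rough sketch, both options can be supplied when the ScriptRunConfig is constructed. The helper name and the one-hour limit are illustrative assumptions; the other parameters correspond to project_folder, my_compute_target, and myenv from earlier in this article.

```python
def make_script_run_config(project_folder, compute_target, environment,
                           arg1_val, arg2_val, max_seconds=3600):
    """Build a ScriptRunConfig with script arguments and a run-time limit.

    Hypothetical helper for illustration; requires the azureml-core package.
    """
    # Imported here so the function can be defined without the SDK installed.
    from azureml.core import ScriptRunConfig

    return ScriptRunConfig(
        source_directory=project_folder,
        script='train.py',
        arguments=['--arg1', arg1_val, '--arg2', arg2_val],
        compute_target=compute_target,
        environment=environment,
        # The system attempts to cancel the run if it exceeds this many seconds.
        max_run_duration_seconds=max_seconds,
    )
```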

Specify a distributed job configuration

If you want to run a distributed training job, provide the distributed job-specific config to the distributed_job_config parameter. Supported config types include MpiConfiguration, TensorflowConfiguration, and PyTorchConfiguration.
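For instance, an MPI-based job could be configured roughly as follows. This is a sketch under assumptions: the helper name and the node and process counts are placeholders, and the compute target must have enough nodes to satisfy them.

```python
def make_mpi_run_config(project_folder, compute_target, environment,
                        node_count=2, process_count_per_node=1):
    """Build a ScriptRunConfig for a distributed MPI job.

    Hypothetical helper for illustration; requires the azureml-core package.
    """
    # Imported here so the function can be defined without the SDK installed.
    from azureml.core import ScriptRunConfig
    from azureml.core.runconfig import MpiConfiguration

    # The counts are illustrative; choose values that match your cluster.
    distr_config = MpiConfiguration(
        process_count_per_node=process_count_per_node,
        node_count=node_count,
    )
    return ScriptRunConfig(
        source_directory=project_folder,
        script='train.py',
        compute_target=compute_target,
        environment=environment,
        distributed_job_config=distr_config,
    )
```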

For more information and examples on running distributed Horovod, TensorFlow and PyTorch jobs, see:

Submit the experiment

run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)

Important

When you submit the training run, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the run again, only the changed files will be uploaded.

To prevent unnecessary files from being included in the snapshot, create an ignore file (.gitignore or .amlignore) in the directory, and add the files and directories to exclude to it. For more information on the syntax to use inside this file, see syntax and patterns for .gitignore. The .amlignore file uses the same syntax. If both files exist, the .amlignore file takes precedence.

For more information about snapshots, see Snapshots.
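The ignore file is a plain text file with one pattern per line, so it can be written by hand or generated. The following sketch generates one; the helper name and the example patterns are hypothetical.

```python
from pathlib import Path


def write_amlignore(source_directory, patterns):
    """Write a .amlignore file listing paths to leave out of the snapshot."""
    ignore_file = Path(source_directory) / ".amlignore"
    # One pattern per line, using the same syntax as .gitignore.
    ignore_file.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return ignore_file
```

For example, write_amlignore(project_folder, ["data/", "*.ckpt"]) would exclude a data directory and checkpoint files, assuming those names exist in your project.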

Important

Special folders: Two folders, outputs and logs, receive special treatment by Azure Machine Learning. During training, when you write files to folders named outputs and logs that are relative to the root directory (./outputs and ./logs, respectively), the files are automatically uploaded to your run history so that you have access to them once your run is finished.

To create artifacts during training (such as model files, checkpoints, data files, or plotted images) write these to the ./outputs folder.

Similarly, you can write any logs from your training run to the ./logs folder. To use Azure Machine Learning's TensorBoard integration, make sure you write your TensorBoard logs to this folder. While your run is in progress, you can launch TensorBoard and stream these logs. Later, you can also restore the logs from any of your previous runs.

For example, to download a file written to the outputs folder to your local machine after your remote training run:

run.download_file(name='outputs/my_output_file', output_file_path='my_destination_path')
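Within the training script itself, writing to these folders needs nothing beyond ordinary file I/O. A minimal sketch, assuming a hypothetical helper name and a hypothetical log file name (train.log):

```python
import os


def save_artifacts(result_text, log_text, root="."):
    """Write a result file to ./outputs and a log file to ./logs."""
    outputs_dir = os.path.join(root, "outputs")
    logs_dir = os.path.join(root, "logs")
    # Create the special folders if they do not exist yet.
    os.makedirs(outputs_dir, exist_ok=True)
    os.makedirs(logs_dir, exist_ok=True)

    output_path = os.path.join(outputs_dir, "my_output_file")
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(result_text)

    # train.log is a placeholder name for this example.
    log_path = os.path.join(logs_dir, "train.log")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(log_text)

    return output_path, log_path
```

The root parameter defaults to the current directory, which matches the ./outputs and ./logs locations described above when the script runs from the root of the source directory.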

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For more information, see Git integration for Azure Machine Learning.

Notebook examples

See these notebooks for examples of configuring runs for various training scenarios:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps