Set up and use compute targets for model training

With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as compute targets. A compute target can be a local machine or a cloud resource, such as Azure Machine Learning Compute, Azure HDInsight, or a remote virtual machine. You can also create compute targets for model deployment, as described in "Where and how to deploy your models".

You can create and manage a compute target by using the Azure Machine Learning SDK, the Azure portal, your workspace landing page (preview), the Azure CLI, or the Azure Machine Learning VS Code extension. If you have compute targets that were created through another service (for example, an HDInsight cluster), you can use them by attaching them to your Azure Machine Learning workspace.

In this article, you learn how to use various compute targets for model training. The steps for all compute targets follow the same workflow:

  1. Create a compute target if you don’t already have one.
  2. Attach the compute target to your workspace.
  3. Configure the compute target so that it contains the Python environment and package dependencies needed by your script.

Note

Code in this article was tested with Azure Machine Learning SDK version 1.0.39.
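
If you're not sure which SDK version is installed in your environment, you can print it:

import azureml.core

# Print the version of the installed Azure Machine Learning SDK
print(azureml.core.VERSION)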

Compute targets for training

Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with dev/experimentation on a small amount of data. At this stage, we recommend using a local environment, such as your local computer or a cloud-based VM. As you scale up your training on larger data sets, or do distributed training, we recommend using Azure Machine Learning Compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resource, although support for various scenarios varies, as detailed in the table below.

Compute targets can be reused from one training job to the next. For example, once you attach a remote VM to your workspace, you can reuse it for multiple jobs.

| Training targets | GPU support | Automated ML | ML pipelines | Visual interface |
| ----- | ----- | ----- | ----- | ----- |
| Local computer | maybe | yes | | |
| Azure Machine Learning Compute | yes | yes & hyperparameter tuning | yes | yes |
| Remote VM | yes | yes & hyperparameter tuning | yes | |
| Azure Databricks | | yes | yes | |
| Azure Data Lake Analytics | | | yes | |
| Azure HDInsight | | | yes | |
| Azure Batch | | | yes | |

Note

Azure Machine Learning Compute can be created as a persistent resource or created dynamically when you request a run. Run-based creation removes the compute target after the training run is complete, so you cannot reuse compute targets created this way.

What's a run configuration?

When training, it is common to start on your local computer, and later run that training script on a different compute target. With Azure Machine Learning, you can run your script on various compute targets without having to change your script.

All you need to do is define the environment for each compute target within a run configuration. Then, when you want to run your training experiment on a different compute target, specify the run configuration for that compute. For details of specifying an environment and binding it to a run configuration, see Create and manage environments for training and deployment.
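
As a minimal sketch of the idea (run_local and run_amlcompute stand in for run configurations like the ones defined later in this article), the same script can be submitted against either target:

from azureml.core import ScriptRunConfig

# One training script, two run configurations: only the run_config argument changes
src_local = ScriptRunConfig(source_directory='.', script='train.py', run_config=run_local)
src_cluster = ScriptRunConfig(source_directory='.', script='train.py', run_config=run_amlcompute)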

Learn more about submitting experiments at the end of this article.

What's an estimator?

To facilitate model training using popular frameworks, the Azure Machine Learning Python SDK provides an alternative higher-level abstraction, the estimator class. This class allows you to easily construct run configurations. You can create and use a generic Estimator to submit training scripts that use any learning framework you choose (such as scikit-learn).

For PyTorch, TensorFlow, and Chainer tasks, Azure Machine Learning also provides respective PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks.
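
As an illustration, a generic Estimator sketch might look like the following. The script name, the cpu_cluster compute target, and the exp experiment object are assumptions borrowed from examples elsewhere in this article:

from azureml.train.estimator import Estimator

# Package the training script, its conda dependencies, and the compute target
est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target=cpu_cluster,
                conda_packages=['scikit-learn'])

run = exp.submit(est)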

For more information, see Train ML Models with estimators.

What's an ML Pipeline?

With ML pipelines, you can optimize your workflow for simplicity, speed, portability, and reuse. When building pipelines with Azure Machine Learning, you can focus on machine learning, your area of expertise, rather than on infrastructure and automation.

ML pipelines are constructed from multiple steps, which are distinct computational units in the pipeline. Each step can run independently and use isolated compute resources. This allows multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources, and also makes it easy to use different compute types/sizes for each step.

Tip

ML pipelines can use run configurations or estimators when training models.

While ML pipelines can train models, they can also prepare data before training and deploy models after training. One of the primary use cases for pipelines is batch scoring. For more information, see Pipelines: Optimize machine learning workflows.
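
As a rough sketch of the step structure (the step names, scripts, and the cpu_cluster target are illustrative assumptions), a two-step pipeline might look like this:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Each step is a distinct computational unit that can use its own compute target
prep_step = PythonScriptStep(name='prepare data',
                             script_name='prep.py',
                             compute_target=cpu_cluster,
                             source_directory='.')
train_step = PythonScriptStep(name='train model',
                              script_name='train.py',
                              compute_target=cpu_cluster,
                              source_directory='.')

# Run the training step only after data preparation completes
train_step.run_after(prep_step)

pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = Experiment(ws, 'pipeline-demo').submit(pipeline)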

Set up in Python

Use the sections below to configure these compute targets:

  • Local computer
  • Azure Machine Learning Compute
  • Remote virtual machines
  • Azure HDInsight
  • Azure Batch

Local computer

  1. Create and attach: There's no need to create or attach a compute target to use your local computer as the training environment.

  2. Configure: When you use your local computer as a compute target, the training code is run in your development environment. If that environment already has the Python packages you need, use the user-managed environment.

from azureml.core.runconfig import RunConfiguration

# Edit a run configuration property on the fly.
run_local = RunConfiguration()

run_local.environment.python.user_managed_dependencies = True

Now that you’ve configured your run, the next step is to submit the training run.

Azure Machine Learning Compute

Azure Machine Learning Compute is a managed-compute infrastructure that allows the user to easily create a single or multi-node compute. The compute is created within your workspace region as a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and can be put in an Azure Virtual Network. The compute executes in a containerized environment and packages your model dependencies in a Docker container.

You can use Azure Machine Learning Compute to distribute the training process across a cluster of CPU or GPU compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.

Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.

You can create an Azure Machine Learning compute environment on demand when you schedule a run, or as a persistent resource.

Run-based creation

You can create Azure Machine Learning Compute as a compute target at run time. The compute is automatically created for your run. The compute is deleted automatically once the run completes.

Note

To specify the max number of nodes to use, you would normally set node_count to the number of nodes. There is currently (04/04/2019) a bug that prevents this from working. As a workaround, use the amlcompute._cluster_max_node_count property of the run configuration. For example, run_config.amlcompute._cluster_max_node_count = 5.

Important

Run-based creation of Azure Machine Learning compute is currently in Preview. Don't use run-based creation if you use automated hyperparameter tuning or automated machine learning. To use hyperparameter tuning or automated machine learning, create a persistent compute target instead.

  1. Create, attach, and configure: The run-based creation performs all the necessary steps to create, attach, and configure the compute target with the run configuration.
from azureml.core.compute import ComputeTarget, AmlCompute

# First, list the supported VM families for Azure Machine Learning Compute
print(AmlCompute.supported_vmsizes(workspace=ws))

from azureml.core.runconfig import RunConfiguration
# Create a new runconfig object
run_temp_compute = RunConfiguration()

# Signal that you want to use AmlCompute to execute the script
run_temp_compute.target = "amlcompute"

# AmlCompute is created in the same region as your workspace
# Set the VM size for AmlCompute from the list of supported_vmsizes
run_temp_compute.amlcompute.vm_size = 'STANDARD_D2_V2'

Now that you’ve attached the compute and configured your run, the next step is to submit the training run.

Persistent compute

A persistent Azure Machine Learning Compute can be reused across jobs. The compute can be shared with other users in the workspace and is kept between jobs.

  1. Create and attach: To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties. The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.

    • vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
    • max_nodes: The max number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.
    from azureml.core.compute import ComputeTarget, AmlCompute
    from azureml.core.compute_target import ComputeTargetException
    
    # Choose a name for your CPU cluster
    cpu_cluster_name = "cpucluster"
    
    # Verify that cluster does not exist already
    try:
        cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
        print('Found existing cluster, use it.')
    except ComputeTargetException:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                               max_nodes=4)
        cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
    cpu_cluster.wait_for_completion(show_output=True)
    

    You can also configure several advanced properties when you create Azure Machine Learning Compute. The properties allow you to create a persistent cluster of fixed size, or within an existing Azure Virtual Network in your subscription. See the AmlCompute class for details.

    Or you can create and attach a persistent Azure Machine Learning Compute resource in the Azure portal.

  2. Configure: Create a run configuration for the persistent compute target.

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import DEFAULT_CPU_IMAGE
    
    # Create a new runconfig object
    run_amlcompute = RunConfiguration()
    
    # Use the cpu_cluster you created above. 
    run_amlcompute.target = cpu_cluster
    
    # Enable Docker
    run_amlcompute.environment.docker.enabled = True
    
    # Set Docker base image to the default CPU-based image
    run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE
    
    # Use conda_dependencies.yml to create a conda environment in the Docker image for execution
    run_amlcompute.environment.python.user_managed_dependencies = False
    
    # Auto-prepare the Docker image when used for execution (if it is not already prepared)
    run_amlcompute.auto_prepare_environment = True
    
    # Specify CondaDependencies obj, add necessary packages
    run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])
    

Now that you’ve attached the compute and configured your run, the next step is to submit the training run.

Remote virtual machines

Azure Machine Learning also supports bringing your own compute resource and attaching it to your workspace. One such resource type is an arbitrary remote VM, as long as it's accessible from Azure Machine Learning. The resource can be an Azure VM, a remote server in your organization, or on-premises. Specifically, given the IP address and credentials (user name and password, or SSH key), you can use any accessible VM for remote runs.

You can use a system-built conda environment, an already existing Python environment, or a Docker container. To execute on a Docker container, you must have a Docker Engine running on the VM. This functionality is especially useful when you want a more flexible, cloud-based dev/experimentation environment than your local machine.

Use the Azure Data Science Virtual Machine (DSVM) as the Azure VM of choice for this scenario. This VM is a pre-configured data science and AI development environment in Azure. The VM offers a curated choice of tools and frameworks for full-lifecycle machine learning development. For more information on how to use the DSVM with Azure Machine Learning, see Configure a development environment.

  1. Create: Create a DSVM before using it to train your model. To create this resource, see Provision the Data Science Virtual Machine for Linux (Ubuntu).

    Warning

    Azure Machine Learning only supports virtual machines that run Ubuntu. When you create a VM or choose an existing VM, you must select a VM that uses Ubuntu.

  2. Attach: To attach an existing virtual machine as a compute target, you must provide the fully qualified domain name (FQDN), user name, and password for the virtual machine. In the example, replace <fqdn> with the public FQDN of the VM, or the public IP address. Replace <username> and <password> with the SSH user name and password for the VM.

    from azureml.core.compute import RemoteCompute, ComputeTarget
    
    # Create the compute config 
    compute_target_name = "attach-dsvm"
    attach_config = RemoteCompute.attach_configuration(address="<fqdn>",
                                                       ssh_port=22,
                                                       username='<username>',
                                                       password="<password>")
    
    # If you authenticate with SSH keys instead, use this code:
    # attach_config = RemoteCompute.attach_configuration(address="<fqdn>",
    #                                                    ssh_port=22,
    #                                                    username='<username>',
    #                                                    password=None,
    #                                                    private_key_file="<path-to-file>",
    #                                                    private_key_passphrase="<passphrase>")
    
    # Attach the compute
    compute = ComputeTarget.attach(ws, compute_target_name, attach_config)
    
    compute.wait_for_completion(show_output=True)
    

    Or you can attach the DSVM to your workspace using the Azure portal.

  3. Configure: Create a run configuration for the DSVM compute target. Docker and conda are used to create and configure the training environment on the DSVM.

    import azureml.core
    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    
    run_dsvm = RunConfiguration(framework = "python")
    
    # Set the compute target to the Linux DSVM
    run_dsvm.target = compute_target_name 
    
    # Use Docker in the remote VM
    run_dsvm.environment.docker.enabled = True
    
    # Use the CPU base image 
    # To use GPU in DSVM, you must also use the GPU base Docker image "azureml.core.runconfig.DEFAULT_GPU_IMAGE"
    run_dsvm.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE
    print('Base Docker image is:', run_dsvm.environment.docker.base_image)
    
    # Specify the CondaDependencies object
    run_dsvm.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])
    

Now that you’ve attached the compute and configured your run, the next step is to submit the training run.

Azure HDInsight

Azure HDInsight is a popular platform for big-data analytics. The platform provides Apache Spark, which can be used to train your model.

  1. Create: Create the HDInsight cluster before you use it to train your model. To create a Spark on HDInsight cluster, see Create a Spark Cluster in HDInsight.

    When you create the cluster, you must specify an SSH user name and password. Take note of these values, as you need them to use HDInsight as a compute target.

    After the cluster is created, connect to it with the hostname <clustername>-ssh.azurehdinsight.net, where <clustername> is the name that you provided for the cluster.

  2. Attach: To attach an HDInsight cluster as a compute target, you must provide the hostname, user name, and password for the HDInsight cluster. The following example uses the SDK to attach a cluster to your workspace. In the example, replace <clustername> with the name of your cluster. Replace <username> and <password> with the SSH user name and password for the cluster.

    from azureml.core.compute import ComputeTarget, HDInsightCompute
    from azureml.exceptions import ComputeTargetException
    
    try:
        # To connect using an SSH key instead of a username/password, provide the
        # private_key_file and private_key_passphrase parameters
        attach_config = HDInsightCompute.attach_configuration(address='<clustername>-ssh.azurehdinsight.net',
                                                              ssh_port=22,
                                                              username='<ssh-username>',
                                                              password='<ssh-pwd>')
        hdi_compute = ComputeTarget.attach(workspace=ws,
                                           name='myhdi',
                                           attach_configuration=attach_config)
    
    except ComputeTargetException as e:
        print("Caught = {}".format(e.message))
    
    hdi_compute.wait_for_completion(show_output=True)
    

    Or you can attach the HDInsight cluster to your workspace using the Azure portal.

  3. Configure: Create a run configuration for the HDI compute target.

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    
    
    # use pyspark framework
    run_hdi = RunConfiguration(framework="pyspark")
    
    # Set compute target to the HDI cluster
    run_hdi.target = hdi_compute.name
    
    # Specify a CondaDependencies object so the system installs numpy
    cd = CondaDependencies()
    cd.add_conda_package('numpy')
    run_hdi.environment.python.conda_dependencies = cd
    

Now that you’ve attached the compute and configured your run, the next step is to submit the training run.

Azure Batch

Azure Batch is used to run large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. AzureBatchStep can be used in an Azure Machine Learning Pipeline to submit jobs to an Azure Batch pool of machines.

To attach Azure Batch as a compute target, you must use the Azure Machine Learning SDK and provide the following information:

  • Azure Batch compute name: A friendly name for the compute within the workspace.
  • Azure Batch account name: The name of the Azure Batch account.
  • Resource group: The resource group that contains the Azure Batch account.

The following code demonstrates how to attach Azure Batch as a compute target:

from azureml.core.compute import ComputeTarget, BatchCompute
from azureml.exceptions import ComputeTargetException

# Name to associate with new compute in workspace
batch_compute_name = 'mybatchcompute'

# Batch account details needed to attach as compute to workspace
batch_account_name = "<batch_account_name>"  # Name of the Batch account
# Name of the resource group which contains this account
batch_resource_group = "<batch_resource_group>"

try:
    # check if the compute is already attached
    batch_compute = BatchCompute(ws, batch_compute_name)
except ComputeTargetException:
    print('Attaching Batch compute...')
    provisioning_config = BatchCompute.attach_configuration(
        resource_group=batch_resource_group, account_name=batch_account_name)
    batch_compute = ComputeTarget.attach(
        ws, batch_compute_name, provisioning_config)
    batch_compute.wait_for_completion()
    print("Provisioning state:{}".format(batch_compute.provisioning_state))
    print("Provisioning errors:{}".format(batch_compute.provisioning_errors))

print("Using Batch compute:{}".format(batch_compute.cluster_resource_id))

Set up in Azure portal

You can access the compute targets that are associated with your workspace in the Azure portal. You can use the portal to:

  • View compute targets
  • Create a compute target
  • Attach compute targets

After a target is created and attached to your workspace, you will use it in your run configuration with a ComputeTarget object:

from azureml.core.compute import ComputeTarget
myvm = ComputeTarget(workspace=ws, name='my-vm-name')

View compute targets

To see the compute targets for your workspace, use the following steps:

  1. Navigate to the Azure portal and open your workspace. You can also access these same steps in your workspace landing page (preview), although the images below show the Azure portal.

  2. Under Applications, select Compute.

    (Screenshot: the Compute tab)

Create a compute target

Follow the previous steps to view the list of compute targets. Then use these steps to create a compute target:

  1. Select the plus sign (+) to add a compute target.

    (Screenshot: adding a compute target)

  2. Enter a name for the compute target.

  3. Select Machine Learning Compute as the type of compute to use for Training.

    Note

    Azure Machine Learning Compute is the only managed-compute resource you can create in the Azure portal. All other compute resources can be attached after they are created.

  4. Fill out the form. Provide values for the required properties, especially the VM family and the maximum number of nodes to use to spin up the compute.

  5. Select Create.

  6. View the status of the create operation by selecting the compute target from the list:

    (Screenshot: selecting a compute target to view the status of the create operation)

  7. You then see the details for the compute target:

    (Screenshot: the compute target details)

Attach compute targets

To use compute targets created outside the Azure Machine Learning workspace, you must attach them. Attaching a compute target makes it available to your workspace.

Follow the steps described earlier to view the list of compute targets. Then use the following steps to attach a compute target:

  1. Select the plus sign (+) to add a compute target.

  2. Enter a name for the compute target.

  3. Select the type of compute to attach for Training:

    Important

    Not all compute types can be attached from the Azure portal. The compute types that can currently be attached for training include:

    • A remote VM
    • Azure Databricks (for use in machine learning pipelines)
    • Azure Data Lake Analytics (for use in machine learning pipelines)
    • Azure HDInsight
  4. Fill out the form and provide values for the required properties.

    Note

    Microsoft recommends that you use SSH keys, which are more secure than passwords. Passwords are vulnerable to brute-force attacks; SSH keys rely on cryptographic signatures. For information on how to create SSH keys for use with Azure virtual machines, see the Azure documentation on creating and using SSH key pairs.

  5. Select Attach.

  6. View the status of the attach operation by selecting the compute target from the list.

Set up with CLI

You can access the compute targets that are associated with your workspace using the CLI extension for Azure Machine Learning. You can use the CLI to:

  • Create a managed compute target
  • Update a managed compute target
  • Attach an unmanaged compute target

For more information, see Resource management.
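
For example, a managed compute target can be created with a command along these lines. The cluster name and sizes are illustrative; run az ml computetarget create amlcompute --help to confirm the flags in your extension version.

az ml computetarget create amlcompute -n cpu-cluster -s STANDARD_D2_V2 --max-nodes 4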

Set up with VS Code

You can access, create, and manage the compute targets that are associated with your workspace by using the VS Code extension for Azure Machine Learning.

Submit training run using Azure Machine Learning SDK

After you create a run configuration, you use it to run your experiment. The code pattern to submit a training run is the same for all types of compute targets:

  1. Create an experiment to run.
  2. Submit the run.
  3. Wait for the run to complete.

Important

When you submit the training run, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the run again, only the changed files will be uploaded.

To prevent files from being included in the snapshot, create a .gitignore or .amlignore file in the directory and add the files to it. The .amlignore file uses the same syntax and patterns as the .gitignore file. If both files exist, the .amlignore file takes precedence.
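
For example, a minimal .amlignore might look like this (the entries are illustrative):

# Exclude local data, logs, and notebook checkpoints from the snapshot
data/
logs/
.ipynb_checkpoints/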

For more information, see Snapshots.

Create an experiment

First, create an experiment in your workspace.

from azureml.core import Experiment
experiment_name = 'my_experiment'

exp = Experiment(workspace=ws, name=experiment_name)

Submit the experiment

Submit the experiment with a ScriptRunConfig object. This object includes the:

  • source_directory: The source directory that contains your training script.
  • script: The training script to run.
  • run_config: The run configuration, which in turn defines where the training will occur.

For example, to use the local target configuration:

from azureml.core import ScriptRunConfig
import os 

script_folder = os.getcwd()
src = ScriptRunConfig(source_directory=script_folder, script='train.py', run_config=run_local)
run = exp.submit(src)
run.wait_for_completion(show_output=True)

Switch the same experiment to run in a different compute target by using a different run configuration, such as the amlcompute target:

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder, script='train.py', run_config=run_amlcompute)
run = exp.submit(src)
run.wait_for_completion(show_output=True)

Tip

This example defaults to only using one node of the compute target for training. To use more than one node, set the node_count of the run configuration to the desired number of nodes. For example, the following code sets the number of nodes used for training to four:

src.run_config.node_count = 4

Or you can submit the run with an estimator or as part of an ML pipeline, as described earlier in this article.

For more information, see the ScriptRunConfig and RunConfiguration documentation.

Create run configuration and submit run using Azure Machine Learning CLI

You can use the Azure CLI and the Machine Learning CLI extension to create run configurations and submit runs on different compute targets. The following examples assume that you have an existing Azure Machine Learning workspace and that you have logged in to Azure using the az login CLI command.

Create run configuration

The simplest way to create a run configuration is to navigate to the folder that contains your machine learning Python scripts and use the CLI command:

az ml folder attach

This command creates a subfolder .azureml that contains template run configuration files for different compute targets. You can copy and edit these files to customize your configuration, for example to add Python packages or change Docker settings.
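
The generated layout looks roughly like the following. The file names are illustrative; the exact set of templates depends on the extension version.

.azureml/
    conda_dependencies.yml   # Conda environment definition used by the run configurations
    local.runconfig          # Template for runs on your local computer
    docker.runconfig         # Template for runs in a local Docker container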

Structure of run configuration file

The run configuration file is YAML-formatted, with the following sections (a sketch of such a file follows this list):

  • The script to run and its arguments.
  • The compute target name: either "local" or the name of a compute under the workspace.
  • Parameters for executing the run: framework, communicator for distributed runs, maximum duration, and number of compute nodes.
  • An environment section. See Create and manage environments for training and deployment for details of the fields in this section.
    • To specify Python packages to install for the run, create a conda environment file and set the condaDependenciesFile field.
  • Run history details to specify the log file folder, and to enable or disable output collection and run history snapshots.
  • Configuration details specific to the framework selected.
  • Data reference and data store details.
  • Configuration details specific to Machine Learning Compute, for creating a new cluster.
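
Here's a rough sketch of what a run configuration file can look like. Treat the field names as assumptions based on the template files that az ml folder attach generates, and compare with your own generated templates, which may differ by version.

# train.runconfig (illustrative sketch)
script: train.py            # The script to run
arguments: []               # Command-line arguments passed to the script
target: local               # "local" or the name of a compute under the workspace
framework: Python
communicator: None          # Communicator for distributed runs
nodeCount: 1                # Number of compute nodes
environment:
  python:
    userManagedDependencies: false
    condaDependenciesFile: .azureml/conda_dependencies.yml
  docker:
    enabled: false
history:
  outputCollection: true    # Enable or disable output collection
  snapshotProject: true     # Enable or disable run history snapshots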

Create an experiment

First, create an experiment for your runs:

az ml experiment create -n <experiment>

Script run

To submit a script run, execute a command like the following:

az ml run submit-script -e <experiment> -c <runconfig> my_train.py

HyperDrive run

You can use HyperDrive with the Azure CLI to perform hyperparameter tuning runs. First, create a HyperDrive configuration file in the following format. See the Tune hyperparameters for your model article for details on hyperparameter tuning parameters.

# hdconfig.yml
sampling: 
    type: random # Supported options: Random, Grid, Bayesian
    parameter_space: # specify a name|expression|values tuple for each parameter.
    - name: --penalty # The name of a script parameter to generate values for.
      expression: choice # supported options: choice, randint, uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, qlognormal
      values: [0.5, 1, 1.5] # The list of values, the number of values is dependent on the expression specified.
policy: 
    type: BanditPolicy # Supported options: BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy, NoTerminationPolicy
    evaluation_interval: 1 # Policy properties are policy specific. See the above link for policy specific parameter details.
    slack_factor: 0.2
primary_metric_name: Accuracy # The metric used when evaluating the policy
primary_metric_goal: Maximize # Maximize|Minimize
max_total_runs: 8 # The maximum number of runs to generate
max_concurrent_runs: 2 # The number of runs that can run concurrently.
max_duration_minutes: 100 # The maximum length of time to run the experiment before cancelling.

Add this file alongside the run configuration files. Then submit a HyperDrive run using:

az ml run submit-hyperdrive -e <experiment> -c <runconfig> --hyperdrive-configuration-name <hdconfig> my_train.py

Note the arguments section in the runconfig and the parameter space in the HyperDrive config. They contain the command-line arguments to be passed to the training script. The values in the runconfig stay the same for each iteration, while the range in the HyperDrive config is iterated over. Do not specify the same argument in both files.

For more details on these az ml CLI commands and the full set of arguments, see the reference documentation.

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For example, the current commit ID for the repository is logged as part of the history.
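
If you want to confirm what was recorded for a run, you can inspect its details from the SDK. This is a quick sketch; the exact Git-related property keys vary by SDK version, so treat them as assumptions.

# Inspect run metadata; Git information appears among the run properties
details = run.get_details()
print(details.get('properties', {}))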

Notebook examples

For examples of training with various compute targets, see the Azure Machine Learning sample notebooks.

Learn how to run notebooks by following the article, Use Jupyter notebooks to explore this service.

Next steps