PyTorch class

Definition

Represents an estimator for training in PyTorch experiments.

Supported versions: 1.0, 1.1, 1.2, 1.3

PyTorch(source_directory, *, compute_target=None, vm_size=None, vm_priority=None, entry_script=None, script_params=None, node_count=1, process_count_per_node=1, distributed_backend=None, distributed_training=None, use_gpu=False, use_docker=True, custom_docker_image=None, image_registry_details=None, user_managed=False, conda_packages=None, pip_packages=None, conda_dependencies_file_path=None, pip_requirements_file_path=None, conda_dependencies_file=None, pip_requirements_file=None, environment_variables=None, environment_definition=None, inputs=None, source_directory_data_store=None, shm_size=None, resume_from=None, max_run_duration_seconds=None, framework_version=None, _enable_optimized_mode=False, _disable_validation=False)
Inheritance
azureml.train.estimator._mml_base_estimator.MMLBaseEstimator
azureml.train.estimator._framework_base_estimator._FrameworkBaseEstimator
PyTorch

Parameters

source_directory
str

A local directory containing experiment configuration files.

compute_target
AbstractComputeTarget or str

The compute target where training will happen. This can either be an object or the string "local".

vm_size
str

The VM size of the compute target that will be created for the training. Supported values: Any Azure VM size.

vm_priority
str

The VM priority of the compute target that will be created for the training. If not specified, 'dedicated' is used.

Supported values: 'dedicated' and 'lowpriority'.

This takes effect only when the vm_size param is specified in the input.

entry_script
str

The relative path to the file containing the training script.

script_params
dict

A dictionary of command-line arguments to pass to the training script specified in entry_script.

node_count
int

The number of nodes in the compute target used for training. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.

process_count_per_node
int

The number of processes per node. If greater than 1, an MPI distributed job will be run. Only the AmlCompute target is supported for distributed jobs.

distributed_backend
str

The communication backend for distributed training.

DEPRECATED. Use the distributed_training parameter.

Supported values: 'mpi', 'gloo' and 'nccl'.

'mpi': MPI/Horovod

'gloo', 'nccl': Native PyTorch distributed training

This parameter is required when node_count or process_count_per_node > 1.

When node_count == 1 and process_count_per_node == 1, no backend will be used unless the backend is explicitly set. Only the AmlCompute target is supported for distributed training.

distributed_training
Mpi or azureml.train.dnn.Gloo or azureml.train.dnn.Nccl

Parameters for running a distributed training job.

To run a distributed job with the MPI backend, use an Mpi object to specify process_count_per_node. To run a distributed job with the Gloo backend, use Gloo. To run a distributed job with the NCCL backend, use Nccl.

use_gpu
bool

Specifies whether the environment to run the experiment should support GPUs. If true, a GPU-based default Docker image will be used in the environment. If false, a CPU-based image will be used. Default Docker images (CPU or GPU) will be used only if the custom_docker_image parameter is not set. This setting is used only in Docker-enabled compute targets.

use_docker
bool

Specifies whether the environment to run the experiment should be Docker-based.

custom_docker_image
str

The name of the Docker image from which the image to use for training will be built. If not set, a default CPU-based image will be used as the base image.

image_registry_details
ContainerRegistry

The details of the Docker image registry.

user_managed
bool

Specifies whether Azure ML reuses an existing Python environment. If false, Azure ML will create a Python environment based on the conda dependencies specification.

conda_packages
list

A list of strings representing conda packages to be added to the Python environment for the experiment.

pip_packages
list

A list of strings representing pip packages to be added to the Python environment for the experiment.

conda_dependencies_file_path
str

The relative path to the conda dependencies yaml file.

pip_requirements_file_path
str

The relative path to the pip requirements text file. This can be provided in combination with the pip_packages parameter.

conda_dependencies_file
str

The relative path to the conda dependencies yaml file.

pip_requirements_file
str

The relative path to the pip requirements text file. This can be provided in combination with the pip_packages parameter.

environment_variables
dict

A dictionary of environment variable names and values. These environment variables are set on the process where the user script is executed.
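Because these variables are set on the process that runs the training script, the script can read them with the standard library. The following is a minimal sketch of the script side only; the variable name DATA_MODE, its values, and the helper get_env_setting are hypothetical and are not part of the SDK.

```python
import os


def get_env_setting(name, default=None):
    """Return the value of an environment variable, or `default` if it is unset."""
    return os.environ.get(name, default)


# Simulate what the training process would see if the estimator had been
# created with environment_variables={"DATA_MODE": "full"} (hypothetical name).
os.environ["DATA_MODE"] = "full"

mode = get_env_setting("DATA_MODE", "sample")
print(f"Running with DATA_MODE={mode}")
```

Supplying a default keeps the script usable when it is run outside the estimator, where the variable may not be set.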

environment_definition
Environment

The environment definition for the experiment. It includes PythonSection, DockerSection, and environment variables. Any environment option not directly exposed through other parameters to the Estimator construction can be set using this parameter. If this parameter is specified, it will take precedence over other environment-related parameters like use_gpu, custom_docker_image, conda_packages, or pip_packages. Errors will be reported on invalid combinations of parameters.

inputs
list

A list of DataReference or DatasetConsumptionConfig objects to use as input.

source_directory_data_store
Datastore

The backing datastore for the project share.

shm_size
str

The size of the Docker container's shared memory block. If not set, the default azureml.core.environment._DEFAULT_SHM_SIZE is used. For more information, see Docker run reference.

resume_from
DataPath

The data path containing the checkpoint or model files from which to resume the experiment.

max_run_duration_seconds
int

The maximum allowed time for the run. Azure ML will attempt to automatically cancel the run if it takes longer than this value.

framework_version
str

The PyTorch version to be used for executing training code. PyTorch.get_supported_versions() returns a list of the versions supported by the current SDK.
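As an illustration of checking a requested version before constructing the estimator, the sketch below uses the versions listed on this page (1.0 through 1.3) and the documented DEFAULT_VERSION of '1.3'. The helper resolve_framework_version is hypothetical, and falling back to the default when no version is given is an assumption of this sketch rather than confirmed SDK behavior. In real code, the supported list would come from PyTorch.get_supported_versions() instead of a hard-coded tuple.

```python
# Versions listed on this page; in practice, prefer the list returned by
# PyTorch.get_supported_versions() for the installed SDK.
SUPPORTED_VERSIONS = ("1.0", "1.1", "1.2", "1.3")
DEFAULT_VERSION = "1.3"


def resolve_framework_version(requested=None, supported=SUPPORTED_VERSIONS):
    """Return the version to pass as framework_version.

    Assumption: when no version is requested, use DEFAULT_VERSION.
    """
    if requested is None:
        return DEFAULT_VERSION
    if requested not in supported:
        raise ValueError(
            f"Unsupported PyTorch version {requested!r}; "
            f"expected one of {', '.join(supported)}"
        )
    return requested


version = resolve_framework_version("1.2")
print(version)
```

Failing early with a clear message avoids submitting a run that cannot be scheduled because of a mistyped version string.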

Remarks

When submitting a training job, Azure ML runs your script in a conda environment within a Docker container. The PyTorch containers have the following dependencies installed.

Dependencies             PyTorch 1.0/1.1/1.2    PyTorch 1.3
Python                   3.6.2                  3.6.2
CUDA (GPU image only)    10.0                   10.0
cuDNN (GPU image only)   7.6.3                  7.6.3
NCCL (GPU image only)    2.4.8                  2.4.8
azureml-defaults         Latest                 Latest
OpenMpi                  3.1.2                  3.1.2
horovod                  0.16.1                 0.18.1
miniconda                4.5.11                 4.5.11
torch                    1.0/1.1/1.2            1.3
torchvision              0.2.1                  0.4.1
git                      2.7.4                  2.7.4
tensorboard              1.14                   1.14
future                   0.17.1                 0.17.1

The Docker images extend Ubuntu 16.04.

To install additional dependencies, use either the pip_packages or conda_packages parameter, or specify the pip_requirements_file or conda_dependencies_file parameter. Alternatively, you can build your own image and pass the custom_docker_image parameter to the estimator constructor.

For more information about Docker containers used in PyTorch training, see https://github.com/Azure/AzureML-Containers.

The following example shows how to use the PyTorch estimator to launch a PyTorch training job on a compute target.


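   from azureml.train.dnn import PyTorch

   # Command-line arguments passed to the entry script.
   script_params = {
       '--num_epochs': 30,
       '--output_dir': './outputs'
   }

   # project_folder and compute_target are assumed to be defined earlier,
   # for example a local source directory path and an AmlCompute target.
   estimator = PyTorch(source_directory=project_folder,
                       script_params=script_params,
                       compute_target=compute_target,
                       entry_script='pytorch_train.py',
                       use_gpu=True,
                       pip_packages=['pillow==5.4.1'])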

The full sample is available at https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/ml-frameworks/pytorch/deployment/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.ipynb
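The script_params dictionary in the example is passed to the entry script as command-line arguments. The sketch below shows how a hypothetical pytorch_train.py could parse them with argparse. It assumes each key and value arrives as a flag followed by its value; the function name parse_args and the default values are illustrative only.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that script_params supplies to the entry script."""
    parser = argparse.ArgumentParser(
        description="Hypothetical PyTorch training entry script")
    parser.add_argument("--num_epochs", type=int, default=30,
                        help="number of training epochs")
    parser.add_argument("--output_dir", type=str, default="./outputs",
                        help="directory for model files and logs")
    return parser.parse_args(argv)


# Simulate the command line that the example script_params would produce.
args = parse_args(["--num_epochs", "30", "--output_dir", "./outputs"])
print(f"Training for {args.num_epochs} epochs; writing to {args.output_dir}")
```

Declaring type=int for --num_epochs converts the value to an integer on the script side, since command-line arguments are always received as strings.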

The PyTorch estimator supports distributed training across CPU and GPU clusters using Horovod, an open-source all-reduce framework for distributed training. For examples and more information about using PyTorch in distributed training, see the tutorial Train and register PyTorch models at scale with Azure Machine Learning.
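A distributed MPI/Horovod job can be sketched by combining node_count with the distributed_training parameter described above. The helper names is_distributed and make_mpi_estimator are hypothetical, the node and process counts are arbitrary, and the example assumes that Mpi can be imported from azureml.train.dnn alongside PyTorch and that compute_target is an AmlCompute target. The SDK import is placed inside the function so that the small check above it works without the SDK installed.

```python
def is_distributed(node_count=1, process_count_per_node=1):
    """Return True when the job uses more than one node or more than one
    process per node, which is when a distributed backend is needed."""
    return node_count > 1 or process_count_per_node > 1


def make_mpi_estimator(source_directory, compute_target, entry_script,
                       node_count=2, process_count_per_node=2):
    """Build a PyTorch estimator for an MPI/Horovod job (sketch only)."""
    # Imported lazily; requires the Azure ML SDK to be installed.
    # Assumption: Mpi is exposed by azureml.train.dnn.
    from azureml.train.dnn import PyTorch, Mpi

    return PyTorch(
        source_directory=source_directory,
        compute_target=compute_target,  # distributed jobs need AmlCompute
        entry_script=entry_script,
        node_count=node_count,
        # The Mpi object carries process_count_per_node for the MPI backend.
        distributed_training=Mpi(process_count_per_node=process_count_per_node),
        use_gpu=True,
    )


print(is_distributed(node_count=2, process_count_per_node=2))
```

The same pattern would apply to the Gloo and Nccl configuration objects for native PyTorch distributed training, substituting the corresponding class for Mpi.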

Attributes

DEFAULT_VERSION

DEFAULT_VERSION = '1.3'

FRAMEWORK_NAME

FRAMEWORK_NAME = 'PyTorch'