Build a TensorFlow deep learning model at scale with Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

This article shows you how to run your TensorFlow training scripts at scale using Azure Machine Learning's TensorFlow estimator class. This example trains and registers a TensorFlow model to classify handwritten digits using a deep neural network (DNN).

Whether you're developing a TensorFlow model from the ground up or you're bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs to build, deploy, version, and monitor production-grade models.

Learn more about deep learning vs machine learning.

Prerequisites

Run this code on either of these environments:

  • Azure Machine Learning Notebook VM - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > tensorflow > deployment > train-hyperparameter-tune-deploy-with-tensorflow.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts.

Import packages

First, import the necessary Python libraries.

import os
import urllib
import shutil
import azureml

from azureml.core import Experiment
from azureml.core import Workspace, Run

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()

Create a deep learning experiment

Create an experiment and a folder to hold your training scripts. In this example, create an experiment called "tf-mnist".

script_folder = './tf-mnist'
os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='tf-mnist')

Create a file dataset

A FileDataset object references one or more files in your workspace datastore or at public URLs. The files can be of any format, and the class lets you download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you applied any transformations to the data set, they will be stored in the data set as well. The data remains in its existing location, so no extra storage cost is incurred. See the how-to guide on the Dataset package for more information.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path=web_paths)

Use the register() method to register the data set to your workspace so that it can be shared with others, reused across various experiments, and referred to by name in your training script.

dataset = dataset.register(workspace=ws,
                           name='mnist dataset',
                           description='training and test dataset',
                           create_new_version=True)

# list the files referenced by dataset
dataset.to_path()
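
After registration, the data set can also be retrieved by name in later sessions or other experiments. A short usage sketch:

from azureml.core.dataset import Dataset

# Retrieve the latest registered version of the dataset by name
mnist_dataset = Dataset.get_by_name(workspace=ws, name='mnist dataset')
print(mnist_dataset.to_path())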

Create a compute target

Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

For more information on compute targets, see the what is a compute target article.

Create a TensorFlow estimator

The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target.

The TensorFlow estimator is implemented through the generic estimator class, which can be used to support any framework. For more information about training models using the generic estimator, see Train models with Azure Machine Learning using Estimator.

If your training script needs additional pip or conda packages to run, you can have the packages installed on the resulting Docker image by passing their names through the pip_packages and conda_packages arguments.

from azureml.train.dnn import TensorFlow

script_params = {
    '--data-folder': dataset.as_named_input('mnist').as_mount(),
    '--batch-size': 50,
    '--first-layer-neurons': 300,
    '--second-layer-neurons': 100,
    '--learning-rate': 0.01
}

est = TensorFlow(source_directory=script_folder,
                 entry_script='tf_mnist.py',
                 script_params=script_params,
                 compute_target=compute_target,
                 use_gpu=True)
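
For example, the constructor above could be extended to install additional packages on the image. The package names below are purely illustrative and not required by this sample:

est = TensorFlow(source_directory=script_folder,
                 entry_script='tf_mnist.py',
                 script_params=script_params,
                 compute_target=compute_target,
                 use_gpu=True,
                 pip_packages=['scikit-learn'],     # hypothetical extra pip package
                 conda_packages=['matplotlib'])     # hypothetical extra conda package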

Tip

Support for TensorFlow 2.0 has been added to the TensorFlow estimator class. See the blog post for more information.

Submit a run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = exp.submit(est)
run.wait_for_completion(show_output=True)

As the Run is executed, it goes through the following stages:

  • Preparing: A Docker image is created according to the TensorFlow estimator. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress.

  • Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the entry_script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.
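
Once the run finishes, the same Run object can be used to inspect what was produced. A short usage sketch:

# Inspect the completed run: logged metrics and files captured in the run history
print(run.get_metrics())
print(run.get_file_names())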

Register or download a model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment. If you specify the parameters model_framework, model_framework_version, and resource_configuration, no-code model deployment becomes available: you can deploy your model directly as a web service from the registered model, and the ResourceConfiguration object defines the compute resource for the web service.

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='tf-dnn-mnist', 
                           model_path='outputs/model',
                           model_framework=Model.Framework.TENSORFLOW,
                           model_framework_version='1.13.0',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))

You can also download a local copy of the model by using the Run object. In the training script tf_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target), and the Run object can download a copy of it.

# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)
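
For reference, the saving side inside the training script looks roughly like the following. This is a minimal sketch assuming TensorFlow 1.x; the actual tf_mnist.py in the sample repository builds and trains the DNN before saving.

import os
import tensorflow as tf

# Placeholder graph; the real script builds and trains the DNN first
w = tf.Variable(tf.zeros([784, 10]), name='weights')

# Anything written under ./outputs is uploaded to the run history automatically,
# so saving the model there makes it available for registration or download later
os.makedirs('./outputs/model', exist_ok=True)
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, './outputs/model/mnist-tf.model')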

Distributed training

The TensorFlow estimator also supports distributed training across CPU and GPU clusters. You can easily run distributed TensorFlow jobs and Azure Machine Learning will manage the orchestration for you.

Azure Machine Learning supports two methods of distributed training in TensorFlow:

Horovod

Horovod is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.

To use Horovod, specify an MpiConfiguration object for the distributed_training parameter in the TensorFlow constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.

from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# Tensorflow constructor
estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='script.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_training=MpiConfiguration(),
                       framework_version='1.13',
                       use_gpu=True)
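
Inside the training script, Horovod is initialized and the optimizer is wrapped so that gradients are averaged across processes. A minimal sketch, assuming TensorFlow 1.x and the Horovod package that the MpiConfiguration-based run installs for you:

import horovod.tensorflow as hvd
import tensorflow as tf

# Initialize Horovod; each MPI process gets a rank
hvd.init()

# Scale the learning rate by the number of workers and wrap the optimizer
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 to the other processes
hooks = [hvd.BroadcastGlobalVariablesHook(0)]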

Parameter server

You can also run native distributed TensorFlow, which uses the parameter server model. In this method, you train across a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate the gradients.

To use the parameter server method, specify a TensorflowConfiguration object for the distributed_training parameter in the TensorFlow constructor.

from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow

distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 2

# Tensorflow constructor
estimator = TensorFlow(source_directory=project_folder,
                       compute_target=compute_target,
                       script_params=script_params,
                       entry_script='script.py',
                       node_count=2,
                       process_count_per_node=1,
                       distributed_training=distributed_training,
                       use_gpu=True)

# submit the TensorFlow job
run = exp.submit(estimator)

Define cluster specifications in TF_CONFIG

You also need the network addresses and ports of the cluster for the tf.train.ClusterSpec, so Azure Machine Learning sets the TF_CONFIG environment variable for you.

The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:

TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"],
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'

For TensorFlow's high-level tf.estimator API, TensorFlow parses the TF_CONFIG variable and builds the cluster spec for you.
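
With the high-level API, a tf.estimator.RunConfig object reads TF_CONFIG from the environment; a short sketch:

import tensorflow as tf

# RunConfig parses TF_CONFIG from the environment and exposes the cluster layout
config = tf.estimator.RunConfig()
print(config.cluster_spec, config.task_type, config.task_id)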

For TensorFlow's lower-level core APIs for training, parse the TF_CONFIG variable and build the tf.train.ClusterSpec in your training code.

import os, json
import tensorflow as tf

tf_config = os.environ.get('TF_CONFIG')
if not tf_config:
    raise ValueError('TF_CONFIG not found.')
tf_config_json = json.loads(tf_config)

# Build the ClusterSpec from the "cluster" section of TF_CONFIG
cluster_spec = tf.train.ClusterSpec(tf_config_json['cluster'])

Deployment

The model you just registered can be deployed the exact same way as any other registered model in Azure Machine Learning, regardless of which estimator you used for training. The deployment how-to contains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.

(Preview) No-code model deployment

Instead of the traditional deployment route, you can also use the no-code deployment feature (preview) for TensorFlow. By registering your model as shown above with the model_framework, model_framework_version, and resource_configuration parameters, you can simply use the deploy() static function to deploy your model.

service = Model.deploy(ws, "tensorflow-web-service", [model])
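
Model.deploy returns a web service object; you can wait for the deployment to complete and then read the scoring endpoint (a short usage sketch):

service.wait_for_deployment(show_output=True)
print(service.scoring_uri)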

The full how-to covers deployment in Azure Machine Learning in greater depth.

Next steps

In this article, you trained and registered a TensorFlow model, and learned about options for deployment. See these other articles to learn more about Azure Machine Learning.