
Build a TensorFlow deep learning model at scale with Azure Machine Learning

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

This article shows you how to run your TensorFlow training scripts at scale using Azure Machine Learning's TensorFlow estimator class. This example trains and registers a TensorFlow model to classify handwritten digits using a deep neural network (DNN).

Whether you're developing a TensorFlow model from the ground up or you're bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs to build, deploy, version, and monitor production-grade models.

Learn more about deep learning vs machine learning.

Prerequisites

Run this code on either of these environments:

  • Azure Machine Learning compute instance - no downloads or installation necessary

    • Complete the Tutorial: Setup environment and workspace to create a dedicated notebook server pre-loaded with the SDK and the sample repository.
    • In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: how-to-use-azureml > ml-frameworks > tensorflow > deployment > train-hyperparameter-tune-deploy-with-tensorflow.
  • Your own Jupyter Notebook server

    You can also find a completed Jupyter Notebook version of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning, model deployment, and notebook widgets.

Set up the experiment

This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts.

Import packages

First, import the necessary Python libraries.

import os
import urllib
import shutil
import azureml

from azureml.core import Experiment
from azureml.core import Workspace, Run

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.dnn import TensorFlow

Initialize a workspace

The Azure Machine Learning workspace is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a Workspace object.

Create a workspace object from the config.json file created in the prerequisites section.

ws = Workspace.from_config()

Create a deep learning experiment

Create an experiment and a folder to hold your training scripts. In this example, create an experiment called "tf-mnist".

script_folder = './tf-mnist'
os.makedirs(script_folder, exist_ok=True)

exp = Experiment(workspace=ws, name='tf-mnist')

Create a file dataset

A FileDataset object references one or multiple files in your workspace datastore or public URLs. The files can be of any format, and the class provides you with the ability to download or mount the files to your compute. By creating a FileDataset, you create a reference to the data source location. If you applied any transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. See the how-to guide on the Dataset package for more information.

from azureml.core.dataset import Dataset

web_paths = [
            'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
            'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
            ]
dataset = Dataset.File.from_files(path=web_paths)

Use the register() method to register the dataset to your workspace so it can be shared with others, reused across various experiments, and referred to by name in your training script.

dataset = dataset.register(workspace=ws,
                           name='mnist dataset',
                           description='training and test dataset',
                           create_new_version=True)

# list the files referenced by dataset
dataset.to_path()

Create a compute target

Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.

cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

For more information on compute targets, see the what is a compute target article.

Create a TensorFlow estimator

The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target.

The TensorFlow estimator is implemented through the generic estimator class, which can be used to support any framework. For more information about training models using the generic estimator, see train models with Azure Machine Learning using estimator.

If your training script needs additional pip or conda packages to run, you can have the packages installed on the resulting Docker image by passing their names through the pip_packages and conda_packages arguments.

script_params = {
    '--data-folder': dataset.as_named_input('mnist').as_mount(),
    '--batch-size': 50,
    '--first-layer-neurons': 300,
    '--second-layer-neurons': 100,
    '--learning-rate': 0.01
}

est = TensorFlow(source_directory=script_folder,
                 entry_script='tf_mnist.py',
                 script_params=script_params,
                 compute_target=compute_target,
                 use_gpu=True,
                 pip_packages=['azureml-dataprep[pandas,fuse]'])
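The script_params above are delivered to the entry script as ordinary command-line arguments. As a hypothetical sketch (the real tf_mnist.py may differ, and the /mnt/mnist path is only a placeholder), the entry script might parse them with argparse like this:

```python
import argparse

# Hypothetical argument parsing for an entry script; option names mirror script_params.
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str,
                    help='mount point of the MNIST dataset on the compute target')
parser.add_argument('--batch-size', type=int, default=50)
parser.add_argument('--first-layer-neurons', type=int, default=300)
parser.add_argument('--second-layer-neurons', type=int, default=100)
parser.add_argument('--learning-rate', type=float, default=0.01)

# Simulate the command line that would be built from script_params.
args = parser.parse_args(['--data-folder', '/mnt/mnist',
                          '--batch-size', '50',
                          '--first-layer-neurons', '300',
                          '--second-layer-neurons', '100',
                          '--learning-rate', '0.01'])
print(args.data_folder, args.batch_size, args.learning_rate)  # → /mnt/mnist 50 0.01
```

Note that argparse converts the dashes in option names to underscores on the resulting namespace (`--data-folder` becomes `args.data_folder`).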

Tip

Support for TensorFlow 2.0 has been added to the TensorFlow estimator class. See the blog post for more information.

For more information on customizing your Python environment, see Create and manage environments for training and deployment.

Submit a run

The Run object provides the interface to the run history while the job is running and after it has completed.

run = exp.submit(est)
run.wait_for_completion(show_output=True)

As the Run is executed, it goes through the following stages:

  • Preparing: A Docker image is created according to the TensorFlow estimator. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress.

  • Scaling: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.

  • Running: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the entry_script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

  • Post-Processing: The ./outputs folder of the run is copied over to the run history.

Register or download a model

Once you've trained the model, you can register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment. By specifying the parameters model_framework, model_framework_version, and resource_configuration, no-code model deployment becomes available. This allows you to directly deploy your model as a web service from the registered model, and the ResourceConfiguration object defines the compute resource for the web service.

from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = run.register_model(model_name='tf-dnn-mnist', 
                           model_path='outputs/model',
                           model_framework=Model.Framework.TENSORFLOW,
                           model_framework_version='1.13.0',
                           resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5))

You can also download a local copy of the model by using the Run object. In the training script tf_mnist.py, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy.

# Create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)
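The loop above keeps only files under outputs/model and flattens each one into the local ./model folder. The same name-to-path mapping can be sketched without a Run object (the file names below are made up for illustration):

```python
import os

# Hypothetical file names, as run.get_file_names() might return them.
remote_names = ['outputs/model/checkpoint',
                'outputs/model/mnist.model.meta',
                'azureml-logs/driver_log.txt']

# Keep only model files; flatten each into the local ./model folder.
local_paths = {f: os.path.join('./model', f.split('/')[-1])
               for f in remote_names if f.startswith('outputs/model')}
print(local_paths)
```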

Distributed training

The TensorFlow estimator also supports distributed training across CPU and GPU clusters. You can easily run distributed TensorFlow jobs and Azure Machine Learning will manage the orchestration for you.

Azure Machine Learning supports two methods of distributed training in TensorFlow:

Horovod

Horovod is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.

To use Horovod, specify an MpiConfiguration object for the distributed_training parameter in the TensorFlow constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.

from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import TensorFlow

# Tensorflow constructor
estimator= TensorFlow(source_directory=project_folder,
                      compute_target=compute_target,
                      script_params=script_params,
                      entry_script='script.py',
                      node_count=2,
                      process_count_per_node=1,
                      distributed_training=MpiConfiguration(),
                      framework_version='1.13',
                      use_gpu=True,
                      pip_packages=['azureml-dataprep[pandas,fuse]'])

Parameter server

You can also run native distributed TensorFlow, which uses the parameter server model. In this method, you train across a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate the gradients.

To use the parameter server method, specify a TensorflowConfiguration object for the distributed_training parameter in the TensorFlow constructor.

from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow

distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 2

# Tensorflow constructor
tf_est= TensorFlow(source_directory=project_folder,
                      compute_target=compute_target,
                      script_params=script_params,
                      entry_script='script.py',
                      node_count=2,
                      process_count_per_node=1,
                      distributed_training=distributed_training,
                      use_gpu=True,
                      pip_packages=['azureml-dataprep[pandas,fuse]'])

# submit the TensorFlow job
run = exp.submit(tf_est)

Define cluster specifications in `TF_CONFIG`

You also need the network addresses and ports of the cluster for the tf.train.ClusterSpec, so Azure Machine Learning sets the TF_CONFIG environment variable for you.

The TF_CONFIG environment variable is a JSON string. Here is an example of the variable for a parameter server:

TF_CONFIG='{
    "cluster": {
        "ps": ["host0:2222", "host1:2222"],
        "worker": ["host2:2222", "host3:2222", "host4:2222"]
    },
    "task": {"type": "ps", "index": 0},
    "environment": "cloud"
}'

For TensorFlow's high-level tf.estimator API, TensorFlow parses the TF_CONFIG variable and builds the cluster spec for you.

For TensorFlow's lower-level core APIs for training, parse the TF_CONFIG variable and build the tf.train.ClusterSpec in your training code.

import os, json
import tensorflow as tf

tf_config = os.environ.get('TF_CONFIG')
if not tf_config:
    raise ValueError("TF_CONFIG not found.")
tf_config_json = json.loads(tf_config)

# Build the ClusterSpec from the "cluster" section of TF_CONFIG
cluster_spec = tf.train.ClusterSpec(tf_config_json['cluster'])
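To experiment with this parsing locally, without a cluster (or TensorFlow installed), you can set a placeholder TF_CONFIG yourself and extract the pieces a ClusterSpec needs using the json module alone. All host names below are made up:

```python
import json
import os

# Placeholder TF_CONFIG mirroring the structure above; the hosts are not real.
os.environ['TF_CONFIG'] = json.dumps({
    "cluster": {
        "ps": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
    },
    "task": {"type": "worker", "index": 1},
    "environment": "cloud",
})

tf_config_json = json.loads(os.environ['TF_CONFIG'])
cluster = tf_config_json['cluster']           # job name -> list of host:port strings
task_type = tf_config_json['task']['type']    # this process's role: 'ps' or 'worker'
task_index = tf_config_json['task']['index']  # this process's index within its role
print(task_type, task_index, len(cluster['worker']))  # → worker 1 2
```

The `cluster` dictionary is exactly what `tf.train.ClusterSpec` accepts, and `task_type`/`task_index` tell each process which entry in that dictionary it is.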

Deploy a TensorFlow model

The model you just registered can be deployed the exact same way as any other registered model in Azure Machine Learning, regardless of which estimator you used for training. The deployment how-to contains a section on registering models, but you can skip directly to creating a compute target for deployment, since you already have a registered model.

(Preview) No-code model deployment

Instead of the traditional deployment route, you can also use the no-code deployment feature (preview) for TensorFlow. By registering your model as shown above with the model_framework, model_framework_version, and resource_configuration parameters, you can simply use the deploy() static function to deploy your model.

service = Model.deploy(ws, "tensorflow-web-service", [model])

The full how-to covers deployment in Azure Machine Learning in greater depth.

Next steps

In this article, you trained and registered a TensorFlow model, and learned about options for deployment. See these other articles to learn more about Azure Machine Learning.