
Set up and use compute targets for model training

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

With Azure Machine Learning, you can train your model on a variety of resources or environments, collectively referred to as compute targets. A compute target can be a local machine or a cloud resource, such as Azure Machine Learning Compute, Azure HDInsight, or a remote virtual machine. You can also create compute targets for model deployment, as described in "Where and how to deploy your models".

You can create and manage a compute target using the Azure Machine Learning SDK, Azure Machine Learning studio, the Azure CLI, or the Azure Machine Learning VS Code extension. If you have compute targets that were created through another service (for example, an HDInsight cluster), you can use them by attaching them to your Azure Machine Learning workspace.

In this article, you learn how to use various compute targets for model training. The steps for all compute targets follow the same workflow:

  1. Create a compute target if you don't already have one.
  2. Attach the compute target to your workspace.
  3. Configure the compute target so that it contains the Python environment and package dependencies needed by your script.

Note

Code in this article was tested with Azure Machine Learning SDK version 1.0.74.

Compute targets for training

Azure Machine Learning has varying support across different compute targets. A typical model development lifecycle starts with dev/experimentation on a small amount of data. At this stage, we recommend using a local environment, such as your local computer or a cloud-based VM. As you scale up your training on larger data sets, or do distributed training, we recommend using Azure Machine Learning Compute to create a single- or multi-node cluster that autoscales each time you submit a run. You can also attach your own compute resources, although support for the various scenarios varies as detailed below:

Compute targets can be reused from one training job to the next. For example, once you attach a remote VM to your workspace, you can reuse it for multiple jobs.

| Training targets | GPU support | Automated ML | ML pipelines | Azure Machine Learning designer |
| --- | --- | --- | --- | --- |
| Local computer | maybe | yes | | |
| Azure Machine Learning compute cluster | yes | yes & hyperparameter tuning | yes | yes |
| Remote VM | yes | yes & hyperparameter tuning | yes | |
| Azure Databricks | | yes | yes | |
| Azure Data Lake Analytics | | | yes | |
| Azure HDInsight | | | yes | |
| Azure Batch | | | yes | |

Note

Azure Machine Learning Compute can be created as a persistent resource or created dynamically when you request a run. Run-based creation removes the compute target after the training run is complete, so you cannot reuse compute targets created this way.

What's a run configuration?

When training, it is common to start on your local computer and later run the training script on a different compute target. With Azure Machine Learning, you can run your script on various compute targets without having to change it.

All you need to do is define the environment for each compute target within a run configuration. Then, when you want to run your training experiment on a different compute target, specify the run configuration for that compute. For details on specifying an environment and binding it to a run configuration, see Create and manage environments for training and deployment.
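
For example, the sketch below (not one of the article's tested samples) binds an environment defined in a conda specification file to a run configuration; the file name conda_dependencies.yml and the environment name are placeholder assumptions:

from azureml.core import Environment
from azureml.core.runconfig import RunConfiguration

# Define an environment from a conda specification file (placeholder file name)
env = Environment.from_conda_specification(name='my-env', file_path='conda_dependencies.yml')

# Bind the environment to a run configuration and choose a compute target by name
run_config = RunConfiguration()
run_config.environment = env
run_config.target = 'local'  # or the name of a compute target attached to the workspace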

Learn more about submitting experiments at the end of this article.

What's an estimator?

To facilitate model training using popular frameworks, the Azure Machine Learning Python SDK provides an alternative, higher-level abstraction: the estimator class. This class allows you to easily construct run configurations. You can create and use a generic Estimator to submit training scripts that use any learning framework you choose (such as scikit-learn).
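
As an illustration, here is a minimal sketch that submits a generic Estimator to the local computer; the experiment name and the train.py script are placeholders:

from azureml.core import Workspace, Experiment
from azureml.train.estimator import Estimator

# Placeholder workspace config, experiment name, and training script
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name='estimator-example')

est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target='local',
                conda_packages=['scikit-learn'])

run = exp.submit(est)
run.wait_for_completion(show_output=True)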

For PyTorch, TensorFlow, and Chainer tasks, Azure Machine Learning also provides the corresponding PyTorch, TensorFlow, and Chainer estimators to simplify using these frameworks.

For more information, see Train ML models with estimators.

What's an ML pipeline?

With ML pipelines, you can optimize your workflow with simplicity, speed, portability, and reuse. When building pipelines with Azure Machine Learning, you can focus on your expertise, machine learning, rather than on infrastructure and automation.

ML pipelines are constructed from multiple steps, which are distinct computational units in the pipeline. Each step can run independently and use isolated compute resources. This allows multiple data scientists to work on the same pipeline at the same time without over-taxing compute resources, and also makes it easy to use a different compute type or size for each step.

Tip

ML pipelines can use run configurations or estimators when training models.

While ML pipelines can train models, they can also prepare data before training and deploy models after training. One of the primary use cases for pipelines is batch scoring. For more information, see Pipelines: Optimize machine learning workflows.
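
The following rough sketch shows a one-step pipeline; it assumes a workspace object ws, an existing compute target such as the cpu_cluster created later in this article, and a placeholder train.py script:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# A single pipeline step that runs a placeholder training script on an existing compute target
train_step = PythonScriptStep(name='train step',
                              script_name='train.py',
                              source_directory='.',
                              compute_target=cpu_cluster)

pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline_run = Experiment(ws, 'pipeline-example').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)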

Set up in Python

Use the sections below to configure these compute targets:

Local computer

  1. Create and attach: There's no need to create or attach a compute target to use your local computer as the training environment.

  2. Configure: When you use your local computer as a compute target, the training code is run in your development environment. If that environment already has the Python packages you need, use the user-managed environment.

from azureml.core.runconfig import RunConfiguration

# Edit a run configuration property on the fly.
run_local = RunConfiguration()

run_local.environment.python.user_managed_dependencies = True

Now that you've attached the compute and configured your run, the next step is to submit the training run.

Azure Machine Learning Compute

Azure Machine Learning Compute is a managed-compute infrastructure that allows the user to easily create a single- or multi-node compute. The compute is created within your workspace region as a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and can be put in an Azure Virtual Network. The compute executes in a containerized environment and packages your model dependencies in a Docker container.

You can use Azure Machine Learning Compute to distribute the training process across a cluster of CPU or GPU compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.
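
For example, a GPU-backed cluster is provisioned in the same way as a CPU cluster, by choosing a GPU VM size; the sketch below assumes a workspace object ws and uses STANDARD_NC6 as one example size:

from azureml.core.compute import AmlCompute, ComputeTarget

# Provision a small GPU cluster by choosing a GPU VM size (example size only)
gpu_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=2)
gpu_cluster = ComputeTarget.create(ws, 'gpucluster', gpu_config)
gpu_cluster.wait_for_completion(show_output=True)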

Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.

You can create an Azure Machine Learning Compute environment on demand when you schedule a run, or as a persistent resource.

Run-based creation

You can create Azure Machine Learning Compute as a compute target at run time. The compute is automatically created for your run and is deleted automatically once the run completes.

Important

Run-based creation of Azure Machine Learning Compute is currently in preview. Don't use run-based creation if you use automated hyperparameter tuning or automated machine learning. To use hyperparameter tuning or automated machine learning, create a persistent compute target instead.

  1. Create, attach, and configure: Run-based creation performs all the necessary steps to create, attach, and configure the compute target with the run configuration.
from azureml.core.compute import ComputeTarget, AmlCompute

# First, list the supported VM families for Azure Machine Learning Compute
print(AmlCompute.supported_vmsizes(workspace=ws))

from azureml.core.runconfig import RunConfiguration
# Create a new runconfig object
run_temp_compute = RunConfiguration()

# Signal that you want to use AmlCompute to execute the script
run_temp_compute.target = "amlcompute"

# AmlCompute is created in the same region as your workspace
# Set the VM size for AmlCompute from the list of supported_vmsizes
run_temp_compute.amlcompute.vm_size = 'STANDARD_D2_V2'

Now that you've attached the compute and configured your run, the next step is to submit the training run.

Persistent compute

A persistent Azure Machine Learning Compute target can be reused across jobs. The compute can be shared with other users in the workspace and is kept between jobs.

  1. Create and attach: To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties. The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.

    • vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
    • max_nodes: The maximum number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.
    from azureml.core.compute import ComputeTarget, AmlCompute
    from azureml.core.compute_target import ComputeTargetException
    
    # Choose a name for your CPU cluster
    cpu_cluster_name = "cpucluster"
    
    # Verify that cluster does not exist already
    try:
        cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
        print('Found existing cluster, use it.')
    except ComputeTargetException:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                               max_nodes=4)
        cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    
    cpu_cluster.wait_for_completion(show_output=True)
    

    You can also configure several advanced properties when you create Azure Machine Learning Compute. These properties allow you to create a persistent cluster of fixed size, or one within an existing Azure Virtual Network in your subscription. See the AmlCompute class for details.

    Or you can create and attach a persistent Azure Machine Learning Compute resource in Azure Machine Learning studio.

  2. Configure: Create a run configuration for the persistent compute target.

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import DEFAULT_CPU_IMAGE
    
    # Create a new runconfig object
    run_amlcompute = RunConfiguration()
    
    # Use the cpu_cluster you created above. 
    run_amlcompute.target = cpu_cluster
    
    # Enable Docker
    run_amlcompute.environment.docker.enabled = True
    
    # Set Docker base image to the default CPU-based image
    run_amlcompute.environment.docker.base_image = DEFAULT_CPU_IMAGE
    
    # Use conda_dependencies.yml to create a conda environment in the Docker image for execution
    run_amlcompute.environment.python.user_managed_dependencies = False
    
    # Specify CondaDependencies obj, add necessary packages
    run_amlcompute.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])
    

Now that you've attached the compute and configured your run, the next step is to submit the training run.

Remote virtual machines

Azure Machine Learning also supports bringing your own compute resource and attaching it to your workspace. One such resource type is an arbitrary remote VM, as long as it's accessible from Azure Machine Learning. The resource can be an Azure VM, a remote server in your organization, or an on-premises machine. Specifically, given the IP address and credentials (user name and password, or SSH key), you can use any accessible VM for remote runs.

You can use a system-built conda environment, an already existing Python environment, or a Docker container. To execute on a Docker container, you must have a Docker Engine running on the VM. This functionality is especially useful when you want a more flexible, cloud-based dev/experimentation environment than your local machine.

Use the Azure Data Science Virtual Machine (DSVM) as the Azure VM of choice for this scenario. This VM is a pre-configured data science and AI development environment in Azure. It offers a curated choice of tools and frameworks for full-lifecycle machine learning development. For more information on how to use the DSVM with Azure Machine Learning, see Configure a development environment.

  1. Create: Create a DSVM before using it to train your model. To create this resource, see Provision the Data Science Virtual Machine for Linux (Ubuntu).

    Warning

    Azure Machine Learning only supports virtual machines that run Ubuntu. When you create a VM or choose an existing VM, you must select a VM that uses Ubuntu.

  2. Attach: To attach an existing virtual machine as a compute target, you must provide the fully qualified domain name (FQDN), user name, and password for the virtual machine. In the example, replace <fqdn> with the public FQDN of the VM, or the public IP address. Replace <username> and <password> with the SSH user name and password for the VM.

    from azureml.core.compute import RemoteCompute, ComputeTarget
    
    # Create the compute config 
    compute_target_name = "attach-dsvm"
    attach_config = RemoteCompute.attach_configuration(address = "<fqdn>",
                                                     ssh_port=22,
                                                     username='<username>',
                                                     password="<password>")
    
    # If you authenticate with SSH keys instead, use this code:
    #                                                  ssh_port=22,
    #                                                  username='<username>',
    #                                                  password=None,
    #                                                  private_key_file="<path-to-file>",
    #                                                  private_key_passphrase="<passphrase>")
    
    # Attach the compute
    compute = ComputeTarget.attach(ws, compute_target_name, attach_config)
    
    compute.wait_for_completion(show_output=True)
    

    Or you can attach the DSVM to your workspace using Azure Machine Learning studio.

  3. Configure: Create a run configuration for the DSVM compute target. Docker and conda are used to create and configure the training environment on the DSVM.

    import azureml.core
    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    
    run_dsvm = RunConfiguration(framework = "python")
    
    # Set the compute target to the Linux DSVM
    run_dsvm.target = compute_target_name 
    
    # Use Docker in the remote VM
    run_dsvm.environment.docker.enabled = True
    
    # Use the CPU base image 
    # To use GPU in DSVM, you must also use the GPU base Docker image "azureml.core.runconfig.DEFAULT_GPU_IMAGE"
    run_dsvm.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE
    print('Base Docker image is:', run_dsvm.environment.docker.base_image)
    
    # Specify the CondaDependencies object
    run_dsvm.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn'])
    

Now that you've attached the compute and configured your run, the next step is to submit the training run.

Azure HDInsight

Azure HDInsight is a popular platform for big-data analytics. The platform provides Apache Spark, which can be used to train your model.

  1. Create: Create the HDInsight cluster before you use it to train your model. To create a Spark on HDInsight cluster, see Create a Spark Cluster in HDInsight.

    When you create the cluster, you must specify an SSH user name and password. Take note of these values, as you need them to use HDInsight as a compute target.

    After the cluster is created, connect to it with the hostname <clustername>-ssh.azurehdinsight.net, where <clustername> is the name that you provided for the cluster.

  2. Attach: To attach an HDInsight cluster as a compute target, you must provide the hostname, user name, and password for the HDInsight cluster. The following example uses the SDK to attach a cluster to your workspace. In the example, replace <clustername> with the name of your cluster. Replace <username> and <password> with the SSH user name and password for the cluster.

    from azureml.core.compute import ComputeTarget, HDInsightCompute
    from azureml.exceptions import ComputeTargetException
    
    try:
        # If you want to connect using an SSH key instead of a username/password,
        # you can provide the parameters private_key_file and private_key_passphrase
        attach_config = HDInsightCompute.attach_configuration(address='<clustername>-ssh.azurehdinsight.net',
                                                              ssh_port=22,
                                                              username='<ssh-username>',
                                                              password='<ssh-pwd>')
        hdi_compute = ComputeTarget.attach(workspace=ws,
                                           name='myhdi',
                                           attach_configuration=attach_config)
    except ComputeTargetException as e:
        print("Caught = {}".format(e.message))

    hdi_compute.wait_for_completion(show_output=True)
    

    Or you can attach the HDInsight cluster to your workspace using Azure Machine Learning studio.

  3. Configure: Create a run configuration for the HDI compute target.

    from azureml.core.runconfig import RunConfiguration
    from azureml.core.conda_dependencies import CondaDependencies
    
    
    # use pyspark framework
    run_hdi = RunConfiguration(framework="pyspark")
    
    # Set compute target to the HDI cluster
    run_hdi.target = hdi_compute.name
    
    # Specify a CondaDependencies object to have the system install numpy
    cd = CondaDependencies()
    cd.add_conda_package('numpy')
    run_hdi.environment.python.conda_dependencies = cd
    

Now that you've attached the compute and configured your run, the next step is to submit the training run.

Azure Batch

Azure Batch is used to run large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. AzureBatchStep can be used in an Azure Machine Learning pipeline to submit jobs to an Azure Batch pool of machines.

To attach Azure Batch as a compute target, you must use the Azure Machine Learning SDK and provide the following information:

  • Azure Batch compute name: A friendly name to be used for the compute within the workspace
  • Azure Batch account name: The name of the Azure Batch account
  • Resource group: The resource group that contains the Azure Batch account

The following code demonstrates how to attach Azure Batch as a compute target:

from azureml.core.compute import ComputeTarget, BatchCompute
from azureml.exceptions import ComputeTargetException

# Name to associate with new compute in workspace
batch_compute_name = 'mybatchcompute'

# Batch account details needed to attach as compute to workspace
batch_account_name = "<batch_account_name>"  # Name of the Batch account
# Name of the resource group which contains this account
batch_resource_group = "<batch_resource_group>"

try:
    # check if the compute is already attached
    batch_compute = BatchCompute(ws, batch_compute_name)
except ComputeTargetException:
    print('Attaching Batch compute...')
    provisioning_config = BatchCompute.attach_configuration(
        resource_group=batch_resource_group, account_name=batch_account_name)
    batch_compute = ComputeTarget.attach(
        ws, batch_compute_name, provisioning_config)
    batch_compute.wait_for_completion()
    print("Provisioning state:{}".format(batch_compute.provisioning_state))
    print("Provisioning errors:{}".format(batch_compute.provisioning_errors))

print("Using Batch compute:{}".format(batch_compute.cluster_resource_id))
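
Once attached, the Batch compute is typically consumed from an ML pipeline through AzureBatchStep. The following is an illustrative sketch only, reusing ws and batch_compute from the code above; the pool ID, folder, and executable names are placeholders, and the exact step parameters should be checked against the AzureBatchStep reference:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import AzureBatchStep

# Placeholder pool, folder, and executable; the step submits a job to the attached Batch account
batch_step = AzureBatchStep(name='azure-batch-job',
                            pool_id='<pool_id>',
                            source_directory='batch_scripts',
                            executable='run.cmd',
                            compute_target=batch_compute)

pipeline = Pipeline(workspace=ws, steps=[batch_step])
pipeline_run = Experiment(ws, 'batch-pipeline-example').submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)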

Set up in Azure Machine Learning studio

You can access the compute targets that are associated with your workspace in Azure Machine Learning studio. You can use the studio to view, create, and attach compute targets, as described in the following sections.

After a target is created and attached to your workspace, you use it in your run configuration with a ComputeTarget object:

from azureml.core.compute import ComputeTarget
myvm = ComputeTarget(workspace=ws, name='my-vm-name')

View compute targets

To see the compute targets for your workspace, use the following steps:

  1. Navigate to Azure Machine Learning studio.

  2. Under Applications, select Compute.

    (Screenshot: the View compute tab)

Create a compute target

Follow the previous steps to view the list of compute targets. Then use these steps to create a compute target:

  1. Select the plus sign (+) to add a compute target.

    (Screenshot: add a compute target)

  2. Enter a name for the compute target.

  3. Select Machine Learning Compute as the type of compute to use for training.

    Note

    Azure Machine Learning Compute is the only managed-compute resource you can create in Azure Machine Learning studio. All other compute resources can be attached after they are created.

  4. Fill out the form. Provide values for the required properties, especially VM family, and the maximum nodes to use to spin up the compute.

  5. Select Create.

  6. View the status of the create operation by selecting the compute target from the list:

    (Screenshot: select a compute target to view the status of the create operation)

  7. You then see the details for the compute target:

    (Screenshot: compute target details)

Attach compute targets

To use compute targets created outside the Azure Machine Learning workspace, you must attach them. Attaching a compute target makes it available to your workspace.

Follow the steps described earlier to view the list of compute targets. Then use the following steps to attach a compute target:

  1. Select the plus sign (+) to add a compute target.

  2. Enter a name for the compute target.

  3. Select the type of compute to attach for training:

    Important

    Not all compute types can be attached from Azure Machine Learning studio. The compute types that can currently be attached for training include:

    • A remote VM
    • Azure Databricks (for use in machine learning pipelines)
    • Azure Data Lake Analytics (for use in machine learning pipelines)
    • Azure HDInsight
  4. Fill out the form and provide values for the required properties.

    Note

    Microsoft recommends that you use SSH keys, which are more secure than passwords. Passwords are vulnerable to brute force attacks. SSH keys rely on cryptographic signatures. For information on how to create SSH keys for use with Azure Virtual Machines, see the following documents:

  5. Select Attach.

  6. View the status of the attach operation by selecting the compute target from the list.

Set up with CLI

You can access the compute targets that are associated with your workspace using the CLI extension for Azure Machine Learning. You can use the CLI to:

  • Create a managed compute target
  • Update a managed compute target
  • Attach an unmanaged compute target

For more information, see Resource management.

Set up with VS Code

You can access, create, and manage the compute targets that are associated with your workspace using the VS Code extension for Azure Machine Learning.

Submit a training run using the Azure Machine Learning SDK

After you create a run configuration, you use it to run your experiment. The code pattern to submit a training run is the same for all types of compute targets:

  1. Create an experiment to run.
  2. Submit the run.
  3. Wait for the run to complete.

Important

When you submit the training run, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the run again, only the changed files will be uploaded.

To prevent files from being included in the snapshot, create a .gitignore or .amlignore file in the directory and add the files to it. The .amlignore file uses the same syntax and patterns as the .gitignore file. If both files exist, the .amlignore file takes precedence.

For more information, see Snapshots.

Create an experiment

First, create an experiment in your workspace.

from azureml.core import Experiment
experiment_name = 'my_experiment'

exp = Experiment(workspace=ws, name=experiment_name)

Submit the experiment

Submit the experiment with a ScriptRunConfig object. This object includes the:

  • source_directory: The source directory that contains your training script
  • script: The name of the training script to run
  • run_config: The run configuration, which in turn defines where the training will occur.

For example, to use the local target configuration:

from azureml.core import ScriptRunConfig
import os 

script_folder = os.getcwd()
src = ScriptRunConfig(source_directory = script_folder, script = 'train.py', run_config = run_local)
run = exp.submit(src)
run.wait_for_completion(show_output = True)

Switch the same experiment to run on a different compute target by using a different run configuration, such as the amlcompute target:

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory = script_folder, script = 'train.py', run_config = run_amlcompute)
run = exp.submit(src)
run.wait_for_completion(show_output = True)

Tip

This example defaults to only using one node of the compute target for training. To use more than one node, set the node_count of the run configuration to the desired number of nodes. For example, the following code sets the number of nodes used for training to four:

src.run_config.node_count = 4

Or you can:

For more information, see the ScriptRunConfig and RunConfiguration documentation.

Create a run configuration and submit a run using the Azure Machine Learning CLI

You can use the Azure CLI with the Machine Learning CLI extension to create run configurations and submit runs on different compute targets. The following examples assume that you have an existing Azure Machine Learning workspace and have logged in to Azure using the az login CLI command.

Create a run configuration

The simplest way to create a run configuration is to navigate to the folder that contains your machine learning Python scripts and use the CLI command:

az ml folder attach

This command creates a subfolder .azureml that contains template run configuration files for different compute targets. You can copy and edit these files to customize your configuration, for example to add Python packages or change Docker settings.

Structure of the run configuration file

The run configuration file is YAML-formatted, with the following sections:

  • The script to run and its arguments.
  • The compute target name, either "local" or the name of a compute target under the workspace.
  • Parameters for executing the run: the framework, the communicator for distributed runs, the maximum duration, and the number of compute nodes.
  • An environment section. See Create and manage environments for training and deployment for details of the fields in this section.
    • To specify the Python packages to install for the run, create a conda environment file and set the condaDependenciesFile field.
  • Run history details to specify the log file folder, and to enable or disable output collection and run history snapshots.
  • Configuration details specific to the selected framework.
  • Data reference and data store details.
  • Configuration details specific to Machine Learning Compute for creating a new cluster.

See the example JSON file for a full runconfig schema.

Create an experiment

First, create an experiment for your runs:

az ml experiment create -n <experiment>

Script run

To submit a script run, execute the command:

az ml run submit-script -e <experiment> -c <runconfig> my_train.py

HyperDrive run

You can use HyperDrive with the Azure CLI to perform parameter-tuning runs. First, create a HyperDrive configuration file in the following format. See the Tune hyperparameters for your model article for details on hyperparameter tuning parameters.

# hdconfig.yml
sampling: 
    type: random # Supported options: Random, Grid, Bayesian
    parameter_space: # specify a name|expression|values tuple for each parameter.
    - name: --penalty # The name of a script parameter to generate values for.
      expression: choice # supported options: choice, randint, uniform, quniform, loguniform, qloguniform, normal, qnormal, lognormal, qlognormal
      values: [0.5, 1, 1.5] # The list of values, the number of values is dependent on the expression specified.
policy: 
    type: BanditPolicy # Supported options: BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy, NoTerminationPolicy
    evaluation_interval: 1 # Policy properties are policy specific. See the above link for policy specific parameter details.
    slack_factor: 0.2
primary_metric_name: Accuracy # The metric used when evaluating the policy
primary_metric_goal: Maximize # Maximize|Minimize
max_total_runs: 8 # The maximum number of runs to generate
max_concurrent_runs: 2 # The number of runs that can run concurrently.
max_duration_minutes: 100 # The maximum length of time to run the experiment before cancelling.

Add this file alongside the run configuration files. Then submit a HyperDrive run using:

az ml run submit-hyperdrive -e <experiment> -c <runconfig> --hyperdrive-configuration-name <hdconfig> my_train.py

Note the arguments section in the runconfig and the parameter space in the HyperDrive config. They contain the command-line arguments to be passed to the training script. The values in the runconfig stay the same for each iteration, while the range in the HyperDrive config is iterated over. Do not specify the same argument in both files.

For more details on these az ml CLI commands and the full set of arguments, see the reference documentation.

Git tracking and integration

When you start a training run where the source directory is a local Git repository, information about the repository is stored in the run history. For more information, see Git integration for Azure Machine Learning.

Notebook examples

See these notebooks for examples of training with various compute targets:

Learn how to run notebooks by following the article Use Jupyter notebooks to explore this service.

Next steps