
Organize and set up Azure Machine Learning environments

When you plan an Azure Machine Learning deployment for an enterprise environment, some common decision points affect how you create the workspace:

  • Team structure: How your Machine Learning teams are organized and collaborate on projects, given use case and data segregation requirements or cost management requirements.

  • Environments: The environments used in your development and release workflow to segregate development from production.

  • Region: The location of your data and of the audience that your Machine Learning solution needs to serve.

Team structure and workspace setup

The workspace is the top-level resource in Azure Machine Learning. It stores the artifacts produced when you work with Machine Learning, along with the managed compute and pointers to attached and associated resources. From a manageability standpoint, the workspace is an Azure Resource Manager resource, so it supports Azure role-based access control (Azure RBAC) and management by Policy, and it can be used as a unit for cost reporting.

Organizations typically choose one or a combination of the following solution patterns to meet their manageability requirements.

Workspace per team: Use one workspace for each team when all members of a team require the same level of access to data and experimentation assets. For example, an organization with three machine learning teams might create three workspaces, one for each team.

The benefit of using one workspace per team is that all Machine Learning artifacts for the team's projects are stored in one place. Productivity can increase because team members can easily access, explore, and reuse experimentation results. Organizing your workspaces by team reduces your Azure footprint and simplifies cost management by team. Because the number of experimentation assets can grow quickly, keep your artifacts organized by following naming and tagging conventions. For recommendations about how to name resources, see Develop your naming and tagging strategy for Azure resources.

A consideration for this approach is that each team member must have similar data access permissions. Granular RBAC and access control lists (ACLs) for data sources and experimentation assets are limited within a workspace, so this pattern doesn't fit if you have use case data segregation requirements.

Workspace per project: Use one workspace for each project if you require segregation of data and experimentation assets by project, or if you have cost reporting and budgeting requirements at the project level. For example, an organization with four machine learning teams that each run three projects might create 12 workspace instances.

The benefit of using one workspace per project is that costs can be managed at the project level. For similar reasons, teams typically create a dedicated resource group for Azure Machine Learning and associated resources. When you work with external contributors, for example, a project-centered workspace simplifies collaboration because external users only need access to the project resources, not the team resources.

A consideration with this approach is the isolation of experimentation results and assets. Discovering and reusing assets might be more difficult because they are spread across multiple workspace instances.
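The difference in footprint between the workspace-per-team and workspace-per-project patterns can be sketched in a few lines of Python. This is only a planning illustration under stated assumptions, not an Azure API: the team and project names, the `mlw-` name prefix, and the tag keys are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspacePlan:
    """A planned workspace: a name plus tags that could drive cost reporting."""
    name: str
    tags: dict = field(default_factory=dict)


def plan_per_team(teams):
    """One workspace per team (hypothetical 'mlw-<team>' naming)."""
    return [WorkspacePlan(f"mlw-{team}", {"team": team}) for team in teams]


def plan_per_project(teams):
    """One workspace per project (hypothetical 'mlw-<team>-<project>' naming)."""
    return [
        WorkspacePlan(f"mlw-{team}-{project}", {"team": team, "project": project})
        for team, projects in teams.items()
        for project in projects
    ]


# Four hypothetical teams that each run three projects, as in the example above.
teams = {f"team{i}": [f"proj{j}" for j in range(1, 4)] for i in range(1, 5)}

per_team = plan_per_team(teams)
per_project = plan_per_project(teams)
print(len(per_team), len(per_project))  # 4 12
```

Carrying the team and project as tags is one way to keep cost reports groupable by either dimension, whichever pattern is chosen.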

Single workspace: Use one workspace for work that isn't tied to a team or project, or when costs can't be directly associated with a specific unit of billing, for example with R&D.

The benefit of this setup is that the cost of individual, non-project work can be decoupled from project-related costs. When you set up a single workspace for all users to do their individual work, you also reduce your Azure footprint.

A consideration for this approach is that the workspace might become cluttered quickly when many Machine Learning practitioners share the same instance. Users might need UI-based filtering of assets to find their resources effectively. You can create shared Machine Learning workspaces for each business division to mitigate scale concerns or to segment budgets.

Environments and workspace setup

An environment is a collection of resources that deployments target based on their stage in the application lifecycle. Common examples of environment names are Dev, Test, QA, Staging, and Production.

The development process in your organization affects the requirements for environment usage. Your environment affects the setup of Azure Machine Learning and associated resources, for example attached compute. Data availability, for instance, might constrain how manageable it is to have a Machine Learning instance available for each environment. The following solution patterns are common:

Single environment workspace deployment: When you choose a single environment workspace deployment, Azure Machine Learning is deployed to one environment. This setup is common for research-centered scenarios, where there is no need to release Machine Learning artifacts across environments based on their lifecycle stage. This setup also makes sense when only inferencing services, and not Machine Learning pipelines, are deployed across environments.

The benefit of a research-centered setup is a smaller Azure footprint and minimal management overhead. With this way of working, you don't need an Azure Machine Learning workspace deployed in each environment.

A consideration for this approach is that a single environment deployment is subject to data availability. Take care when you set up the Datastore. If you set up extensive access, for example writer access on production data sources, you might unintentionally harm data quality. If you bring work to production in the same environment where development is done, the same RBAC restrictions apply to both the development work and the production work. This setup might make both environments too rigid or too flexible.

Figure: Single environment deployment

Multiple environment workspace deployment: When you choose a multiple environment workspace deployment, a workspace instance is deployed for each environment. A common scenario for this setup is a regulated workplace with a clear separation of duties between environments, and for users who have resource access to those environments.

The benefits of this setup are:

  • Staged rollout of Machine Learning workflows and artifacts, for example models, across environments, with the potential to enhance agility and reduce time-to-deployment.

  • Enhanced security and control of resources, because you can assign more access restrictions in downstream environments.

  • Training scenarios on production data in non-development environments, because you can give a select group of users access.

A consideration for this approach is the risk of more management and process overhead, because this setup requires a fine-grained development and rollout process for Machine Learning artifacts across workspace instances. Additionally, data management and engineering effort might be required to make production data available for training in the development environment. Access management is required to give a team access to investigate and resolve incidents in production. Finally, your team needs Azure DevOps and Machine Learning engineering expertise to implement automation workflows.

Figure: Multiple environment deployment
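A staged rollout across workspace instances can be pictured as a gate that moves an artifact one environment at a time. The following is a minimal sketch, assuming the five example environment names listed earlier are used in that order; the `promote` function, the model name, and the `checks_passed` flag are hypothetical and stand in for whatever validation and release automation an organization actually uses.

```python
# Assumed promotion order, taken from the common environment names above.
STAGES = ["Dev", "Test", "QA", "Staging", "Production"]


def next_stage(current):
    """Return the environment that follows `current`, or None at the last stage."""
    index = STAGES.index(current)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def promote(model, current, checks_passed):
    """Move a model forward by exactly one stage, and only if its checks passed."""
    target = next_stage(current)
    if target is None:
        raise ValueError(f"{model} is already in {current}; nothing to promote to")
    # A failed check keeps the model where it is instead of skipping ahead.
    return target if checks_passed else current


print(promote("churn-model", "Dev", checks_passed=True))   # Test
print(promote("churn-model", "QA", checks_passed=False))   # QA
```

Allowing only single-step promotion is a design choice in this sketch: it keeps every downstream environment, with its stricter access restrictions, in the path to production.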

One environment with limited data access, one with production data access: When you choose this setup, Azure Machine Learning is deployed to two environments: one that has limited data access and one that has production data access. This setup is common if you need to segregate development and production environments. For example, you might be working under organizational constraints on making production data available in any environment, or you might want to segregate development work from production work without duplicating data more than required because of the high cost of maintenance.

The benefit of this setup is the clear separation of duties and access between development and production environments. Another benefit is lower resource management overhead compared to a multi-environment deployment scenario.

A consideration for this approach is that it requires a defined development and rollout process for Machine Learning artifacts across workspaces. Another consideration is that data management and engineering effort might be required to make production data available for training in a development environment. However, this might require relatively less effort than a multi-environment workspace deployment.

Figure: One environment with limited data access, one environment with production data access

Regions and resource setup

The location of your resources, data, or users might require you to create Azure Machine Learning workspace instances and associated resources in multiple Azure regions. For example, one project might span its resources across the West Europe and East US Azure regions for performance, cost, and compliance reasons. The following scenarios are common:

Regional training: The machine learning training jobs run in the same Azure region where the data is located. In this setup, a Machine Learning workspace is deployed to each Azure region where data is located. This scenario is common when you operate under compliance requirements or when you have constraints on moving data across regions.

The benefit of this setup is that experimentation can be done in the data center where the data is located, with the least network latency. A consideration for this approach is that running a Machine Learning pipeline across multiple workspace instances adds management complexity. It becomes challenging to compare experimentation results across instances, and quota and compute management incur more overhead.

If you want to attach storage across regions but use compute from one region, Azure Machine Learning supports attaching storage accounts in a region other than the workspace's region. Metadata, for example metrics, is stored in the workspace region.

Figure: Regional training
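The regional training rule, run the job in the workspace that sits in the same region as the data, amounts to a lookup. The sketch below assumes one workspace per region; the workspace names and dataset names are hypothetical, and only the region identifiers (`westeurope`, `eastus`) correspond to the regions mentioned above.

```python
# Hypothetical workspaces, one per Azure region where data is located.
WORKSPACE_BY_REGION = {
    "westeurope": "mlw-train-westeurope",
    "eastus": "mlw-train-eastus",
}

# Hypothetical datasets and the regions where they are stored.
DATASET_REGION = {
    "sales-eu": "westeurope",
    "sales-us": "eastus",
    "sales-apac": "australiaeast",
}


def workspace_for(dataset):
    """Pick the workspace deployed in the same region as the dataset."""
    region = DATASET_REGION[dataset]
    if region not in WORKSPACE_BY_REGION:
        # Fail loudly rather than silently training in another region.
        raise LookupError(f"no workspace deployed in {region!r} for {dataset!r}")
    return WORKSPACE_BY_REGION[region]


print(workspace_for("sales-eu"))  # mlw-train-westeurope
```

Raising an error for a region without a workspace, instead of falling back to another region, reflects the compliance and data movement constraints that motivate this setup.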

Regional serving: Machine Learning services are deployed close to where the target audience lives. For example, if target users are in Australia and the main storage and experimentation region is West Europe, deploy the Machine Learning workspace for experimentation in West Europe, and deploy an AKS cluster for inference endpoint deployment in Australia.

The benefits of this setup are the opportunity to run inferencing in the data center where new data is ingested, which minimizes latency and data movement, and compliance with local regulations.

A consideration for this approach is that although a multi-region setup provides several advantages, it also adds overhead to quota and compute management. When batch inferencing is required, regional serving might require a multi-workspace deployment. Data collected through inferencing endpoints might need to be transferred across regions for retraining scenarios.

Figure: Regional serving

Regional fine-tuning: A base model is trained on an initial dataset, for example public data or data from all regions, and is later fine-tuned with a regional dataset. The regional dataset might exist only in a particular region because of compliance or data movement constraints. For example, base model training might be done in a workspace in region A, while fine-tuning might be done in a workspace in region B.

The benefit of this setup is that experimentation stays compliant with the data center where the data resides, while still taking advantage of base model training on a larger dataset in an earlier pipeline stage.

A consideration is that although this approach enables complex experimentation pipelines, it might create more challenges, for example comparing experiment results across regions and adding overhead to quota and compute management.

Figure: Regional fine-tuning

Reference implementation

To illustrate the deployment of Azure Machine Learning in a larger setting, this section outlines how the organization 'Contoso' has set up Azure Machine Learning given its organizational constraints and its reporting and budgeting requirements:

  • Contoso creates resource groups on a solution basis for cost management and reporting reasons.

  • IT administrators create resource groups and resources only for funded solutions, to meet budget requirements.

  • Because of the explorative and uncertain nature of data science, users need a place to experiment and work on use case and data exploration. Explorative work often can't be directly associated with a particular use case and can be associated only with the R&D budget. Contoso wants to centrally fund some Machine Learning resources that anyone can use for exploration.

  • Once a Machine Learning use case proves successful in the explorative environment, teams can request resource groups, for example Dev, QA, and Prod, for iterative experimentation project work, and access to production data sources can be set up.

  • Data segregation and compliance requirements don't allow live production data to exist in development environments.

  • IT policy sets different RBAC requirements for various user groups per environment, for example access is more restrictive in production.

  • All data, experimentation, and inferencing are done in a single Azure region.

To adhere to these requirements, Contoso has set up its resources in the following way:

  • Azure Machine Learning workspaces and resource groups are scoped per project to follow budgeting and use case segregation requirements.
  • A multiple-environment setup for Azure Machine Learning and associated resources addresses cost management, RBAC, and data access requirements.
  • A single resource group and Machine Learning workspace are dedicated to exploration.
  • Azure Active Directory groups differ per user role and environment. For example, the operations that a data scientist can do in a production environment differ from those in the development environment, and access levels might differ per solution.
  • All resources are created in a single Azure region.

Figure: Contoso reference implementation
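One way to read the Contoso setup is as a small resource inventory: a resource group and workspace per project and environment, one shared exploration workspace, and an Azure Active Directory group per role and environment. The sketch below is an interpretation under those assumptions. The project names, role names, and the `rg-`, `mlw-`, and `aad-` naming prefixes are hypothetical, and nothing here calls Azure.

```python
# Environments named in the reference requirements (Dev, QA, and Prod).
ENVIRONMENTS = ["dev", "qa", "prod"]

# Hypothetical user roles; the source only says groups differ per role.
ROLES = ["data-scientist", "ml-engineer"]


def contoso_layout(projects, region):
    """Build an inventory of resource groups, workspaces, and AAD groups."""
    layout = {
        # Single, centrally funded exploration resource group and workspace.
        "exploration": {
            "resource_group": "rg-ml-exploration",
            "workspace": "mlw-exploration",
            "region": region,
        }
    }
    for project in projects:
        for env in ENVIRONMENTS:
            layout[f"{project}-{env}"] = {
                "resource_group": f"rg-{project}-{env}",
                "workspace": f"mlw-{project}-{env}",
                "region": region,  # all resources share one Azure region
                "aad_groups": [f"aad-{project}-{role}-{env}" for role in ROLES],
            }
    return layout


layout = contoso_layout(["forecasting", "churn"], region="westeurope")
print(len(layout))  # 7: one exploration entry plus 2 projects x 3 environments
```

Keeping the group names specific to project, role, and environment makes it possible to grant a data scientist different permissions in production than in development, as the requirements describe.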