您现在访问的是微软AZURE全球版技术文档网站,若需要访问由世纪互联运营的MICROSOFT AZURE中国区技术文档网站,请访问 https://docs.azure.cn.

如何处理机器学习操作How to approach machine learning operations

机器学习操作包含有关如何以可缩放的方式组织和标准化计算机模型开发、部署和维护的原则和最佳实践。Machine learning operations consist of principles and best practices about how to organize and standardize machine model development, deployment, and maintenance in a scalable way.


下面概述了机器学习系统开发方式的主要组件:The main components of how a machine learning system develops are outlined below:


_Sculley et al. 2015。_Sculley et al. 2015. 机器学习系统中隐藏的技术债务。Hidden technical debt in machine learning systems. 神经信息处理系统上的第28个国际会议的诉讼,第2卷 (NIPS 2015) . *Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2 (NIPS 2015).*

机器学习操作与开发操作Machine learning operations vs. development operations

尽管开发操作 (DevOps) 影响机器学习操作,但其过程之间存在差异。While development operations (DevOps) influence machine learning operations, there are differences between their processes. 除了 DevOps 做法以外,机器学习操作还解决了 DevOps 中未涵盖的以下概念:In addition to DevOps practices, machine learning operations address the following concepts not covered in DevOps:

  • 数据版本控制: 必须存在代码版本控制和数据集版本控制,因为架构和实际数据会随时间变化。Data versioning: There must be code versioning and dataset versioning, as the schema and actual data can change over time. 这样,就可以重现数据,使数据对其他团队成员可见,并有助于审核用例。This allows data to be reproduced, makes the data visible other team members, and helps use cases to be audited.

  • 模型跟踪: 模型项目通常存储在模型注册表中,后者应标识存储、版本控制和标记功能。Model tracking: Model artifacts are often stored in a model registry that should identify storage, versioning, and tagging capabilities. 这些注册表需要标识源代码、其参数和用于定型模型的相应数据,所有这些操作都指示模型的创建位置。These registries need to identify the source code, its parameters, and the corresponding data used to train the model, all of which indicate where a model was created.

  • 数字审计线索: 使用代码和数据时,需要跟踪所有更改。Digital audit trail: When working with code and data, all changes need to be tracked.

  • 归纳: 模型不同于重用代码,因为模型必须根据输入数据或方案进行优化。Generalization: Models are different than code for reuse, as models must be tuned based on the input data or scenario. 您可能需要对新数据的模型进行微调,以便将其用于新的方案。You might need to fine-tune the model for the new data to use it for a new scenario.

  • 模型 重新训练:随着时间的推移,模型性能可能会下降,因此,重新训练模型以使其保持有用,这一点很重要。Model retraining: Model performance can decrease over time, and it's important to retrain models for them to remain useful.

机器学习操作的方法Approaches to machine learning operations

组织内的数据科学家应用各种经验、成熟度、技能和工具来试验机器学习操作。Data scientists within an organization apply a broad spectrum of experience, maturity, skills, and tools to experimenting with machine learning operations. 由于一定要鼓励尽可能多的参与者来接纳 AI,因此,对所有组织应如何进行机器学习操作并不是很有可能的。Since it's important to encourage as many participants as possible to embrace AI, a consensus about how all organizations should approach machine learning operations isn't likely or desirable. 在这种不同的情况下,你的组织的一个明确起点是了解其大小和资源将如何影响其机器学习操作方法。In light of this variety, a clear starting point for your organization is to understand how its size and resources will influence its approach to machine learning operations.

公司的规模和成熟度表明,具有独特角色的数据科学团队或个人是否拥有机器学习生命周期,并在每个阶段中导航。A company's size and maturity indicate if data science teams or individuals with unique roles will own the machine learning lifecycle and navigate each phase. 对于每种方案,最常见的生命周期方法如下:The following approaches to the lifecycle are the most common to each scenario:

集中式方法A centralized approach

数据科学团队可能会在资源和专家有限的小型公司内监视机器学习生命周期。Data science teams will likely monitor the machine learning lifecycle in small companies with limited resources and specialists. 此团队应用其技术专业知识来清理和聚合数据,开发和部署模型,以及监视和维护已部署的模型。This team applies their technical expertise to cleaning and aggregating data, developing and deploying models, and monitoring and maintaining deployed models.

此方法的一个优点是,它可以快速地将模型提升到生产环境,但这会增加成本,因为需要在数据科学团队中维护专门的技能水平。One advantage of this method is that it progresses the model quickly to production, but it increases costs because of the specialized skill levels that need to be maintained on the data science team. 当所需的专业知识水平不存在时,质量将会受到影响。Quality suffers when those required levels of expertise aren't present.

分散方法A decentralized approach

具有专门角色的人员可能会监视并负责拥有专用运营团队的公司中的机器学习生命周期。Individuals with specialized roles will likely monitor and be responsible for the machine learning lifecycle in companies with dedicated operations teams. 批准某个用例后,将分配机器学习工程师来评估其当前状态以及将其提升为可以支持的格式所需的工作量。Once a use case is approved, a machine learning engineer is assigned to assess its current state and the amount of work needed to elevate it to format that can be supported.

数据科研人员需要收集以下问题的信息:The data scientist will need to gather information for the following questions:

  • 谁将成为企业所有者?Who will be the business owner?
  • 如何使用模型?How will the model be consumed?
  • 需要哪种级别的服务可用性?What level of service availability will be needed?
  • 如何对模型进行重新训练?How will the model be retrained?
  • 模型将重新训练多长时间?How often will the model be retrained?
  • 谁将决定模型升级的条件?Who will decide the conditions for model promotion?
  • 用例的敏感程度如何,代码是否可共享?How sensitive is the use case, and is the code shareable?
  • 未来将如何修改模型和代码?How will the model and code be modified in the future?
  • 当前试验有多少?How much of the current experiment is reusable?
  • 是否有可帮助的现有项目工作流?Are there existing project workflows that can assist?
  • 将模型提升到生产环境需要多少工作?How much work will be required to advance the model to production?

接下来,机器学习工程师将设计工作流并估计所需工作量。Next, a machine learning engineer designs the workflow and estimates the amount of work required. 一种最佳做法是将数据科学家与构建工作流) (这一次提供了一个关键的机会,让他们在最终的存储库中进行交叉训练和熟悉,因为数据科研人员通常会在将来处理用例。One best practice is to involve the data scientist(s) in building out the workflow; this time presents a key opportunity to cross-train and familiarize them with the final repo since it's common for the data scientist to work on the use case in the future.

机器学习操作如何受益于业务How machine learning operations benefit business

机器学习操作连接传统开发操作、数据操作和数据科学/AI。Machine learning operations connect traditional development operations, data operations, and data science/AI. 了解机器学习操作如何使你的业务受益,有助于你的 AI 旅程。Understanding how machine learning operations can benefit your business will help your AI journey.

将机器学习操作与您的业务集成可带来以下好处:Integrating machine learning operations with your business can create the following benefits:

  • 企业模型管理可简化模型开发、培训、部署和操作化的生命周期。Enterprise model management streamlines and automates the lifecycle for model development, training, deployment, and operationalization. 这样,企业就可以灵活地响应即时需求,并以可重复和托管的方式对业务做出更改。This allows businesses to be agile and respond to immediate needs and business changes in a repeatable and managed way.

  • 模型版本控制和数据实现允许企业生成迭代和版本化模型,以调整数据或特定用例的细微差别。Model versioning and data realization allow the enterprise to generate iterated and versioned models to adjust to the nuances of the data or the particular use case. 这为响应业务挑战和更改提供灵活性和灵活性。This provides flexibility and agility in responding to business challenges and changes.

  • 当组织监视和管理其模型时,这有助于它们快速响应数据或方案中的重大更改。When organizations monitoring and manage their models, this helps them to quickly respond to significant changes in the data or the scenario. 例如,由于外部因素或基础数据发生更改,已实现的模型可能会遇到极端的数据偏移。For example, an implemented model might experience extreme data drift because of an external factor or a change in the underlying data. 这会使前面的模型不可用,并要求尽快重新训练当前模型。This would make the previous models unusable and require the current model to be retrained as soon as possible. 要跟踪的机器学习模型的准确性和性能。Machine learning models to be tracked for accuracy and performance. 它在更改影响模型的可靠性和性能时提醒利益干系人,这会导致快速重新训练和部署。It alerts stakeholders when changes impact model reliability and performance, which leads to quick retraining and deployment.

  • 通过在整个开发生命周期内进行快速审核、合规性、管理和访问控制,应用的机器学习操作流程支持业务成果。Applied machine learning operations processes support business outcomes by allowing rapid auditing, compliance, governance, and access control throughout the development lifecycle. 当业务中发生更改时,模型生成、数据使用和法规遵从性的可见性是显而易见的。The visibility of model generation, data usage, and regulatory compliance is clear as changes take place in the business.

后续步骤Next steps