The machine learning operations process

The model development process

The development process should produce the following outcomes:

  • Training is automated, and models are validated. Validation includes testing both functionality and performance (for example, with accuracy metrics).

  • Deployment to the infrastructure used for inferencing, including monitoring, is automated.

  • Mechanisms create an end-to-end data audit trail.

  • Automatic model retraining occurs when data drifts over time, which matters for large-scale systems that have machine learning built in.

The following diagram depicts the deployment lifecycle of a machine learning system:

Diagram of the machine learning lifecycle.

Once developed, a machine learning model is trained, validated, deployed, and monitored. At both the managerial and the technical level, an organization needs to define who owns and implements this process. In larger enterprises, a data scientist might own the model training and validation steps, and a machine learning engineer might handle the remaining steps. In smaller companies, a data scientist might own all steps.

Train the model

In this step, a training dataset is used to train the machine learning model. The training code is version-controlled and reusable, so training can be started with a button click or automatically by an event trigger, such as a new version of the data becoming available.
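As a minimal sketch of an event trigger, the following Python example starts training only when newly published data differs from the data used in the last run. The names `dataset_fingerprint` and `TrainingTrigger` are hypothetical and not part of any Azure API; a real setup would typically rely on the dataset versioning of the platform in use.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest that changes when the rows change."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


class TrainingTrigger:
    """Hypothetical trigger: calls train_fn only for a new data version."""

    def __init__(self, train_fn):
        self._train_fn = train_fn
        self._last_fingerprint = None

    def on_data_published(self, rows):
        """Train if the data changed; return True when training ran."""
        fingerprint = dataset_fingerprint(rows)
        if fingerprint == self._last_fingerprint:
            return False  # same data version, nothing to do
        self._train_fn(rows)
        self._last_fingerprint = fingerprint
        return True


if __name__ == "__main__":
    trigger = TrainingTrigger(lambda rows: print(f"training on {len(rows)} rows"))
    trigger.on_data_published([(1, "a"), (2, "b")])  # trains
    trigger.on_data_published([(1, "a"), (2, "b")])  # skipped, unchanged
```

Because the training function is passed in, the same version-controlled training code can be reused by a manual run and by the trigger.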

Validate the model

This step uses established metrics, such as accuracy, to automatically validate the newly trained model and compare it to older ones. Did its accuracy increase? If so, the model can be registered in the model registry so that later steps can consume it. If the new model performs worse, a data scientist can be alerted to investigate why or to discard the newly trained model.
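The comparison described above could look like the following Python sketch. `ModelRegistry` here is a hypothetical in-memory stand-in, not the Azure Machine Learning model registry, and accuracy is assumed to be the only validation metric.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory stand-in for a model registry."""
    versions: list = field(default_factory=list)

    def register(self, name, accuracy):
        self.versions.append({"name": name, "accuracy": accuracy})
        return len(self.versions)  # 1-based version number

    def latest_accuracy(self, name):
        for entry in reversed(self.versions):
            if entry["name"] == name:
                return entry["accuracy"]
        return None  # nothing registered under this name yet


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def validate_and_register(registry, name, predictions, labels):
    """Register the new model only if it beats the registered one."""
    new_acc = accuracy(predictions, labels)
    old_acc = registry.latest_accuracy(name)
    if old_acc is None or new_acc > old_acc:
        version = registry.register(name, new_acc)
        return {"registered": True, "version": version, "accuracy": new_acc}
    # Not better: leave the registry untouched so a person can investigate.
    return {"registered": False, "version": None, "accuracy": new_acc}
```

When the function returns `"registered": False`, that result is the signal to alert a data scientist.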

Deploy the model

In the deployment step, the model is deployed as an API service for web applications. This approach lets the model be scaled and updated independently of the applications that call it. Alternatively, the model can be used for batch scoring, where it runs once or periodically to calculate predictions on new data points. Batch scoring is useful when large amounts of data need to be processed asynchronously. For more details on deploying models, see the machine learning inference during deployment page.
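To illustrate the batch scoring option, here is a minimal Python sketch that scores rows in fixed-size chunks. The `threshold_model` function is a hypothetical placeholder for a trained model, and the batch size is an arbitrary assumption.

```python
def batches(rows, size):
    """Yield consecutive slices of rows with at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def batch_score(model, rows, batch_size=1000):
    """Apply model to rows chunk by chunk and return all predictions in order."""
    predictions = []
    for chunk in batches(rows, batch_size):
        predictions.extend(model(chunk))
    return predictions


def threshold_model(chunk):
    """Hypothetical stand-in for a trained model: 1 if the value is >= 0.5."""
    return [1 if value >= 0.5 else 0 for value in chunk]


if __name__ == "__main__":
    print(batch_score(threshold_model, [0.1, 0.7, 0.5, 0.9, 0.2], batch_size=2))
```

Chunking keeps memory use bounded, which is one reason batch jobs suit large, asynchronous workloads.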

Monitor the model

It's necessary to monitor the model for two key reasons. First, monitoring helps ensure that the model is technically functional; for example, that it can generate predictions. This is important if an organization's applications depend on the model and use it in real time. Second, monitoring helps organizations gauge whether the model continues to generate useful predictions. Predictions can stop being useful when data drift occurs, that is, when the data used to train the model differs significantly from the data sent to the model during the prediction phase. For example, a model trained to recommend products to young people might produce undesirable results when recommending products to people from a different age group. Model monitoring that tracks data drift can detect this type of mismatch, alert machine learning engineers, and automatically retrain the model with more relevant or newer data.
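One way to quantify such a mismatch is the population stability index (PSI), sketched below in Python. PSI is a technique chosen here for illustration and is not prescribed by this article; the bin count and the small floor used for empty bins are assumptions.

```python
import math


def psi(baseline, production, bins=10):
    """Population stability index between two numeric samples.

    Bins are derived from the baseline range; production values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline must contain at least two distinct values")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    base_p = proportions(baseline)
    prod_p = proportions(production)
    return sum((q - p) * math.log(q / p) for p, q in zip(base_p, prod_p))


if __name__ == "__main__":
    training_ages = list(range(18, 31)) * 10    # model trained on young users
    production_ages = list(range(40, 61)) * 10  # scoring requests from older users
    print(f"PSI: {psi(training_ages, production_ages, bins=4):.2f}")
```

Identical distributions give a PSI of 0, and larger values indicate a stronger shift. The value at which to raise an alert is a per-project decision.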

How to monitor models

Because data drift, seasonality, and the availability of newer architectures tuned for better performance can all make a deployed model's performance wane over time, it's important to establish a process to continuously deploy models. Some best practices include:

  • Ownership: Assign an owner to the model performance monitoring process so that someone actively manages the model's performance.

  • Release pipelines: Set up a release pipeline in Azure DevOps first, and set its trigger to the model registry. When a new model is registered in the registry, the release pipeline triggers and signs off on a deployment process.
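The registry-triggered release described in the list above can be pictured with a small observer-style sketch in Python. `NotifyingRegistry` and `make_release_trigger` are hypothetical names; they only mimic the idea and do not call Azure DevOps or Azure Machine Learning.

```python
class NotifyingRegistry:
    """Hypothetical registry that notifies listeners on each registration."""

    def __init__(self):
        self._versions = {}
        self._listeners = []

    def on_model_registered(self, listener):
        """Subscribe a callable that takes (model_name, version)."""
        self._listeners.append(listener)

    def register(self, name):
        """Store a new version of the model and notify all listeners."""
        version = self._versions.get(name, 0) + 1
        self._versions[name] = version
        for listener in self._listeners:
            listener(name, version)
        return version


def make_release_trigger(log):
    """Return a listener that records the deployment it would start."""
    def release(name, version):
        log.append(f"deploy {name}:{version}")
    return release


if __name__ == "__main__":
    log = []
    registry = NotifyingRegistry()
    registry.on_model_registered(make_release_trigger(log))
    registry.register("churn")
    print(log)
```

The registry knows nothing about deployment; it only emits an event, which keeps the training side and the release side decoupled.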

Prerequisites for retraining models

Collecting data from models in production is one prerequisite for retraining models in a continuous integration/continuous delivery (CI/CD) framework. This process uses input data from scoring requests. The capability is currently limited to tabular data that can be parsed as JSON with minimal formatting and manipulation; video, audio, and images are excluded. It is available for models deployed on Azure Kubernetes Service (AKS). The collected data is stored in an Azure blob.

To prepare for retraining a model:

  1. Monitor the collected input data for drift. Setting up a monitoring process requires extracting the timestamp from the production data. The timestamp is needed to compare the production data with the baseline data (the training data used to build the model). The preferred way to monitor data drift is through Azure Monitor Application Insights. This feature provides an alert that can trigger actions such as email, SMS, push notifications, or Azure Functions. You need to enable Application Insights to log data.

  2. Analyze the collected data. Make sure to collect data from models in production, and include the results in the model scoring script. Collect all features used for model scoring; this ensures that every necessary feature is present and can be used as training data.

  3. Decide whether retraining with the collected data is necessary. Many things cause data drift, ranging from sensor issues and seasonality to changes in user behavior and data quality issues related to the data source. Model retraining isn't required in all cases, so it's recommended to investigate and understand the cause of the data drift before retraining.

  4. Retrain the model. Model training should already be automated, so this step involves triggering the existing training step. Retraining might be triggered when data drift has been detected (and the drift isn't caused by a data issue), or when a data engineer has published a new version of a dataset. Depending on the use case, these steps can be fully automated or supervised by a human. For example, some use cases, like product recommendations, could run autonomously in the future, while others, such as those in finance, must factor in standards like model fairness and transparency and require a human to approve newly trained models.
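The decision logic in steps 3 and 4 can be summarized in a short Python sketch. The `Action` values and the `decide_next_action` function are hypothetical; the inputs are assumed to come from the monitoring and analysis steps.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    INVESTIGATE_DATA = "investigate_data_issue"
    AWAIT_APPROVAL = "await_human_approval"
    RETRAIN = "retrain"


def decide_next_action(drift_detected, caused_by_data_issue,
                       new_dataset_version, requires_human_approval):
    """Map monitoring signals to the next step of the retraining process."""
    # Retrain on genuine drift or on a newly published dataset version.
    wants_retrain = new_dataset_version or (
        drift_detected and not caused_by_data_issue
    )
    if not wants_retrain:
        # Drift caused by a data issue should be fixed at the source first.
        return Action.INVESTIGATE_DATA if drift_detected else Action.NONE
    if requires_human_approval:
        return Action.AWAIT_APPROVAL
    return Action.RETRAIN


if __name__ == "__main__":
    # A recommendation use case with genuine drift and no approval gate.
    print(decide_next_action(True, False, False, False))
    # A finance use case: same signals, but a human must approve.
    print(decide_next_action(True, False, False, True))
```

Keeping the approval requirement as an explicit input makes it easy to run some use cases autonomously and others under supervision.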

At first, it's common for an organization to automate only a model's training and deployment, while the validation, monitoring, and retraining steps are performed manually. Over time, these steps can be automated one by one until the desired state is reached. DevOps and machine learning operations are concepts that develop over time, and organizations should be aware of their evolution.

The Team Data Science Process lifecycle

The Team Data Science Process (TDSP) provides a lifecycle to structure the development of your data science projects. The lifecycle outlines the major stages that projects typically execute, often iteratively:

  • Business understanding
  • Data acquisition and understanding
  • Modeling
  • Deployment

The goals, tasks, and documentation artifacts for each stage of the TDSP lifecycle are described in the Team Data Science Process lifecycle.

The roles and activities within machine learning operations

Per the TDSP lifecycle, the key roles in an AI project are data engineer, data scientist, and machine learning operations engineer. These roles are critical to your project's success and must work together toward accurate, repeatable, scalable, and production-ready solutions.

A diagram showing the machine learning operations process.

  • Data engineer: This role ingests, validates, and cleans the data. Once the data is refined, it's cataloged and made available for data scientists to use. At this stage, it's important to explore and analyze the data for duplicates, remove outliers, and identify missing data. These activities should be defined as pipeline steps and run as part of the training pipeline's preprocessing. Core and generated features should be given unique and specific names.

  • Data scientist (or AI engineer): This role navigates the training pipeline process and evaluates models. A data scientist receives data from the data engineer and identifies patterns and relationships within it, possibly selecting or generating features for the experiment. Because feature engineering plays a major role in building a sound, generalized model, this phase should be completed as thoroughly as possible. Various experiments can be performed with different algorithms and hyperparameters. Azure tools like automated machine learning can automate this task, which can also help avoid underfitting and overfitting a model. A successfully trained model is then registered in the model registry. The model should have a unique and specific name, and a version history should be retained for traceability.

  • Machine learning operations engineer: This role builds end-to-end pipelines for continuous integration and delivery. This work includes packaging the model in a Docker image, validating and profiling the model, awaiting approval from a stakeholder, and deploying the model in a container orchestration service such as AKS. Various triggers can be set during continuous integration, and the model's code can trigger the training pipeline and, afterward, the release pipeline.