Machine learning inference during deployment

When deploying your AI model to production, you need to consider how it will make predictions. The two main inference processes for AI models are:

  • Batch inference: An asynchronous process that bases its predictions on a batch of observations. The predictions are stored as files or in a database for end users or business applications.

  • Real-time (or interactive) inference: Frees the model to make predictions at any time and trigger an immediate response. This pattern can be used to analyze streaming and interactive application data.

Consider the following questions to evaluate your model, compare the two processes, and select the one that suits your model:

  • How often should predictions be generated?
  • How soon are the results needed?
  • Should predictions be generated individually, in small batches, or in large batches?
  • Is latency to be expected from the model?
  • How much compute power is needed to execute the model?
  • Are there operational implications and costs to maintain the model?

The following decision tree can help you to determine which deployment model best fits your use case:

A diagram of the real-time or batch inference decision tree.

Batch inference

Batch inference, sometimes called offline inference, is a simpler inference process in which models run at timed intervals and business applications store the predictions.

Consider the following best practices for batch inference:

  • Trigger batch scoring: Use Azure Machine Learning pipelines and the ParallelRunStep feature in Azure Machine Learning to set up a schedule or event-based automation. See the AI Show episode on performing batch inference with the Azure Machine Learning ParallelRunStep to learn more about the process. A minimal pipeline sketch follows this list.

  • Compute options for batch inference: Since batch inference processes don't run continuously, it's recommended to automatically start, stop, and scale reusable clusters that can handle a range of workloads. Different models require different environments, and your solution needs to be able to deploy a specific environment and remove it when inference is over for the compute to be available for the next model. See the following decision tree to identify the right compute instance for your model:

    A diagram of the compute decision tree.

  • Implement batch inference: Azure supports multiple features for batch inference. One feature is ParallelRunStep in Azure Machine Learning, which allows customers to gain insights from terabytes of structured or unstructured data stored in Azure. ParallelRunStep provides out-of-the-box parallelism and works within Azure Machine Learning pipelines.

  • Batch inference challenges: While batch inference is a simpler way to use and deploy your model in production, it does present its own challenges:

    • Depending on the frequency at which inference runs, the data produced could be irrelevant by the time it's accessed.

    • A variation of the cold-start problem: results might not be available for new data. For example, if a new user creates an account and starts shopping with a retail recommendation system, product recommendations won't be available until after the next batch inference run. If this is an obstacle for your use case, consider real-time inference.

    • Deploying to many regions and high availability aren't critical concerns in a batch inference scenario. The model doesn't need to be deployed to multiple regions, but the data store might need to be deployed with a high-availability strategy in many locations. This will normally follow the application HA design and strategy.
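
For illustration, batch scoring with ParallelRunStep can be wired into a scheduled Azure Machine Learning pipeline. The following is a minimal sketch, assuming the Azure Machine Learning Python SDK v1; the compute target, dataset, environment, and entry-script names are placeholders:

```python
from azureml.core import Workspace, Environment, Dataset
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline, PipelineData, Schedule, ScheduleRecurrence
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
input_ds = Dataset.get_by_name(ws, name="scoring-data")            # hypothetical registered dataset
output = PipelineData(name="inference_output", datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",                                     # folder containing the entry script
    entry_script="batch_score.py",                                  # hypothetical scoring script
    mini_batch_size="10",                                           # work handed to each worker call
    error_threshold=5,
    output_action="append_row",
    environment=Environment.get(ws, name="batch-env"),              # hypothetical environment
    compute_target=ComputeTarget(ws, "cpu-cluster"),                # reusable cluster that scales on demand
    node_count=2,
)

batch_step = ParallelRunStep(
    name="batch-scoring",
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input("scoring_data")],
    output=output,
)

pipeline = Pipeline(workspace=ws, steps=[batch_step])
published = pipeline.publish(name="batch-inference-pipeline")

# Trigger the published pipeline on a recurring schedule (daily in this sketch).
recurrence = ScheduleRecurrence(frequency="Day", interval=1)
Schedule.create(ws, name="daily-batch-scoring",
                pipeline_id=published.id,
                experiment_name="batch-inference",
                recurrence=recurrence)
```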

Real-time inference

Real-time, or interactive, inference is an architecture where model inference can be triggered at any time, and an immediate response is expected. This pattern can be used to analyze streaming data, interactive application data, and more. It allows you to take advantage of your machine learning model in real time and resolves the cold-start problem outlined above for batch inference.

If real-time inference is right for your model, consider the following challenges and best practices:

  • The challenges of real-time inference: Latency and performance requirements make real-time inference architecture more complex for your model. A system might need to respond in 100 milliseconds or less, during which it needs to retrieve the data, perform inference, validate and store the model results, run any required business logic, and return the results to the system or application.

  • Compute options for real-time inference: The best way to implement real-time inference is to deploy the model as a container to Docker or an Azure Kubernetes Service (AKS) cluster and expose it as a web service with a REST API. This way, the model executes in its own isolated environment and can be managed like any other web service. Docker/AKS capabilities can then be used for management, monitoring, scaling, and more. The model can be deployed on-premises, in the cloud, or on the edge. The preceding compute decision tree also applies to real-time inference. A minimal deployment sketch follows this list.

  • Multiregional deployment and high availability: Regional deployment and high availability architectures need to be considered in real-time inference scenarios, as latency and the model's performance will be critical. To reduce latency in multiregional deployments, it's recommended to locate the model as close as possible to the consumption point. The model and supporting infrastructure should follow the business' high availability and disaster recovery principles and strategy.
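
As an illustration, a registered model can be deployed to an AKS cluster and exposed as a REST web service. The following is a minimal sketch, assuming the Azure Machine Learning Python SDK v1; the model, environment, scoring script, and AKS compute target names are placeholders:

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()
model = Model(ws, name="sales-forecast")                  # hypothetical registered model

inference_config = InferenceConfig(
    entry_script="score.py",                              # hypothetical scoring script with init()/run()
    source_directory="scripts",
    environment=Environment.get(ws, name="realtime-env"), # hypothetical environment
)

deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=2,
    autoscale_enabled=True,                               # let AKS scale replicas with request load
    auth_enabled=True,                                    # key-based auth on the REST endpoint
)

service = Model.deploy(
    workspace=ws,
    name="sales-forecast-service",
    models=[model],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=ws.compute_targets["aks-cluster"],  # hypothetical AKS compute target
)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)                                # REST endpoint that client applications call
```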

Many-models scenario

A singular model might not be able to capture the complex nature of real-world problems, such as predicting sales for a supermarket where demographics, brand, SKUs, and other features could cause customer behavior to vary significantly. Similarly, predictive maintenance for smart meters could vary significantly by region. Having many models for these scenarios to capture regional data or store-level relationships could produce higher accuracy than a single model. This approach assumes that enough data is available for this level of granularity.

At a high level, a many-models scenario occurs in three stages: data source, data science, and many models.

A diagram of a many-models scenario.

Data source: It's important to segment the data without too many cardinalities in the data source stage. The product ID or barcode shouldn't be factored into the main partition, as this will produce too many segments and could inhibit meaningful models. The brand, SKU, or locality could be more fitting features. It's also important to homogenize the data by removing anomalies that would skew the data distribution.

Data science: Several experiments run in parallel for each data partition in the data science stage. This is typically an iterative process where models from the experiments are evaluated to determine the best one.

Many models: The best models for each segment or category are registered in the model registry. Assign meaningful names to the models, which will make them more discoverable for inference. Use tagging where necessary to group the models into specific categories, as shown in the sketch below.
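
A rough sketch of this registration step, assuming the Azure Machine Learning Python SDK v1; the model file, names, and tag values are illustrative placeholders:

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Register the best model for one segment with a meaningful name and tags
# so it can be discovered later for inference.
Model.register(
    workspace=ws,
    model_path="outputs/store_042_model.pkl",         # hypothetical serialized model file
    model_name="sales-forecast-store-042",
    tags={"category": "grocery", "region": "west", "store_id": "042"},
)

# Later, all models for a whole category can be discovered by tag.
grocery_models = Model.list(ws, tags=[["category", "grocery"]])
```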

Batch inference for many models

During batch inference for many models, predictions are typically scheduled and recurring, and they can handle large volumes of data running at the same time. Unlike in a single-model scenario, many models run inference at the same time, and it's important to select the correct ones. The following diagram shows the reference pattern for many-models batch inference:

A diagram of the reference pattern for many-models batch inference.

The core purpose of this pattern is to observe the models and run multiple models simultaneously to achieve a highly scalable inference solution that can handle large data volumes. To achieve hierarchical model inference, many models can be split into categories. Each category can have its own inference storage, like an Azure data lake. When implementing this pattern, you need to balance scaling the models horizontally and vertically, as this will impact cost and performance. Running too many model instances might increase performance but impact the cost. Too few instances with high-spec nodes might be more cost effective but could cause issues with scaling. A sketch of a many-models batch scoring entry script follows.
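
In a ParallelRunStep-based implementation, the entry script typically loads the models for its group once and then routes each mini-batch to the right model. The following is a rough sketch under illustrative assumptions: the input is a tabular dataset partitioned by a store_id column, and the models were registered with category and store_id tags as in the previous sketch:

```python
# batch_score.py - hypothetical ParallelRunStep entry script for many models
import os
import joblib
import pandas as pd
from azureml.core import Run
from azureml.core.model import Model

models = {}

def init():
    """Download and cache every model tagged for this category, once per node."""
    ws = Run.get_context().experiment.workspace
    for m in Model.list(ws, tags=[["category", "grocery"]]):          # illustrative tag filter
        path = m.download(target_dir=os.path.join("models", m.name), exist_ok=True)
        models[m.tags["store_id"]] = joblib.load(path)

def run(mini_batch: pd.DataFrame) -> pd.DataFrame:
    """Route each row to the model for its store and append the prediction."""
    results = []
    for store_id, group in mini_batch.groupby("store_id"):
        model = models[str(store_id)]
        features = group.drop(columns=["store_id"])
        results.append(group.assign(prediction=model.predict(features)))
    return pd.concat(results)
```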

Real-time inference for many models

Real-time many-models inference requires low latency and on-demand requests, typically via a REST endpoint. This is useful when external applications or services require a standard interface to interact with the model, typically via a REST interface with a JSON payload.

A diagram of many-models real-time inference.

The core purpose of this pattern is to use a discovery service to identify a list of services and their metadata. This can be implemented as an Azure function and enables clients to obtain the relevant details of a service, which can be invoked with a secure REST URI. A JSON payload is sent to the service, which summons the relevant model and provides a JSON response back to the client.

Each service is a stateless microservice that can handle multiple requests simultaneously and is limited by the physical virtual machine's resources. The service can deploy multiple models if multiple groups are selected; homogeneous groupings like category, SKU, and so on are recommended for this. The mapping between the service request and the model selected for a given service needs to be baked into the inference logic, typically via the scoring script, as shown in the sketch below. If the size of the models is relatively small (a few megabytes), it's recommended to load them in memory for performance reasons; otherwise, each model can be loaded dynamically per request.
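
A rough sketch of such a scoring script, assuming the standard Azure Machine Learning init()/run() entry-script contract; the model keys and payload shape are illustrative placeholders:

```python
# score.py - hypothetical scoring script for a many-models real-time service
import glob
import json
import os

import joblib

models = {}

def init():
    """Load every small model deployed with this service into memory once."""
    model_dir = os.environ["AZUREML_MODEL_DIR"]      # root folder of the deployed models
    for path in glob.glob(os.path.join(model_dir, "**", "*.pkl"), recursive=True):
        # Key each model by its file name, e.g. sku_beverages.pkl -> "sku_beverages".
        key = os.path.splitext(os.path.basename(path))[0]
        models[key] = joblib.load(path)

def run(raw_data):
    """Pick the model named in the JSON payload and return its prediction."""
    payload = json.loads(raw_data)
    model = models.get(payload["model"])
    if model is None:
        return {"error": "unknown model: " + str(payload["model"])}
    prediction = model.predict([payload["features"]]).tolist()
    return {"model": payload["model"], "prediction": prediction}
```

A client would then POST a JSON body such as {"model": "sku_beverages", "features": [12.5, 3, 0.7]} to the service's scoring URI and receive the prediction back as JSON.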

Next steps

Explore the following resources to learn more about inferencing with Azure Machine Learning: