Establish an operational fitness review
As your enterprise begins to operate workloads in Azure, the next step is to establish a process for operational fitness review. This process enumerates, implements, and iteratively reviews the nonfunctional requirements for these workloads. Nonfunctional requirements relate to the expected operational behavior of the service and are grouped under the pillars of the Microsoft Azure Well-Architected Framework:
- Cost optimization
- Operational excellence
- Performance efficiency
- Reliability
- Security
A process for operational fitness review ensures that your mission-critical workloads meet the expectations of your business with respect to these pillars.
Create a process for operational fitness review to fully understand the problems that result from running workloads in a production environment, and how to remediate and resolve those problems. This article outlines a high-level process for operational fitness review that your enterprise can use to achieve this goal.
Operational fitness at Microsoft
From the outset, many teams across Microsoft have been involved in the development of the Azure platform. It's difficult to ensure quality and consistency for a project of such size and complexity. You need a robust process to enumerate and implement fundamental nonfunctional requirements on a regular basis.
The processes that Microsoft follows form the basis for the processes outlined in this article.
Understand the problem
As discussed in Get started: Accelerate migration, the first step in an enterprise's digital transformation is to identify the business problems to be solved by adopting Azure. The next step is to determine a high-level solution to the problem, such as migrating a workload to the cloud or adapting an existing, on-premises service to include cloud functionality. Finally, you design and implement the solution.
During this process, the focus is often on the features of the service: the set of functional requirements that you want the service to perform. For example, a product-delivery service requires features for determining the source and destination locations of the product, tracking the product during delivery, and sending notifications to the customer.
The nonfunctional requirements, in contrast, relate to properties such as the service's availability, resiliency, and scalability. These properties differ from the functional requirements because they don't directly affect the final function of any particular feature in the service. However, nonfunctional requirements do relate to the performance and continuity of the service.
You can specify some nonfunctional requirements in terms of a service-level agreement (SLA). For example, you can express service continuity as a percentage of availability: "available 99.99 percent of the time". Other nonfunctional requirements might be more difficult to define and might change as production needs change. For example, a consumer-oriented service might face unanticipated throughput requirements after a surge of popularity.
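To make an availability percentage concrete in a review discussion, it can help to convert it into a downtime budget. The following is a minimal sketch of that arithmetic. The function name `allowed_downtime_minutes` and the 30-day review period are assumptions made for illustration and are not part of any Azure SLA definition.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, period_days: int) -> float:
    """Return the maximum downtime, in minutes, that still meets the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


# Compare two hypothetical targets over an assumed 30-day period.
for target in (99.9, 99.99):
    budget = allowed_downtime_minutes(target, period_days=30)
    print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.99 percent target leaves roughly 4.3 minutes of downtime in a 30-day period, which is a useful reference point when the business owner and engineering discuss whether the target is realistic.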
For more information about resiliency requirements, see Designing reliable Azure applications. That article includes explanations of concepts like recovery-point objective (RPO), recovery-time objective (RTO), and SLA.
Process for operational fitness review
The key to maintaining the performance and continuity of an enterprise's services is to implement a process for operational fitness review.
At a high level, the process has two phases. In the prerequisites phase, the requirements are established and mapped to supporting services. This phase occurs infrequently: perhaps annually or when new operations are introduced. The output of the prerequisites phase is used in the flow phase. The flow phase occurs more frequently, such as monthly.
The steps in the prerequisites phase capture the requirements for conducting a regular review of the important services.
1. Identify critical business operations: Identify the enterprise's mission-critical business operations. Business operations are independent from any supporting service functionality. In other words, business operations represent the actual activities that the business needs to perform and that are supported by a set of IT services. The term mission-critical (or business-critical) reflects a severe impact on the business if the operation is impeded. For example, an online retailer might have business operations such as "enable a customer to add an item to a shopping cart" or "process a credit card payment." If either of these operations fails, a customer can't complete the transaction and the enterprise fails to realize sales.
2. Map operations to services: Map the critical business operations to the services that support them. In the shopping-cart example, several services might be involved, including an inventory stock-management service and a shopping-cart service. To process a credit-card payment, an on-premises payment service might interact with a third-party payment-processing service.
3. Analyze service dependencies: Most business operations require orchestration among multiple supporting services. It's important to understand the dependencies between the services, and the flow of mission-critical transactions through these services. Also consider the dependencies between on-premises services and Azure services. In the shopping-cart example, the inventory stock-management service might be hosted on-premises and ingest data entered by employees from a physical warehouse. However, it might store data off-premises in an Azure service, such as Azure Storage, or a database, such as Azure Cosmos DB.
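The mapping and dependency analysis described above can be sketched as a small data model. The following example is illustrative only: the service names, hosting locations, and helper functions are hypothetical and loosely follow the shopping-cart scenario. It collects every service that an operation depends on, directly or transitively, and lists the dependencies that cross a hosting boundary (for example, on-premises to Azure).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    location: str                      # assumed labels: "on-premises", "azure", "external"
    depends_on: tuple[str, ...] = ()   # names of services this one calls


# Hypothetical service catalog for the online-retailer example.
SERVICES = {s.name: s for s in (
    Service("shopping-cart", "azure", ("inventory",)),
    Service("inventory", "on-premises", ("azure-storage",)),
    Service("azure-storage", "azure"),
    Service("payment", "on-premises", ("third-party-processor",)),
    Service("third-party-processor", "external"),
)}

# Business operations mapped to the services that support them directly.
OPERATIONS = {
    "add item to shopping cart": ["shopping-cart"],
    "process credit card payment": ["payment"],
}


def supporting_services(operation: str) -> set[str]:
    """Return every service the operation depends on, directly or transitively."""
    seen: set[str] = set()
    stack = list(OPERATIONS[operation])
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(SERVICES[name].depends_on)
    return seen


def boundary_crossings(operation: str) -> list[tuple[str, str]]:
    """Return (caller, callee) pairs whose hosting locations differ."""
    pairs = []
    for name in sorted(supporting_services(operation)):
        service = SERVICES[name]
        for dependency in service.depends_on:
            if SERVICES[dependency].location != service.location:
                pairs.append((name, dependency))
    return pairs


print(sorted(supporting_services("add item to shopping cart")))
print(boundary_crossings("add item to shopping cart"))
```

Keeping the operation-to-service map separate from the service-to-service dependencies mirrors the distinction the text draws between business operations and the IT services that support them.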
An output from these activities is a set of scorecard metrics for service operations. The scorecard measures criteria such as availability, scalability, and disaster recovery. Scorecard metrics express the operational criteria that you expect the service to meet. These metrics can be expressed at any level of granularity that's appropriate for the service operation.
The scorecard should be expressed in simple terms to facilitate meaningful discussion between the business owners and engineering. For example, a scorecard metric for scalability might be color-coded in a simple way. Green means meeting the defined criteria, yellow means failing to meet the defined criteria but actively implementing a planned remediation, and red means failing to meet the defined criteria with no plan or action.
It's important to emphasize that these metrics should directly reflect business needs.
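The color coding described above reduces to two yes-or-no questions per metric. The following sketch shows one possible encoding. The `Status` enum, the `scorecard_status` function, and the sample scorecard entries are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # meets the defined criteria
    YELLOW = "yellow"  # fails the criteria, planned remediation is under way
    RED = "red"        # fails the criteria, no plan or action


def scorecard_status(meets_criteria: bool, remediation_in_progress: bool) -> Status:
    """Map the two review questions to a scorecard color."""
    if meets_criteria:
        return Status.GREEN
    if remediation_in_progress:
        return Status.YELLOW
    return Status.RED


# Hypothetical scorecard: metric name -> (meets criteria, remediation in progress).
scorecard = {
    "availability": (True, False),
    "scalability": (False, True),
    "disaster recovery": (False, False),
}

for metric, (meets, remediating) in scorecard.items():
    print(f"{metric}: {scorecard_status(meets, remediating).value}")
```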
The flow phase, in which the services are reviewed, is the core of the operational fitness review. It involves these steps:
1. Measure service metrics: Use the scorecard metrics to monitor the services, to ensure that the services meet the business expectations. Service monitoring is essential. If you can't monitor a set of services with respect to the nonfunctional requirements, consider the corresponding scorecard metrics to be red. In this case, the first step for remediation is to implement the appropriate service monitoring. For example, if the business expects a service to operate with 99.99 percent availability, but there is no production telemetry in place to measure availability, assume that you're not meeting the requirement.
2. Plan remediation: For each service operation for which metrics fall below an acceptable threshold, determine the cost of remediating the service to bring the operation to an acceptable level. If the cost of remediating the service is greater than the expected revenue generation of the service, move on to consider the intangible costs, such as customer experience. For example, if customers have difficulty placing a successful order by using the service, they might choose a competitor instead.
3. Implement remediation: After the business owners and engineering team agree on a plan, implement it. Report the status of the implementation whenever you review scorecard metrics.
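The measurement and planning steps above can be sketched in a few lines. This example assumes that availability is estimated from a list of boolean health-probe results, which is only one of many possible telemetry sources. The function names and thresholds are hypothetical. The key behavior is that missing telemetry is treated as red, as the text recommends.

```python
from typing import Optional, Sequence


def measured_availability(probe_results: Sequence[bool]) -> Optional[float]:
    """Return availability as a percentage, or None when no telemetry exists."""
    if not probe_results:
        return None
    return 100 * sum(probe_results) / len(probe_results)


def review_metric(probe_results: Sequence[bool],
                  target_percent: float,
                  remediation_planned: bool) -> str:
    """Return the scorecard color for an availability metric."""
    availability = measured_availability(probe_results)
    if availability is None:
        # Without telemetry, assume the requirement is not being met.
        return "red"
    if availability >= target_percent:
        return "green"
    return "yellow" if remediation_planned else "red"


def needs_intangible_cost_review(remediation_cost: float, expected_revenue: float) -> bool:
    """True when remediation costs more than the revenue the service is expected to generate."""
    return remediation_cost > expected_revenue


print(review_metric([], target_percent=99.99, remediation_planned=True))
print(review_metric([True] * 99 + [False], target_percent=99.99, remediation_planned=True))
print(needs_intangible_cost_review(remediation_cost=50_000, expected_revenue=30_000))
```

When `needs_intangible_cost_review` returns true, the comparison alone doesn't settle the decision. It signals that the review should also weigh intangible costs such as customer experience.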
This process is iterative, and ideally your enterprise has a team dedicated to it. This team should meet regularly to review existing remediation projects, kick off the fundamental review of new workloads, and track the enterprise's overall scorecard. The team should also have the authority to hold remediation teams accountable if they're behind schedule or fail to meet metrics.
Structure of the review team
The team responsible for operational fitness review is composed of the following roles:
- Business owner: Provides knowledge of the business to identify and prioritize each mission-critical business operation. This role also compares the mitigation cost to the business impact, and drives the final decision on remediation.
- Business advocate: Breaks down business operations into discrete parts, and maps those parts to services and infrastructure, whether on-premises or in the cloud. The role requires deep knowledge of the technology associated with each business operation.
- Engineering owner: Implements the services associated with the business operation. These individuals might participate in the design, implementation, and deployment of any solutions for nonfunctional requirement problems that are uncovered by the review team.
- Service owner: Operates the business's applications and services. These individuals collect logging and usage data for these applications and services. This data is used both to identify problems and to verify fixes after they're deployed.
We recommend that your review team meet on a regular basis. For example, the team might meet monthly, and then report status and metrics to senior leadership on a quarterly basis.
Adapt the details of the process and meeting to fit your specific needs. We recommend the following tasks as a starting point:
1. The business owner and business advocate enumerate and determine the nonfunctional requirements for each business operation, with input from the engineering and service owners. For business operations that have been identified previously, review and verify the priority. For new business operations, assign a priority in the existing list.
2. The engineering and service owners map the current state of business operations to the corresponding on-premises and cloud services. The mapping is a list of the components in each service, oriented as a dependency tree. The engineering and service owners then determine the critical paths through the tree.
3. The engineering and service owners review the current state of operational logging and monitoring for the services listed in the previous step. Robust logging and monitoring are critical: they identify service components that contribute to a failure to meet nonfunctional requirements. If sufficient logging and monitoring aren't in place, the team must put them in place by creating and implementing a plan.
4. The team creates scorecard metrics for new business operations. The scorecard consists of the list of constituent components for each service identified in step 2. It's aligned with the nonfunctional requirements, and includes a measure of how well each component meets the requirements.
5. For constituent components that fail to meet nonfunctional requirements, the team designs a high-level solution, and assigns an engineering owner. At this point, the business owner and business advocate establish a budget for the remediation work, based on the expected revenue of the business operation.
6. Finally, the team conducts a review of the ongoing remediation work. Each of the scorecard metrics for work in progress is reviewed against the expected criteria. For constituent components that meet metric criteria, the service owner presents logging and monitoring data to confirm that the criteria are met. For those constituent components that don't meet metric criteria, each engineering owner explains the problems that are preventing criteria from being met, and presents any new designs for remediation.
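The dependency tree from the mapping task and the per-component scorecard can be combined to show which paths through the tree need attention. The sketch below is one possible approach under stated assumptions: the component names and statuses are hypothetical, the tree is assumed to be acyclic, and a path is flagged when any component on it is not green. Deciding which paths are truly critical remains a judgment for the engineering and service owners.

```python
# Hypothetical dependency tree: each node maps to the components it depends on.
DEPENDENCY_TREE = {
    "process credit card payment": ["payment-api"],
    "payment-api": ["payment-db", "third-party-processor"],
    "payment-db": [],
    "third-party-processor": [],
}

# Hypothetical scorecard colors for each constituent component.
COMPONENT_STATUS = {
    "payment-api": "green",
    "payment-db": "yellow",
    "third-party-processor": "green",
}


def paths_from(node: str, tree: dict[str, list[str]]) -> list[list[str]]:
    """Return every path from the node down to a leaf (the tree must be acyclic)."""
    children = tree.get(node, [])
    if not children:
        return [[node]]
    return [[node] + rest for child in children for rest in paths_from(child, tree)]


def paths_needing_attention(root: str,
                            tree: dict[str, list[str]],
                            status: dict[str, str]) -> list[list[str]]:
    """Return paths that contain a component that is not green.

    Components without a recorded status are treated as red, because an
    unmonitored component can't be shown to meet its requirements.
    """
    return [
        path for path in paths_from(root, tree)
        if any(status.get(component, "red") != "green" for component in path[1:])
    ]


for path in paths_needing_attention("process credit card payment",
                                    DEPENDENCY_TREE, COMPONENT_STATUS):
    print(" -> ".join(path))
```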
- Microsoft Azure Well-Architected Framework: Learn about guiding tenets for improving the quality of a workload. The framework consists of five pillars of architecture excellence:
  - Cost optimization
  - Operational excellence
  - Performance efficiency
  - Reliability
  - Security
- Ten design principles for Azure applications: Follow these design principles to make your application more scalable, resilient, and manageable.
- Designing resilient applications for Azure: Build and maintain reliable systems using a structured approach over the lifetime of an application, from design and implementation to deployment and operations.
- Cloud design patterns: Use design patterns to build applications on the pillars of architecture excellence.
- Azure Advisor: Azure Advisor provides personalized recommendations based on your usage and configurations to help optimize your resources for high availability, security, performance, and cost.