
Design for operations

Design an application so that the operations team has the tools they need.

The cloud has dramatically changed the role of the operations team. They are no longer responsible for managing the hardware and infrastructure that host the application. That said, operations is still a critical part of running a successful cloud application. Some of the important functions of the operations team include:

  • Deployment
  • Monitoring
  • Escalation
  • Incident response
  • Security auditing

Robust logging and tracing are particularly important in cloud applications. Involve the operations team in design and planning, to ensure the application gives them the data and insight they need to be successful.

Recommendations

Make all things observable. Once a solution is deployed and running, logs and traces are your primary insight into the system. Tracing records a path through the system, and is useful for pinpointing bottlenecks, performance issues, and failure points. Logging captures individual events such as application state changes, errors, and exceptions. Log in production, or else you lose insight at the very times when you need it the most.
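To make the distinction between tracing and logging concrete, here is a minimal sketch using only the Python standard library. The `traced` helper, the `handle_request` function, and the "orders" logger name are hypothetical names chosen for illustration; a production system would more likely use a dedicated tracing library.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("orders")  # hypothetical service name


@contextmanager
def traced(step, spans):
    """Tracing: record how long one step on the request path took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed steps still appear in the trace.
        spans.append((step, time.perf_counter() - start))


def handle_request(spans):
    """Hypothetical handler with two steps on its path."""
    with traced("validate", spans):
        pass  # stand-in for real work
    with traced("save", spans):
        # Logging: capture an individual event, here a state change.
        logger.info("order state changed to %s", "saved")
```

After a call to `handle_request(spans)`, the `spans` list holds the ordered path through the handler with a duration for each step, while the log holds the individual event.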

Instrument for monitoring. Monitoring gives insight into how well (or poorly) an application is performing, in terms of availability, performance, and system health. For example, monitoring tells you whether you are meeting your SLA. Monitoring happens during the normal operation of the system. It should be as close to real-time as possible, so that the operations staff can react to issues quickly. Ideally, monitoring can help avert problems before they lead to a critical failure. For more information, see Monitoring and diagnostics.
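As a rough illustration of checking an SLA from monitoring data, the sketch below computes availability from request counts. The function names and the 99.9% default target are assumptions made for the example, not values from any particular SLA.

```python
def availability(successful, total):
    """Fraction of requests that succeeded in the measurement window."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful / total


def meets_sla(successful, total, target=0.999):
    """True if measured availability is at or above the (hypothetical) target."""
    return availability(successful, total) >= target
```

For instance, 9,980 successful requests out of 10,000 gives an availability of 0.998, which falls short of a 0.999 target.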

Instrument for root cause analysis. Root cause analysis is the process of finding the underlying cause of failures. It occurs after a failure has already happened.

Use distributed tracing. Use a distributed tracing system that is designed for concurrency, asynchrony, and cloud scale. Traces should include a correlation ID that flows across service boundaries. A single operation may involve calls to multiple application services. If an operation fails, the correlation ID helps to pinpoint the cause of the failure.
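The following sketch shows one way a correlation ID might flow across a service boundary, using only the Python standard library. The header name `X-Correlation-ID` and the helper names are assumptions for illustration; a real deployment would typically rely on a tracing library and a standard propagation format rather than hand-rolled code.

```python
import contextvars
import logging
import uuid

# Hypothetical header name, chosen for this example.
CORRELATION_HEADER = "X-Correlation-ID"

# A context variable keeps the ID separate per thread and per async task.
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = _correlation_id.get()
        return True


def begin_operation(incoming_headers):
    """Reuse the caller's correlation ID, or create one at the system edge."""
    cid = incoming_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to calls made to downstream services."""
    return {CORRELATION_HEADER: _correlation_id.get()}
```

Each service calls `begin_operation` when a request arrives and passes `outgoing_headers()` on every downstream call, so all log records for one operation share the same ID and can be joined when investigating a failure.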

Standardize logs and metrics. The operations team will need to aggregate logs from across the various services in your solution. If every service uses its own logging format, it becomes difficult or impossible to get useful information from them. Define a common schema that includes fields such as correlation ID, event name, IP address of the sender, and so forth. Individual services can derive custom schemas that inherit the base schema and contain additional fields.
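A base schema with service-specific extensions could be sketched with Python dataclasses as follows. The class names, the `order_id` and `item_count` fields, and the choice of JSON as the output format are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BaseLogEvent:
    """Fields every service emits, taken from the common schema."""
    correlation_id: str
    event_name: str
    sender_ip: str

    def to_json(self):
        # Sorted keys keep the output stable for aggregation tools.
        return json.dumps(asdict(self), sort_keys=True)


@dataclass
class OrderLogEvent(BaseLogEvent):
    """Hypothetical service schema: inherits the base and adds fields."""
    order_id: str = ""
    item_count: int = 0
```

Because `OrderLogEvent` inherits from `BaseLogEvent`, an aggregator can rely on the three base fields being present in every record, whichever service produced it.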

Automate management tasks, including provisioning, deployment, and monitoring. Automating a task makes it repeatable and less prone to human error.

Treat configuration as code. Check configuration files into a version control system, so that you can track and version your changes, and roll back if needed.