
Protect and recover in cloud management

After they've met the requirements for inventory and visibility and for operational compliance, cloud management teams can anticipate and prepare for a potential workload outage. As they plan for cloud management, the teams must start with the assumption that something will fail.

No technical solution can consistently offer a 100 percent uptime SLA. Solutions with the most redundant architectures claim to deliver "six 9s," or 99.9999 percent uptime. But even a "six 9s" solution goes down for about 31.6 seconds in any given year. It's rare for a solution to warrant the large, ongoing operational investment that's required to reach "six 9s" of uptime.
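The downtime figure above follows from simple arithmetic. The following minimal sketch shows one way to reproduce it. The function name `annual_downtime_seconds` is illustrative, and the calculation assumes an average year of 365.25 days.

```python
# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def annual_downtime_seconds(nines: int) -> float:
    """Return the downtime per year implied by an uptime SLA with `nines` 9s.

    For example, nines=6 corresponds to 99.9999 percent uptime.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # fraction of the year the system is down
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        seconds = annual_downtime_seconds(n)
        print(f"{n} nines: {seconds:,.1f} seconds of downtime per year")
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is why the operational investment grows so quickly at the high end.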

Preparation for an outage allows the team to detect failures sooner and recover more quickly. The focus of this discipline is on the steps that come immediately after a system fails. How do you protect workloads so that they can be recovered quickly when an outage occurs?

Translate protection and recovery conversations

The workloads that power business operations consist of applications, data, virtual machines (VMs), and other assets. Each of those assets might require a different approach to protection and recovery. The important aspect of this discipline is to establish a consistent commitment within the management baseline, which can provide a starting point during business discussions.

At a minimum, each asset that supports any given workload should have a baseline approach with a clear commitment to speed of recovery (recovery time objective, or RTO) and risk of data loss (recovery point objective, or RPO).

Recovery time objectives (RTO)

When disaster strikes, a recovery time objective is the amount of time it should take to recover any system to its state prior to the disaster. For each workload, that includes the time required to restore minimum necessary functionality for the VMs and applications. It also includes the time required to restore the data that the applications need.

In business terms, RTO represents the amount of time that the business process will be out of service. For mission-critical workloads, this variable should be relatively low, allowing the business processes to resume quickly. For lower-priority workloads, a standard level of RTO might not have a noticeable impact on company performance.

The management baseline should establish a standard RTO for non-mission-critical workloads. The business can then use that baseline as a way to justify additional investments in recovery times.

Recovery point objectives (RPO)

In most cloud management systems, data is periodically captured and stored through some form of data protection. The last time data was captured is referred to as a recovery point. When a system fails, it can be restored only to the most recent recovery point.

If a system has a recovery point objective that's measured in hours or days, a system failure would result in the loss of data for the hours or days between the last recovery point and the outage. A one-day RPO would theoretically result in the loss of all transactions in the day leading up to the failure.
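To make the relationship between recovery points and data loss concrete, here is a minimal sketch. The function names and the sample timestamps are hypothetical and chosen only for illustration. It finds the most recent recovery point before a failure and checks the resulting data-loss window against an RPO.

```python
from datetime import datetime, timedelta
from typing import Iterable


def latest_recovery_point(recovery_points: Iterable[datetime],
                          failure_time: datetime) -> datetime:
    """Return the most recent recovery point at or before the failure."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return max(usable)


def data_loss_window(recovery_points: Iterable[datetime],
                     failure_time: datetime) -> timedelta:
    """Return how much recent data is lost when restoring after the failure."""
    return failure_time - latest_recovery_point(recovery_points, failure_time)


def meets_rpo(recovery_points: Iterable[datetime],
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """Return True if the data-loss window stays within the RPO."""
    return data_loss_window(recovery_points, failure_time) <= rpo


if __name__ == "__main__":
    # Hypothetical schedule: one recovery point per day at midnight.
    points = [datetime(2024, 1, day) for day in (1, 2, 3)]
    failure = datetime(2024, 1, 3, 18, 30)

    print("Data lost:", data_loss_window(points, failure))
    print("Meets one-day RPO:", meets_rpo(points, failure, timedelta(days=1)))
    print("Meets 15-minute RPO:", meets_rpo(points, failure, timedelta(minutes=15)))
```

In this example, a daily capture schedule satisfies a one-day RPO but not a 15-minute one. Meeting the shorter objective would require far more frequent recovery points.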

For mission-critical systems, an RPO that's measured in minutes or seconds might be more appropriate to avoid a loss in revenue. But a shorter RPO generally results in an increase in overall management costs.

To help minimize costs, a management baseline should focus on the longest acceptable RPO. The cloud management team can then shorten the RPO of specific platforms or workloads where the business impact warrants more investment.

Protect and recover workloads

Most of the workloads in an IT environment support a specific business or technical process. Systems that don't have a systemic impact on business operations often don't warrant the increased investments required to recover quickly or minimize data loss. By establishing a baseline, the business can clearly understand what level of recovery support can be offered at a consistent, manageable price point. This understanding helps the business stakeholders evaluate the value of an increased investment in recovery.

For most cloud management teams, an enhanced baseline with specific RPO/RTO commitments for various assets yields the most favorable path to mutual business commitments. The following sections outline a few common enhanced baselines that empower the business to easily add protection and recovery functionality through a repeatable process.
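One way to picture a baseline with enhanced commitments layered on top is as a default record plus per-asset overrides. The following is a rough sketch, not a prescribed implementation. The class name, asset names, and all RTO/RPO values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryCommitment:
    """An RTO/RPO pair that the cloud management team commits to."""
    rto: timedelta  # maximum time to restore service
    rpo: timedelta  # maximum window of data loss


# Hypothetical management baseline for non-mission-critical assets.
BASELINE = RecoveryCommitment(rto=timedelta(hours=24), rpo=timedelta(hours=24))

# Hypothetical enhanced commitments, agreed on with the business per asset.
ENHANCED = {
    "order-database": RecoveryCommitment(rto=timedelta(hours=1),
                                         rpo=timedelta(minutes=5)),
    "web-frontend-vm": RecoveryCommitment(rto=timedelta(minutes=15),
                                          rpo=timedelta(hours=24)),
}


def commitment_for(asset: str) -> RecoveryCommitment:
    """Return the enhanced commitment for an asset, or the baseline by default."""
    return ENHANCED.get(asset, BASELINE)


def exceeds_baseline(asset: str) -> bool:
    """Return True if the asset has a tighter RTO or RPO than the baseline."""
    commitment = commitment_for(asset)
    return commitment.rto < BASELINE.rto or commitment.rpo < BASELINE.rpo


if __name__ == "__main__":
    for name in ("order-database", "web-frontend-vm", "internal-wiki"):
        c = commitment_for(name)
        label = "enhanced" if exceeds_baseline(name) else "baseline"
        print(f"{name}: RTO={c.rto}, RPO={c.rpo} ({label})")
```

Keeping the baseline as the default means every asset has a commitment from day one, while each override records a deliberate, separately funded decision.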

Protect and recover data

Data is arguably the most valuable asset in the digital economy. The ability to protect and recover data more effectively is the most common enhanced baseline. For the data that powers a production workload, loss of data can be directly equated to loss in revenue or loss of profitability. We generally encourage cloud management teams to offer a level of enhanced management baseline that supports common data platforms.

Before cloud management teams implement platform operations, it's common for them to support improved operations for a platform as a service (PaaS) data platform. For instance, it's easy for a cloud management team to enforce a higher frequency of backup or multiregion replication for Azure SQL Database or Azure Cosmos DB solutions. Doing so allows the development team to easily improve RPO by modernizing their data platforms.

To learn more about this thought process, see Platform operations discipline.

Protect and recover VMs

Most workloads have some dependency on virtual machines, which host various aspects of the solution. For the workload to support a business process after a system failure, some virtual machines must be recovered quickly.

Every minute of downtime on those virtual machines could cause lost revenue or reduced profitability. When VM downtime has a direct impact on the fiscal performance of the business, RTO is very important. Virtual machines can be recovered more quickly by using replication to a secondary site and automated recovery, a model that's referred to as a hot-warm recovery model. At the highest state of recovery, virtual machines can be replicated to a fully functional secondary site. This more expensive approach is referred to as a high-availability, or hot-hot, recovery model.

Each of the preceding models reduces the RTO, resulting in a faster restoration of business process capabilities. However, each model also results in significantly increased cloud management costs.

Also note that replication for high availability doesn't replace backup. Backup should still be enabled for scenarios such as accidental deletion, data corruption, and ransomware attacks.

For more information about this thought process, see Workload operations discipline.

Next steps

After this management baseline component is met, the team can look ahead to avoid outages in its platform operations and workload operations.