Protect and recover in cloud management

After meeting the requirements for inventory and visibility and for operational compliance, cloud management teams can anticipate and prepare for potential workload outages. When planning for cloud management, teams must start from the assumption that something will fail.

No technical solution can consistently offer a 100 percent uptime SLA. Solutions with the most redundant architectures claim to deliver "six 9s", or 99.9999 percent uptime. But even a "six 9s" solution is down for about 31.6 seconds in any given year. And it's rare for a solution to warrant the large, ongoing operational investment that's required to reach "six 9s" of uptime.
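The downtime figure follows directly from the uptime percentage. The sketch below is only an illustration of that arithmetic; the function name and the choice of an average 365.25-day year are assumptions, not part of any SLA definition.

```python
# Average year length in seconds (365.25 days), an assumption for this sketch.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def annual_downtime_seconds(uptime_percent: float) -> float:
    """Return the downtime per year that a given uptime percentage still allows."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


# Compare a few common availability levels.
for label, percent in [("three 9s", 99.9), ("four 9s", 99.99), ("six 9s", 99.9999)]:
    print(f"{label}: {annual_downtime_seconds(percent):.1f} seconds per year")
```

For 99.9999 percent, this works out to roughly 31.6 seconds, which matches the figure above.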

Preparation for an outage allows the team to detect failures sooner and recover more quickly. The focus of this discipline is on the steps that come immediately after a system fails. How do you protect workloads, so that they can be recovered quickly when an outage occurs?

Translate protection and recovery conversations

The workloads that power business operations consist of applications, data, virtual machines (VMs), and other assets. Each of those assets might require a different approach to protection and recovery. The important aspect of this discipline is to establish a consistent commitment within the management baseline, which gives business discussions a starting point.

At a minimum, each asset that supports any given workload should have a baseline approach with a clear commitment to speed of recovery (recovery time objectives, or RTO) and risk of data loss (recovery point objectives, or RPO).

Recovery time objectives (RTO)

When disaster strikes, a recovery time objective is the amount of time it should take to recover any system to its state prior to the disaster. For each workload, that includes the time required to restore the minimum necessary functionality for the VMs and applications. It also includes the time required to restore the data that the applications need.

In business terms, RTO represents the amount of time that the business process will be out of service. For mission-critical workloads, this variable should be relatively low, allowing the business processes to resume quickly. For lower-priority workloads, a standard level of RTO might not have a noticeable impact on company performance.

The management baseline should establish a standard RTO for non-mission-critical workloads. The business can then use that baseline as a reference point to justify additional investments in faster recovery times.

Recovery point objectives (RPO)

In most cloud management systems, data is periodically captured and stored through some form of data protection. The last time data was captured is referred to as a recovery point. When a system fails, it can be restored only to the most recent recovery point.

If a system has a recovery point objective that's measured in hours or days, a system failure would result in the loss of the data created in the hours or days between the last recovery point and the outage. A one-day RPO would theoretically result in the loss of all transactions in the day leading up to the failure.
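A minimal sketch of that relationship: the data at risk is whatever changed between the last recovery point and the start of the outage. The function names and the example timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, outage_start: datetime) -> timedelta:
    """Return the span of changes that cannot be restored after an outage."""
    if outage_start < last_recovery_point:
        raise ValueError("the outage cannot start before the last recovery point")
    return outage_start - last_recovery_point


def meets_rpo(last_recovery_point: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """Check whether the actual data loss stays within the committed RPO."""
    return data_loss_window(last_recovery_point, outage_start) <= rpo


# Hypothetical example: a nightly backup at 02:00 and an outage at 23:30 the same day.
last_backup = datetime(2024, 1, 1, 2, 0)
outage = datetime(2024, 1, 1, 23, 30)

print("Data at risk:", data_loss_window(last_backup, outage))
print("Within a one-day RPO:", meets_rpo(last_backup, outage, timedelta(days=1)))
print("Within a one-hour RPO:", meets_rpo(last_backup, outage, timedelta(hours=1)))
```

In this example the nightly backup satisfies a one-day RPO but not a one-hour RPO, so a tighter commitment would require more frequent recovery points.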

For mission-critical systems, an RPO that's measured in minutes or seconds might be more appropriate to avoid a loss in revenue. But a shorter RPO generally results in an increase in overall management costs.

To help minimize costs, a management baseline should focus on the longest acceptable RPO. The cloud management team can then shorten the RPO for specific platforms or workloads that warrant more investment.
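One way to picture a baseline with per-workload exceptions is a default commitment plus a small table of overrides. The sketch below is illustrative only: the workload names and the RTO and RPO values are hypothetical placeholders, not recommended targets.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryCommitment:
    rto: timedelta  # how long recovery may take
    rpo: timedelta  # how much recent data may be lost


# Hypothetical baseline: the longest RTO and RPO the business accepts by default.
BASELINE = RecoveryCommitment(rto=timedelta(hours=24), rpo=timedelta(hours=24))

# Hypothetical enhanced commitments for workloads that justify more investment.
ENHANCED = {
    "order-processing": RecoveryCommitment(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
}


def commitment_for(workload: str) -> RecoveryCommitment:
    """Return the enhanced commitment if one exists, otherwise the baseline."""
    return ENHANCED.get(workload, BASELINE)


print(commitment_for("order-processing"))
print(commitment_for("internal-wiki"))
```

Keeping the default in one place makes each exception an explicit, reviewable decision to spend more on a specific workload.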

Protect and recover workloads

Most of the workloads in an IT environment support a specific business or technical process. Systems that don't have a systemic impact on business operations often don't warrant the increased investments required to recover quickly or minimize data loss. By establishing a baseline, the business can clearly understand what level of recovery support can be offered at a consistent, manageable price point. This understanding helps the business stakeholders evaluate the value of an increased investment in recovery.

For most cloud management teams, an enhanced baseline with specific RPO/RTO commitments for various assets yields the most favorable path to mutual business commitments. The following sections outline a few common enhanced baselines that let the business add protection and recovery functionality through a repeatable process.

Protect and recover data

Data is arguably the most valuable asset in the digital economy. The ability to protect and recover data more effectively is the most common enhanced baseline. For the data that powers a production workload, loss of data can be directly equated to loss in revenue or loss of profitability. We generally encourage cloud management teams to offer a level of enhanced management baseline that supports common data platforms.

Before cloud management teams implement platform operations, it's common for them to support improved operations for a platform as a service (PaaS) data platform. For instance, it's easy for a cloud management team to enforce a higher frequency of backup or multiregion replication for Azure SQL Database or Azure Cosmos DB solutions. Doing so allows the development team to easily improve RPO by modernizing their data platforms.

To learn more about this thought process, see Platform operations discipline.

Protect and recover VMs

Most workloads have some dependency on virtual machines, which host various aspects of the solution. For the workload to support a business process after a system failure, some virtual machines must be recovered quickly.

Every minute of downtime on those virtual machines could cause lost revenue or reduced profitability. When VM downtime has a direct impact on the fiscal performance of the business, RTO is very important. Virtual machines can be recovered more quickly by using replication to a secondary site and automated recovery, a model that's referred to as a hot-warm recovery model. At the highest state of recovery, virtual machines can be replicated to a fully functional secondary site. This more expensive approach is referred to as a high-availability, or hot-hot, recovery model.

Each of the preceding models reduces the RTO, resulting in a faster restoration of business process capabilities. However, each model also results in significantly increased cloud management costs.
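The trade-off between RTO and cost can be sketched as picking the least expensive model that still meets a workload's RTO. Everything in this sketch is an assumption for illustration: the "baseline restore from backup" entry stands in for the standard baseline, and the recovery times are placeholders, not measured or vendor-published figures.

```python
from datetime import timedelta

# (model name, assumed typical recovery time), ordered from least to most expensive.
# The recovery times are hypothetical placeholders for illustration only.
RECOVERY_MODELS = [
    ("baseline restore from backup", timedelta(hours=24)),
    ("hot-warm (replication plus automated recovery)", timedelta(hours=1)),
    ("hot-hot (fully functional secondary site)", timedelta(minutes=1)),
]


def cheapest_model_for(rto: timedelta) -> str:
    """Return the least expensive model whose assumed recovery time meets the RTO."""
    for name, typical_recovery in RECOVERY_MODELS:
        if typical_recovery <= rto:
            return name
    raise ValueError("no listed recovery model meets this RTO")


print(cheapest_model_for(timedelta(days=2)))
print(cheapest_model_for(timedelta(hours=4)))
print(cheapest_model_for(timedelta(minutes=5)))
```

Because the list is ordered by cost, a workload moves to a more expensive model only when its RTO can't be met by a cheaper one.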

Also note that, apart from replication for high availability, backup should be enabled for scenarios such as accidental deletion, data corruption, and ransomware attacks.

For more information about this thought process, see Workload operations discipline.

Next steps

After this management baseline component is met, the team can look ahead to avoid outages in its platform operations and workload operations.