High availability and disaster recovery concepts in SharePoint Server

Summary: Understand high availability and disaster recovery concepts in SharePoint Server 2016 and SharePoint 2013 so you can choose the best strategy for your farm.

High availability and disaster recovery are the highest priorities when you create a plan and system specifications for a SharePoint Server farm. Other aspects of the plan, such as high performance and capacity, are negated if farm servers are not highly available or a farm cannot be recovered.

To design and implement an effective strategy that maintains efficient and uninterrupted operations, you should understand the basic concepts of high availability and disaster recovery. These concepts are also important when you evaluate and pick the best technical solutions for your SharePoint environment.

Introduction to business continuity management

Business continuity management is a management process or program that defines, assesses, and helps manage the risks to the continued running of an organization. Business continuity management focuses on creating and maintaining a business continuity plan, which is a roadmap for continuing operations when normal business operations are interrupted by adverse conditions. These conditions can be natural, man-made, or a combination of both. A continuity plan is derived from the following analyses and inputs:

  • A business impact analysis

  • A threat and risk analysis

  • A definition of the impact scenarios

  • A set of documented recovery requirements

The result is a solution design or identified options, an implementation plan, a testing and organization acceptance plan, and a maintenance plan or schedule.

An example of business continuity management is "Disaster recovery and protection for data and applications", which provides a snapshot of the business continuity program at Microsoft.

Information technology (IT) is a significant aspect of business continuity planning for many organizations. However, business continuity is more encompassing: it includes all the operations that are needed to make sure that an organization can continue to do business during and immediately after a major disruptive event. A business continuity plan includes, but is not limited to, the following elements:

  • Policies, processes, and procedures

  • Possible options and decision-making responsibility

  • Human resources and facilities

  • Information technology

Although high availability and disaster recovery are often equated with business continuity management, they are in fact subsets of it.

Describing high availability

For a given software application or service, high availability is ultimately measured in terms of the end user's experience and expectations. The tangible and perceived business impact of downtime may be expressed in terms of information loss, property damage, decreased productivity, opportunity costs, contractual damages, or the loss of goodwill.

The principal goal of a high availability solution is to minimize or mitigate the impact of downtime. A sound strategy for this optimally balances business processes and Service Level Agreements (SLAs) with technical capabilities and infrastructure costs.

A platform is considered highly available when it meets the availability that customers and stakeholders have agreed to and expect. The availability of a system can be expressed as this calculation:

(Actual uptime / Expected uptime) × 100%

The resulting value is often expressed in the industry as the number of 9's that the solution provides, which conveys the annual number of minutes of possible uptime or, conversely, minutes of downtime.

Number of 9's    Availability percentage    Total annual downtime
2                99%                        3 days, 15 hours
3                99.9%                      8 hours, 45 minutes
4                99.99%                     52 minutes, 34 seconds
5                99.999%                    5 minutes, 15 seconds
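
The figures in the table can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes a 365-day year with the expected uptime covering the entire year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap day


def availability_percent(actual_uptime, expected_uptime):
    """Availability = (actual uptime / expected uptime) x 100%."""
    if expected_uptime <= 0:
        raise ValueError("expected_uptime must be positive")
    return 100.0 * actual_uptime / expected_uptime


def annual_downtime_seconds(nines):
    """Downtime allowed per year for a number of 9's (2 -> 99%, 3 -> 99.9%, ...)."""
    return SECONDS_PER_YEAR / 10 ** nines


def format_duration(seconds):
    """Render a duration as days, hours, minutes, and seconds (rounded to 1 s)."""
    total = round(seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


for nines in range(2, 6):
    print(nines, format_duration(annual_downtime_seconds(nines)))
```

Under these assumptions, four 9's works out to roughly 52 minutes and 34 seconds per year, in line with the table; the table rows for two and three 9's omit the trailing 36 minutes and 36 seconds respectively.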

Planned versus unplanned downtime

System outages are either anticipated and planned for, or they are the result of an unplanned failure. Downtime need not be considered negatively if it is appropriately managed. There are two key types of foreseeable downtime:

  • Planned maintenance. A time window is preannounced and coordinated for planned maintenance tasks such as software patching, hardware upgrades, password updates, offline re-indexing, data loading, or the rehearsal of disaster recovery procedures. Deliberate, well-managed operational procedures should minimize downtime and prevent any data loss. Planned maintenance activities can be seen as investments needed to prevent or mitigate other, potentially more severe, unplanned outage scenarios.

  • Unplanned outage. System-level, infrastructure, or process failures may occur that are unplanned or uncontrollable, or that are foreseeable but considered either too unlikely to occur or to have an acceptable impact. A robust high availability solution detects these types of failures, automatically recovers from the outage, and then reestablishes fault tolerance.

When establishing SLAs for high availability, you should calculate separate key performance indicators (KPIs) for planned maintenance activities and unplanned downtime. This approach allows you to contrast your investment in planned maintenance activities against the benefit of avoiding unplanned downtime.

Degraded availability

High availability should not be considered an all-or-nothing proposition. As an alternative to a complete outage, it is often acceptable to the end user for a system to be partially available, or to have limited functionality or degraded performance. These varying degrees of availability include:

  • Read-only and deferred operations. During a maintenance window, or during a phased disaster recovery, data retrieval is still possible, but new workflows and background processing may be temporarily halted or queued.

  • Data latency and application responsiveness. Due to a heavy workload, a processing backlog, or a partial platform failure, limited hardware resources may be over-committed or under-sized. User experience may suffer, but work may still get done in a less productive manner.

  • Partial, transient, or impending failures. Robustness in the application logic or hardware stack may allow the system to retry or self-correct when it encounters an error. These types of issues may appear to the end user as data latency or poor application responsiveness.

  • Partial end-to-end failure. Planned or unplanned outages may occur gracefully within vertical layers of the solution stack (infrastructure, platform, and application), or horizontally between different functional components. Users may experience partial success or degradation, depending upon the features or components that are affected.

The acceptability of these suboptimal scenarios should be considered as part of a spectrum of degraded availability leading up to a complete outage, and as intermediate steps in a phased disaster recovery.

Quantifying downtime

When downtime does occur, whether planned or unplanned, the primary business goal is to bring the system back online and minimize data loss. Every minute of downtime has direct and indirect costs. With unplanned downtime, you must balance the time and effort needed to determine why the outage occurred, what the current system state is, and what steps are needed to recover from the outage.

At a predetermined point in any outage, you should make or seek the business decision to stop investigating the outage or performing maintenance tasks, recover from the outage by bringing the system back online, and, if needed, reestablish fault tolerance.

Recovery objectives

Data redundancy is a key component of a high availability database solution. Transactional activity on your primary SQL Server instance is synchronously or asynchronously applied to one or more secondary instances. When an outage occurs, transactions that were in flight may be rolled back, or they may be lost on the secondary instances due to delays in data propagation.

You can both measure the impact and set recovery goals in terms of how long it takes to get back in business and how far the last recovered transaction lags behind the last committed one:

  • Recovery Time Objective (RTO). This is the duration of the outage. The initial goal is to get the system back online in at least a read-only capacity to facilitate investigation of the failure. However, the primary goal is to restore full service to the point that new transactions can take place.

  • Recovery Point Objective (RPO). This is often referred to as a measure of acceptable data loss. It is the time gap or latency between the last committed data transaction before the failure and the most recent data recovered after the failure. The actual data loss can vary depending upon the workload on the system at the time of the failure, the type of failure, and the type of high availability solution used.

    Note

    A related objective is the recovery level objective (RLO). This objective defines the granularity with which you must be able to recover data: whether you must be able to recover the whole farm, Web application, site collection, site, list or library, or item. For more information, see Plan for backup and recovery in SharePoint Server.

You should use RTO and RPO values as goals that indicate business tolerance for downtime and acceptable data loss, and as metrics for monitoring availability health.
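
As a rough illustration of how the two definitions above translate into measurements, the following Python sketch derives the achieved recovery time and the data-loss window from four timestamps and compares them with the objectives. The function names, timestamps, and thresholds are all hypothetical and are not part of any SharePoint Server or SQL Server API.

```python
from datetime import datetime, timedelta


def recovery_metrics(outage_start, service_restored, last_committed, last_recovered):
    """Return (recovery time, data-loss window) for a single outage.

    recovery time    = when full service was restored minus when the outage began
    data-loss window = last committed transaction before the failure minus the
                       most recent transaction recovered after the failure
    """
    recovery_time = service_restored - outage_start
    data_loss_window = last_committed - last_recovered
    return recovery_time, data_loss_window


def meets_objectives(recovery_time, data_loss_window, rto, rpo):
    """Return (RTO met?, RPO met?) as a pair of booleans."""
    return recovery_time <= rto, data_loss_window <= rpo


# Hypothetical outage: 40 minutes of downtime, 30 seconds of lost transactions.
recovery_time, data_loss = recovery_metrics(
    outage_start=datetime(2024, 1, 10, 9, 0, 0),
    service_restored=datetime(2024, 1, 10, 9, 40, 0),
    last_committed=datetime(2024, 1, 10, 8, 59, 50),
    last_recovered=datetime(2024, 1, 10, 8, 59, 20),
)
rto_met, rpo_met = meets_objectives(
    recovery_time, data_loss, rto=timedelta(hours=1), rpo=timedelta(seconds=15)
)
print(f"recovery time {recovery_time}, RTO met: {rto_met}")
print(f"data-loss window {data_loss}, RPO met: {rpo_met}")
```

In this made-up case the one-hour RTO is met, but the 15-second RPO is missed because 30 seconds of committed work never reached the secondary.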

Justifying ROI or opportunity cost

The business costs of downtime may be either financial or in the form of customer goodwill. These costs may accrue with time, or they may be incurred at a certain point in the outage window. In addition to projecting the cost of incurring an outage with a given recovery time and data recovery point, you can also calculate the business process and infrastructure investments needed to attain your RTO and RPO goals or to avoid the outage altogether. These investment themes should include:

  • Avoiding downtime. Outage recovery costs are avoided altogether if an outage doesn't occur in the first place. Investments include the cost of fault-tolerant and redundant hardware or infrastructure, distributing workloads across isolated points of failure, and planned downtime for preventive maintenance.

  • Automating recovery. If a system failure occurs, you can greatly mitigate the impact of downtime on the customer experience through automatic and transparent recovery.

  • Resource utilization. Secondary or standby infrastructure can sit idle, awaiting an outage. It can also be used for read-only workloads, or to improve overall system performance by distributing workloads across all available hardware.

For given RTO and RPO goals, the needed availability and recovery investments, combined with the projected costs of downtime, can be expressed and justified as a function of time. During an actual outage, this allows you to make cost-based decisions based on the elapsed downtime.
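
One way to picture downtime cost as a function of time is a simple model with a cost that accrues per minute plus one-time costs incurred at certain points in the outage window. The Python sketch below is only an illustration under that assumption; the function names and every figure in the example are hypothetical, and a real cost model would come from your own business impact analysis.

```python
def downtime_cost(elapsed_minutes, cost_per_minute, step_costs=()):
    """Projected cost of an outage after elapsed_minutes.

    cost_per_minute models costs that accrue with time.
    step_costs is an iterable of (minute_threshold, amount) pairs modeling
    costs incurred at a certain point in the outage window.
    """
    cost = elapsed_minutes * cost_per_minute
    for threshold, amount in step_costs:
        if elapsed_minutes >= threshold:
            cost += amount
    return cost


def break_even_minute(investment, cost_per_minute, step_costs=(), horizon_minutes=24 * 60):
    """First whole minute at which outage cost reaches the investment, else None."""
    for minute in range(horizon_minutes + 1):
        if downtime_cost(minute, cost_per_minute, step_costs) >= investment:
            return minute
    return None


# Hypothetical figures: 500 per minute, plus a 20,000 penalty once an outage
# passes the 60-minute mark, compared with a 40,000 availability investment.
penalties = [(60, 20_000)]
print(downtime_cost(30, 500, penalties))
print(break_even_minute(40_000, 500, penalties))
```

With a model like this, the elapsed downtime at any moment maps directly to a projected cost, which can be compared with the cost of the investment that would have shortened or avoided the outage.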

Monitoring availability health

From an operational point of view, during an actual outage, you should not attempt to consider all relevant variables and calculate ROI or opportunity costs in real time. Instead, you should monitor data latency on your standby instances as a proxy for expected RPO.

In the event of an outage, you should also limit the initial time spent investigating the root cause, and instead focus on validating the health of your recovery environment. You can then rely upon detailed system logs and secondary copies of data for subsequent forensic analysis.
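
Using standby data latency as a proxy for expected RPO can be as simple as comparing each measured latency with the RPO threshold. The Python sketch below assumes the latency values are already collected by your monitoring tooling; the instance names, the sample values, and the `warn_ratio` early-warning fraction are hypothetical.

```python
from datetime import timedelta


def rpo_status(latency, rpo, warn_ratio=0.8):
    """Classify a standby's data latency against the RPO.

    warn_ratio is a hypothetical early-warning fraction of the RPO.
    """
    if latency > rpo:
        return "breach"
    if latency >= rpo * warn_ratio:
        return "warning"
    return "ok"


# Hypothetical latency samples, one per standby instance.
rpo = timedelta(seconds=60)
samples = {
    "standby-1": timedelta(seconds=5),
    "standby-2": timedelta(seconds=50),
    "standby-3": timedelta(seconds=90),
}
for name, latency in samples.items():
    print(name, latency, rpo_status(latency, rpo))
```

A status other than "ok" indicates that a failover at that moment would likely lose more data than the business has agreed to tolerate, or is close to doing so.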

Planning for disaster recovery

While high availability efforts entail what you do to prevent an outage, disaster recovery efforts address what is done to reestablish high availability after the outage.

As much as possible, disaster recovery procedures and responsibilities should be formulated before an actual outage occurs. Based upon active monitoring and alerts, the decision to initiate an automated or manual failover and recovery plan should be tied to pre-established RTO and RPO thresholds. The scope of a sound disaster recovery plan should include:

  • Granularity of failure and recovery. Depending upon the location and type of failure, you can take corrective action at different levels; that is, data center, infrastructure, platform, application, or workload.

  • Investigative source material. Baseline and recent monitoring history, system alerts, event logs, and diagnostic queries should all be readily accessible by appropriate parties.

  • Coordination of dependencies. Within the application stack, and across stakeholders, what are the system and business dependencies?

  • Decision tree. A predetermined, repeatable, validated decision tree that includes role responsibilities, fault triage, failover criteria in terms of RPO and RTO goals, and prescribed recovery steps.

  • Validation. After taking steps to recover from the outage, what must be done to verify that the system has returned to normal operations?

  • Documentation. Capture all of the above items in a set of documentation, with sufficient detail and clarity so that a third-party team can execute the recovery plan with minimal assistance. This type of documentation is commonly called a 'run book' or a 'cook book'.

  • Recovery rehearsals. Regularly exercise the disaster recovery plan to establish baseline expectations for RTO goals, and consider regularly rotating which site hosts primary production, across the primary site and each of the disaster recovery sites.

See also

Concepts

Choose a disaster recovery strategy for SharePoint Server