High Availability and Disaster Recovery

Important

This version of Operations Manager has reached the end of support. We recommend that you upgrade to Operations Manager 2019.

System Center - Operations Manager servers and features can fail, which affects Operations Manager functionality. The amount of data and functionality lost during a failure differs for each failure scenario. It depends on the role of the failing feature and the length of time it takes to recover that feature.

High availability

High-availability needs are addressed by building redundancy into the management group for the Operations Manager operational and data warehouse databases, the gateway and management servers, and specific workloads. These workloads include network device monitoring, cross-platform monitoring, and management group-specific workloads that were previously managed by the Root Management Server.

The multiple-server, single management group configuration can use SQL Server Always On to provide high availability and service continuity for the Operations Manager databases. Management server fault tolerance is provided by having at least two management servers and by using resource pools to monitor UNIX servers, Linux servers, and network devices. Agent-based Windows servers can be configured with a primary and a secondary management server so that agent communications are redirected if a management server fails.
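The primary/secondary arrangement described above can be pictured with a small conceptual model. This is a minimal sketch, not Operations Manager code: the function name `select_management_server`, the server names, and the rule that failover servers are tried in list order are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Set


def select_management_server(primary: str,
                             failover: Iterable[str],
                             reachable: Set[str]) -> Optional[str]:
    """Return the server an agent would report to in this simplified model.

    The primary server is preferred. If it is unreachable, the failover
    servers are tried in the order given (an assumption of this sketch).
    None means no server is reachable, so the agent would queue its data.
    """
    for server in [primary, *failover]:
        if server in reachable:
            return server
    return None


# Hypothetical server names, for illustration only.
print(select_management_server("MS01", ["MS02", "MS03"], {"MS02", "MS03"}))
```

With only one management server there is nothing to fail over to, which is why at least two are needed for fault tolerance.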

If the management server that hosts the RMS Emulator becomes unavailable, the RMS Emulator can be moved to another management server.

Operations console connections can be made highly available by configuring high availability for the Data Access Service. This can be done by installing Microsoft Network Load Balancing (NLB), or by using a hardware-based load balancer or a DNS alias. One or more management servers are added as members of the NLB pool, and when you open the console, you reference the virtual name, registered in DNS, of the load-balanced management servers.

Note

A Network Load Balancer is not supported for the Operations Manager web console server.

Multiple gateway servers can be deployed across a trust boundary to provide redundant pathways for agents that lie across that trust boundary. Just as agents can fail over between a primary management server and one or more secondary management servers, they can also fail over between gateway servers. In addition, multiple gateway servers can be used to distribute the workload of managing agentless-managed computers and managed network devices.

In addition to providing redundancy through agent-gateway failover, gateway servers can be configured to fail over between management servers in a management group, if multiple management servers are available.

SQL Server Reporting Services supports a scale-out deployment model that lets you run multiple report server instances that share a single report server database, but this model is not supported with Operations Manager. Operations Manager Reporting installs a custom security extension as part of the setup of the front-end components, and this extension cannot be replicated across the web farm.

Disaster recovery

Disaster recovery relates to the measures taken to ensure that operations can be resumed if a catastrophic failure occurs (for example, loss of the entire data center that hosts the primary infrastructure). It is an important element that must be considered in any deployment, and the decisions made in planning for disaster recovery affect how Operations Manager can continue to support proactive monitoring and reporting of the performance and availability of your critical IT services. This section focuses on the recommended strategy for disaster recovery and resiliency and on the steps to take to ensure a smooth recovery.

While HA and DR solutions provide protection from system failure or system loss, they should not be relied on for protection from accidental, unintended, or malicious data loss or corruption. In these cases, backup copies or lagged replication copies might have to be used for restore operations. In many cases, a restore operation is the most appropriate form of DR. One example is a low-priority reporting database or analysis data. In many cases, the cost of enabling multisite DR at the system or application level far outweighs the value of the data. When the near-term value of the data is low and access to the data can be delayed without severe business impact if a failure or site DR event occurs, consider using simple backup and restore processes for DR if the cost savings warrant it.

Understanding the impact of downtime and your tolerance for it helps drive the decisions needed to properly design the Operations Manager architecture, and determines the level of complexity and cost required to support disaster recovery. Additionally, consider how much monitoring data the IT organization can afford to lose without business consequences. These two considerations are best described by two terms: recovery time objective (RTO) and recovery point objective (RPO).
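One way to make RTO and RPO concrete is to compare each candidate design's worst-case data loss and estimated recovery time against the targets. The following is a minimal sketch under stated assumptions: the `DrDesign` type, the `meets_objectives` helper, and every figure in it are hypothetical and are not measurements or guidance from Operations Manager.

```python
from dataclasses import dataclass


@dataclass
class DrDesign:
    name: str
    worst_case_data_loss_hours: float  # for example, the interval between backups
    estimated_recovery_hours: float    # time to return to an operable state


def meets_objectives(design: DrDesign, rpo_hours: float, rto_hours: float) -> bool:
    """Return True when the design satisfies both the RPO and the RTO."""
    return (design.worst_case_data_loss_hours <= rpo_hours
            and design.estimated_recovery_hours <= rto_hours)


# Hypothetical figures, for illustration only.
candidates = [
    DrDesign("backup and restore", 24, 48),
    DrDesign("standby components in secondary data center", 0.25, 8),
    DrDesign("duplicate management group", 0, 0.5),
]

# Example targets: lose at most 1 hour of data, recover within 12 hours.
acceptable = [d.name for d in candidates
              if meets_objectives(d, rpo_hours=1, rto_hours=12)]
print(acceptable)
```

A design that only meets the targets at much higher cost and complexity may still be the wrong choice, so the comparison is a starting point rather than a decision.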

The two most common disaster recovery design configurations for Operations Manager are:

  • Creating a duplicate management group, deployed to your secondary data center, that matches the primary management group in scale and configuration.
  • Deploying additional servers in a secondary data center to support the operational and data warehouse databases, with management servers deployed in a cold-standby configuration that do not participate in the management group until recovery actions need to be performed.

Deploying a duplicate management group is an option when there is no tolerance for downtime; however, it is the most complex option. The configuration of both management groups must be kept consistent so that when you cut over, there is no difference in what is monitored, alerted or reported, presented, and finally escalated. Integration with other monitoring platforms or ITSM platforms, such as System Center - Service Manager, Remedy, or ServiceNow, must also exist in both, and may need to be configured in an active/passive state to avoid duplication of incidents, configuration items, and so on. Agents will be multihomed between both management groups, so data will be duplicated.

The following diagram is an example of this design scenario.

Duplicate MG

If immediate recovery is not necessary for your Operations Manager deployment and you want to avoid the complexity of a duplicate management group, you can instead deploy additional management group components in your secondary data center to retain the functionality of your management group. At a minimum, consider implementing a SQL Server 2014 or 2016 Always On availability group to provide recovery of the operational and data warehouse databases between two or more data centers. In this design, a two-node failover cluster instance (FCI) is deployed in the primary data center and a standalone SQL Server is deployed in the secondary data center, all as part of a single Windows Server Failover Cluster (WSFC). The secondary replica for the Always On availability group is on the non-FCI standalone instance, as shown in the following diagram.

Simple DR Config

In this example, you would need to deploy one or more Windows Servers with the same hardware configuration and computer name, and reinstall the management server role by using the /Recover parameter. During this time, agents queue the data they collect (alerts, events, performance, and so on) until they can resume communication with a management server in the management group. This approach avoids installing new instances of SQL Server and restoring databases from your last known good backup. However, in this recovery scenario there is likely to be a longer delay in returning to an operable state, because you need to deploy the other roles required to resume minimum monitoring functionality.

If this approach isn't acceptable, you can deploy management servers in your secondary data center for standby recovery. Remove them as members of the three primary resource pools: All Management Servers Resource Pool, Notifications, and AD Assignment. Also remove them from any custom resource pools, which may include management servers hosted in the primary data center and which need to continue to function as part of the recovery plan. The System Center Data Access, System Center Configuration Management, and Microsoft Monitoring Agent services should be stopped and set to Manual or Disabled, and started only in a disaster recovery scenario.
If a management server supports an integration (through a connector hosted directly on the management server or from another System Center product such as VMM, Orchestrator, or Service Manager), you need to plan manual or automatic recovery steps for it, depending on the integration configuration and the sequence of recovery steps. This ensures that any other dependency on the management server is captured and planned for before the disaster recovery plan needs to be implemented.

Complex DR Config

If one site goes offline, the agent fails over to a management server in another site, assuming that the agent's failover configuration allows this. Reconfigure the Windows agents to cache only the management servers in your primary data center that should manage them. This prevents the agents from attempting to fail over to a management server in the secondary data center, which would only delay recovery and reporting. You can apply this configuration during installation by deploying the agent with a script (for example, VBScript or, better yet, PowerShell) that pre-configures it. If you push the agent from the console, you can apply it after deployment, again by using a scripted method managed with your enterprise configuration management solution.
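The selection logic behind such a script can be sketched as follows. This is a conceptual model only, assuming a hypothetical inventory of management servers tagged with their data center. The `ManagementServer` type, the `failover_candidates` helper, and the names used are illustrative, and applying the resulting list to real agents would be done with Operations Manager's own tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ManagementServer:
    name: str
    data_center: str


def failover_candidates(servers: Iterable[ManagementServer],
                        primary_name: str,
                        allowed_data_center: str) -> List[str]:
    """List failover servers for an agent, restricted to one data center.

    The agent's primary server is excluded, as is every server outside
    the allowed data center (for example, standby servers at the DR site).
    """
    return [s.name for s in servers
            if s.data_center == allowed_data_center and s.name != primary_name]


# Hypothetical inventory, for illustration only.
inventory = [
    ManagementServer("MS01", "primary"),
    ManagementServer("MS02", "primary"),
    ManagementServer("MS03", "primary"),
    ManagementServer("MS-DR01", "secondary"),
]

print(failover_candidates(inventory, primary_name="MS01",
                          allowed_data_center="primary"))
```

Keeping the allowed data center as an explicit parameter makes it straightforward to produce a different list during a planned cutover to the secondary site.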

Operations Manager can be deployed on Azure virtual machines as an alternative disaster recovery option to maintain continuity of the management group. SQL Server must also be deployed on a virtual machine in Azure rather than in a hybrid configuration, because latency between a management server and the SQL Server that hosts the Operations Manager databases negatively affects the performance of the management group.
To properly architect this scenario within Azure IaaS or another public cloud provider, consider the monitoring scope, the network topology and network connectivity to Microsoft Azure (that is, site-to-site VPN or ExpressRoute), integration points (that is, ITSM solutions, other System Center products, third-party add-ons, and so on), console access, and regulatory requirements or relevant laws and policies.