Failover and patching for Azure Cache for Redis

To build resilient and successful client applications, it's critical to understand failover in the context of the Azure Cache for Redis service. A failover can be part of a planned management operation, or it might be caused by an unplanned hardware or network failure. A common cause of cache failover is the management service patching the Azure Cache for Redis binaries. This article covers what a failover is, how it occurs during patching, and how to build a resilient client application.

What is a failover?

Let's start with an overview of failover for Azure Cache for Redis.

A quick summary of cache architecture

A cache is constructed of multiple virtual machines with separate, private IP addresses. Each virtual machine, also known as a node, is connected to a shared load balancer with a single virtual IP address. Each node runs the Redis server process and is accessible by means of the host name and the Redis ports. Each node is considered either a primary or a replica node. When a client application connects to a cache, its traffic goes through this load balancer and is automatically routed to the primary node.

In a Basic cache, the single node is always a primary. In a Standard or Premium cache, there are two nodes: one is chosen as the primary and the other is the replica. Because Standard and Premium caches have multiple nodes, one node might be unavailable while the other continues to process requests. Clustered caches are made of many shards, each with distinct primary and replica nodes. One shard might be down while the others remain available.

Note

A Basic cache doesn't have multiple nodes and doesn't offer a service-level agreement (SLA) for its availability. Basic caches are recommended only for development and testing purposes. Use a Standard or Premium cache for a multi-node deployment, to increase availability.

Explanation of a failover

A failover occurs when a replica node promotes itself to become a primary node, and the old primary node closes existing connections. After the old primary node comes back up, it notices the change in roles and demotes itself to become a replica. It then connects to the new primary and synchronizes data. A failover might be planned or unplanned.

A planned failover takes place during system updates, such as Redis patching or OS upgrades, and during management operations, such as scaling and rebooting. Because the nodes receive advance notice of the update, they can cooperatively swap roles and quickly notify the load balancer of the change. A planned failover typically finishes in less than 1 second.

An unplanned failover might happen because of hardware failure, network failure, or other unexpected outages of the primary node. The replica node promotes itself to primary, but the process takes longer. A replica node must first detect that its primary node is not available before it can initiate the failover process. The replica node must also verify that this unplanned failure is not transient or local, to avoid an unnecessary failover. This delay in detection means that an unplanned failover typically finishes within 10 to 15 seconds.

How does patching occur?

The Azure Cache for Redis service regularly updates your cache with the latest platform features and fixes. To patch a cache, the service follows these steps:

  1. The management service selects one node to be patched.
  2. If the selected node is a primary node, the corresponding replica node cooperatively promotes itself. This promotion is considered a planned failover.
  3. The selected node reboots to take the new changes and comes back up as a replica node.
  4. The replica node connects to the primary node and synchronizes data.
  5. When the data sync is complete, the patching process repeats for the remaining nodes.
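
The sequence above can be sketched as a small in-memory simulation. This is only an illustration for a two-node cache: the Node class, the patch_cache function, and the node names are hypothetical and aren't part of any Azure API, and the reboot and data sync are reduced to log entries.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "primary" or "replica"
    patched: bool = False


def patch_cache(nodes):
    """Patch a two-node cache one node at a time and return an event log."""
    log = []
    for node in nodes:                       # step 1: select one node
        if node.role == "primary":           # step 2: planned failover
            peer = next(n for n in nodes if n is not node)
            peer.role = "primary"
            log.append(f"failover: {peer.name} promoted")
        node.role = "replica"                # step 3: reboot, return as replica
        node.patched = True
        log.append(f"patched: {node.name}")
        log.append(f"synced: {node.name}")   # step 4: sync from the primary
    return log                               # step 5: the loop covers the rest


nodes = [Node("node-a", "primary"), Node("node-b", "replica")]
events = patch_cache(nodes)
```

In this sketch, the number of failovers depends on which node is selected first: starting with the primary produces two failovers, and starting with the replica produces one. In both cases exactly one node is the primary at every step.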

Because patching is a planned failover, the replica node quickly promotes itself to become a primary and begins servicing requests and new connections. Basic caches don't have a replica node and are unavailable until the update is complete. Each shard of a clustered cache is patched separately and won't close connections to another shard.

Important

Nodes are patched one at a time to prevent data loss. Basic caches, which have only a single node, will lose data. Clustered caches are patched one shard at a time.

Multiple caches in the same resource group and region are also patched one at a time. Caches that are in different resource groups or different regions might be patched simultaneously.

Because full data synchronization happens before the process repeats, data loss is unlikely to occur when you use a Standard or Premium cache. You can further guard against data loss by exporting data and enabling persistence.

Additional cache load

Whenever a failover occurs, the Standard and Premium caches need to replicate data from one node to the other. This replication causes some load increase in both server memory and CPU. If the cache instance is already heavily loaded, client applications might experience increased latency. In extreme cases, client applications might receive time-out exceptions. To help mitigate the impact of this additional load, configure the cache's maxmemory-reserved setting.

How does a failover affect my client application?

The number of errors seen by the client application depends on how many operations were pending on that connection at the time of the failover. Any connection that's routed through the node that closed its connections will see errors. Many client libraries can throw different types of errors when connections break, including time-out exceptions, connection exceptions, or socket exceptions. The number and type of exceptions depend on where in the code path the request is when the cache closes its connections. For instance, an operation that has sent a request but hasn't received a response when the failover occurs might get a time-out exception. New requests on the closed connection object receive connection exceptions until the reconnection succeeds.

Most client libraries attempt to reconnect to the cache if they're configured to do so. However, unforeseen bugs can occasionally place the library objects into an unrecoverable state. If errors persist for longer than a preconfigured amount of time, the connection object should be recreated. In .NET and other object-oriented languages, you can recreate the connection without restarting the application by using a Lazy<T> pattern.
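
As a rough illustration of recreating a connection object after persistent errors, here is a minimal Python sketch that plays the role of the Lazy<T> pattern. The ReconnectingClient class, the factory callable, and the 30-second error window are assumptions made for this example, not part of any Redis client library. The application is expected to call report_error and report_success itself.

```python
import threading
import time


class ReconnectingClient:
    """Lazily creates a connection and rebuilds it after persistent errors."""

    def __init__(self, factory, error_window=30.0, clock=time.monotonic):
        self._factory = factory            # hypothetical: returns a new connection
        self._error_window = error_window  # seconds of continuous errors tolerated
        self._clock = clock
        self._lock = threading.Lock()
        self._conn = None
        self._first_error = None           # start of the current error streak

    def get(self):
        with self._lock:
            if self._conn is None:         # lazy creation on first use
                self._conn = self._factory()
            return self._conn

    def report_success(self):
        with self._lock:
            self._first_error = None       # a success ends the error streak

    def report_error(self):
        """Record a failure; return True if the connection was discarded."""
        with self._lock:
            now = self._clock()
            if self._first_error is None:
                self._first_error = now
                return False
            if now - self._first_error < self._error_window:
                return False
            old, self._conn, self._first_error = self._conn, None, None
        close = getattr(old, "close", None)
        if callable(close):
            close()                        # best-effort cleanup of the old object
        return True
```

After report_error returns True, the next call to get builds a fresh connection through the factory, so the application recovers without a restart. The error window should be long enough for the library's own reconnect logic to succeed first.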

How do I make my application resilient?

Because you can't avoid failovers completely, write your client applications to be resilient to connection breaks and failed requests. Although most client libraries automatically reconnect to the cache endpoint, few of them attempt to retry failed requests. Depending on the application scenario, it might make sense to use retry logic with backoff.
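
Retry logic with backoff might look like the following minimal Python sketch. The retry_with_backoff helper, its default values, and the choice of exponential backoff with full jitter are assumptions for illustration. The built-in ConnectionError and TimeoutError stand in for whatever exceptions your client library raises.

```python
import random
import time


def retry_with_backoff(operation, retryable=(ConnectionError, TimeoutError),
                       attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise                       # out of attempts: surface the error
            # Exponential backoff: the upper bound doubles on each attempt.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))   # jitter spreads out client retries
```

With these defaults, the waits add up to at most about 1.5 seconds, which covers a planned failover but not a typical unplanned one. Raise attempts or max_delay if the application should ride out a 10 to 15 second failover. Retrying is safest for idempotent operations, because a request that timed out might already have been applied by the server.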

To test a client application's resiliency, use a reboot as a manual trigger for connection breaks. Additionally, we recommend that you schedule updates on a cache, so that the management service applies Redis runtime patches during specified weekly windows. Choose windows when client application traffic is typically low, to avoid potential incidents.

Client network-configuration changes

Certain client-side network-configuration changes can trigger "No connection available" errors. Such changes might include:

  • Swapping a client application's virtual IP address between staging and production slots.
  • Scaling the size or number of instances of your application.

Such changes can cause a connectivity issue that lasts less than one minute. Your client application will probably lose its connection to other external network resources in addition to the Azure Cache for Redis service.

Next steps