Azure Kubernetes Service (AKS) node auto-repair

AKS continuously monitors the health state of worker nodes and performs automatic node repair if they become unhealthy. The Azure virtual machine (VM) platform performs maintenance on VMs experiencing issues.

AKS and Azure VMs work together to minimize service disruptions for clusters.

In this document, you'll learn how automatic node repair functionality behaves for both Windows and Linux nodes.

How AKS checks for unhealthy nodes

AKS uses the following rules to determine if a node is unhealthy and needs repair:

  • The node reports NotReady status on consecutive checks within a 10-minute timeframe.
  • The node doesn't report any status within 10 minutes.
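
The two rules above can be sketched as a small predicate. This is an illustrative model only, not the AKS implementation: the `StatusReport` type and the `needs_repair` function are hypothetical names, and the reading of "consecutive checks" as "at least two reports in the window, all of them NotReady" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence

# The 10-minute timeframe named in the rules above.
WINDOW = timedelta(minutes=10)


@dataclass(frozen=True)
class StatusReport:
    """Hypothetical record of one health check for a node."""
    timestamp: datetime
    ready: bool


def needs_repair(reports: Sequence[StatusReport], now: datetime) -> bool:
    """Apply the two rules to a node's reports (illustrative only)."""
    recent = [r for r in reports if now - WINDOW <= r.timestamp <= now]
    if not recent:
        # Rule 2: no status reported within the last 10 minutes.
        return True
    # Rule 1 (assumed reading): two or more checks in the window,
    # and every one of them reported NotReady.
    return len(recent) >= 2 and all(not r.ready for r in recent)
```

Under this reading, a node with a single NotReady report in the window is not yet flagged, because one check cannot be "consecutive".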

You can manually check the health state of your nodes with kubectl.

kubectl get nodes
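
To list only the nodes that are not ready, the same command can be run with JSON output and filtered. The sketch below assumes kubectl is on the PATH and configured for the cluster. The helper names `fetch_nodes` and `not_ready_nodes` are hypothetical. Treating a node with no Ready condition as not ready is a choice made for this example.

```python
import json
import subprocess
from typing import List


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json` and return the parsed node list."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def not_ready_nodes(node_list: dict) -> List[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # The Ready condition status is "True", "False", or "Unknown".
        # A missing condition is counted as not ready (assumption).
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names
```

Calling `not_ready_nodes(fetch_nodes())` against a live cluster returns the node names that would show a status other than Ready in the plain `kubectl get nodes` output.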

How automatic repair works


AKS initiates repair operations with the user account aks-remediator.

If AKS identifies an unhealthy node that remains unhealthy for 10 minutes, AKS takes the following actions:

  1. Reboot the node.
  2. If the reboot is unsuccessful, reimage the node.
  3. If the reimage is unsuccessful, create and reimage a new node.
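
The escalation order above can be modeled as a list of steps tried in sequence, stopping at the first one that succeeds. This is a minimal sketch of that control flow, not AKS code: `run_repair` and the step callables are hypothetical, and each callable is assumed to return True on success.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step is a label plus a callable that attempts the action on a node
# and returns True on success (hypothetical interface).
RepairStep = Tuple[str, Callable[[str], bool]]


def run_repair(node: str, steps: Sequence[RepairStep]) -> Optional[str]:
    """Try each step in order and return the label of the first success.

    Returns None if every step fails, which corresponds to the case
    where automatic repair was unsuccessful.
    """
    for label, action in steps:
        if action(node):
            return label
    return None
```

A caller would pass the steps in the documented order, for example `[("reboot", reboot), ("reimage", reimage), ("replace", replace)]`, where the three callables are placeholders for whatever performs each action.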

If auto-repair is unsuccessful, AKS engineers investigate alternative remediations.

If AKS finds multiple unhealthy nodes during a health check, each node is repaired individually before another repair begins.
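
The one-at-a-time behavior can be illustrated with a simple blocking loop. This is a sketch of the ordering only: `repair_sequentially` and the `repair` callable are hypothetical names, and the callable is assumed to block until the repair of its node has finished.

```python
from typing import Callable, Dict, Iterable


def repair_sequentially(
    nodes: Iterable[str], repair: Callable[[str], bool]
) -> Dict[str, bool]:
    """Repair nodes one at a time and record each outcome."""
    results: Dict[str, bool] = {}
    for node in nodes:
        # The call blocks, so the next repair cannot begin until
        # this one has returned.
        results[node] = repair(node)
    return results
```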

Next steps

Use Availability Zones to increase the availability of your AKS cluster workloads.