Azure Kubernetes Service (AKS) node auto-repair

AKS continuously monitors the health state of worker nodes and performs automatic node repair if they become unhealthy. The Azure virtual machine (VM) platform performs maintenance on VMs experiencing issues.

AKS and Azure VMs work together to minimize service disruptions for clusters.

In this document, you'll learn how automatic node repair functionality behaves for both Windows and Linux nodes.

How AKS checks for unhealthy nodes

AKS uses the following rules to determine if a node is unhealthy and needs repair:

  • The node reports NotReady status on consecutive checks within a 10-minute timeframe.
  • The node doesn't report any status within 10 minutes.

You can manually check the health state of your nodes with kubectl.

kubectl get nodes

How automatic repair works

Note

AKS initiates repair operations with the user account aks-remediator.

If AKS identifies an unhealthy node that remains unhealthy for 10 minutes, AKS takes the following actions:

  1. Reboot the node.
  2. If the reboot is unsuccessful, reimage the node.

Alternative remediations are investigated by AKS engineers if auto-repair is unsuccessful.

If AKS finds multiple unhealthy nodes during a health check, each node is repaired individually before another repair begins.

Limitations

In many cases, AKS can determine if a node is unhealthy and attempt to repair the issue, but there are cases where AKS either can't repair the issue or can't detect that there is an issue. For example, AKS can't detect issues if a node status is not being reported due to error in network configuration.

Next steps

Use Availability Zones to increase high availability with your AKS cluster workloads.