question

ChengPeng-1500 asked srbose-msft answered

AKS Node restarts regardless of PDB

I have several Pods with a PodDisruptionBudget (maxUnavailable: 1) in an AKS node pool (one Pod per Node). I upgraded the node pool right after the control plane upgrade finished (Kubernetes version v1.18 to v1.19), and then more than one Node restarted at the same time, which caused workload downtime.
According to the AKS CSS team, after the control plane is upgraded, an additional reconciliation loop is triggered by design: a PUT request checks whether the configuration of the VMSS instances is as expected, and if not, a restart/reimage may be carried out. So during my node pool upgrade, some Nodes unexpectedly restarted twice.
My question is: what kinds of AKS reconciliation restart Nodes regardless of PDBs? I suspect certain operations carry some risk, since Nodes are sometimes restarted directly without being drained.

azure-kubernetes-service


1 Answer

srbose-msft answered

@ChengPeng-1500 , Thank you for your question.

The various AKS reconcilers (including the agent pool reconciler, which can affect your node pools) operate in an idempotent fashion. A reconciler only acts when a component is out of goal state, and it aims to bring that component back to the goal state defined in the managed cluster profile. The evaluation of whether a reconciler needs to act does not happen after an upgrade completes unless the upgrade concluded in a Failed state. If a PUT operation on the AKS cluster ends in a Failed state, the cluster is marked for auto-reconciliation at the AKS Resource Provider, which processes reconciliations on a first-come, first-served basis in each Azure region.

In general, the evaluation of whether a reconciler needs to act is part of any PUT operation (upgrade, scale, update, rotate certificates, etc.), and the reconciler (if needed) is triggered as a sub-operation. The reconciler pattern applies to create, delete, scale, and upgrade operations on agent pools, for both VMSS and VMAS agent pools.

In your case, the goal state for the node pool upgrade would have been set on the managed cluster profile first, and then the agent pool reconciler (if triggered by some other out-of-goal state) would account for the target Kubernetes version and reimage the nodes in one pass, respecting Pod Disruption Budgets during cordon and drain of the nodes. If Pod Disruption Budgets (other than system-defined ones) had blocked this operation, the upgrade operation would have exited, leaving the cluster in a Failed provisioning state. Unless PodDisruptionBudgets (PDBs) allow at least one pod replica to be moved at a time, the drain/evict operation will fail, no matter how many times the reconciler subsequently retries the drain. [Reference]
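As an illustrative sketch of a PDB that permits a drain to proceed one pod at a time, something like the following could be used (the name, labels, and replica assumptions here are hypothetical; on a v1.19 cluster the PDB API is still `policy/v1beta1`, as `policy/v1` only became GA in v1.21):

```yaml
# Hypothetical PDB: allows at most one pod matching app=myapp to be
# unavailable at a time, so a cordon/drain can evict one replica and
# wait for a replacement to become Ready before evicting the next.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
```

Note that with only a single replica per node pool (as in your setup), `maxUnavailable: 1` still permits the eviction, so a brief disruption during drain is expected; a PDB only limits voluntary evictions beyond the stated budget, it does not guarantee zero downtime for single-replica workloads.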

Any AKS PUT operation, and by extension any reconciler, is designed to gracefully drain nodes (if needed) and, if unable to do so, to exit as Failed.

Multiple nodes restarting at the same time during a node pool upgrade is easily achieved through an increased node surge setting on the node pool, but surge upgrades still respect user-defined PDBs during cordon and drain operations.

However, there are several other scenarios in which nodes might restart without PDBs being honored, for example a Power Distribution Unit (PDU) fault, the VM host going down, or an error in the guest OS. These events do not take Kubernetes orchestration into account, so workloads may be impacted in spite of Pod Disruption Budgets. We highly recommend reviewing this article for business continuity and disaster recovery best practices in AKS.

Another possibility is that, if correct readiness probes were not set on the pods, a newly created pod (created because of a node drain) may have reached the Running state before the application on it was actually live, and Kubernetes may then have started evicting another pod from the same ReplicaSet, rendering the application unreachable for some time.
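To mitigate this, a readiness probe ensures a pod is only counted as available (including for PDB accounting) once the application actually responds. A minimal sketch, assuming the application exposes an HTTP health endpoint (the path and port below are assumptions for illustration):

```yaml
# Hypothetical readiness probe fragment for a container spec: the pod
# is only marked Ready, and thus counted by the PDB, once /healthz
# returns a 2xx status on port 8080.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5   # wait for the app to start before probing
  periodSeconds: 10        # probe every 10 seconds
  failureThreshold: 3      # mark NotReady after 3 consecutive failures
```

With such a probe in place, a drain will not evict the next pod until the replacement is genuinely serving traffic, which avoids the window of unreachability described above.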

It might be worthwhile to revisit the node restart operation logs to identify exactly what initiated the multiple restarts on the same node, and which of those restarts disregarded the PDBs.


Hope this helps.

Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.

