AKS Scale Down Node Protection

Jamie 0 Reputation points
2023-12-29T08:58:39.4+00:00

To resolve an issue on a node (its node-problem-detector was leaking memory and wasting CPU), we attempted to replace the node: we scaled the node pool up to 2 nodes, drained the original node, then scaled back down.

Before starting the scale-down operation, we marked the newly-created node to be protected from scale set actions.
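For reference, we applied the protection roughly like the following (the resource group, scale set name, and instance ID below are placeholders for our environment):

```bash
# Mark the new VMSS instance as protected before scaling the pool down.
# MC_* node resource group, scale set name, and instance ID are placeholders.
az vmss update \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 2 \
  --protect-from-scale-in true \
  --protect-from-scale-set-actions true
```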

However, the AKS control plane unexpectedly scaled down the new/protected node (vmss000002) rather than the problematic one (vmss000000). This left our deployments in a Pending state, since vmss000000 was still cordoned from the drain.

This strongly implies that Azure Kubernetes Service does not apply its changes to the underlying scale sets in a way that respects the instance protection policies you can configure. Am I missing something in the documentation, is this expected behaviour, or is this a bug?

Either way, given the scenario above, what is the best method to replace a node without downtime?


3 answers

  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.



  2. Jamie 0 Reputation points
    2023-12-29T15:22:28.6533333+00:00

    I can see an answer was posted and deleted - but it was exactly what I was looking for!

If anyone else comes across this: scale the node pool up to N+1, drain the faulty node via kubectl, re-image the faulty node through the Portal, then uncordon it and scale back down. A rough CLI equivalent is sketched below.
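    A sketch of that flow with the Azure CLI and kubectl (all resource, scale set, and node names below are placeholders; adjust them to your cluster, node pool, and MC_* node resource group):

    ```bash
    # 1. Scale the node pool up to N+1 so workloads have somewhere to go.
    az aks nodepool scale \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --node-count 2

    # 2. Drain the faulty node so its pods reschedule onto the new node.
    kubectl drain aks-nodepool1-12345678-vmss000000 \
      --ignore-daemonsets --delete-emptydir-data

    # 3. Re-image the faulty VMSS instance (via the Portal, or this CLI equivalent).
    az vmss reimage \
      --resource-group MC_myResourceGroup_myAKSCluster_eastus \
      --name aks-nodepool1-12345678-vmss \
      --instance-ids 0

    # 4. Bring the re-imaged node back into service.
    kubectl uncordon aks-nodepool1-12345678-vmss000000

    # 5. Scale the node pool back down to N.
    az aks nodepool scale \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --node-count 1
    ```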


  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.

