To resolve an issue on a node (whose node-problem-detector was leaking memory and consuming excessive CPU), we attempted to replace the node: we scaled the node pool up to 2 nodes, drained the original node, then scaled back down.
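For reference, this is roughly the sequence of commands we ran; the resource group, cluster, pool, and node names below are placeholders, not our actual values:

```shell
# Scale the pool up so a replacement node is created.
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-count 2

# Cordon and evict workloads from the problematic node (vmss000000).
kubectl drain aks-nodepool1-12345678-vmss000000 \
  --ignore-daemonsets --delete-emptydir-data

# Scale back down, expecting the drained node to be the one removed.
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-count 1
```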
Before starting the scale-down operation, we marked the newly-created node to be protected from scale set actions.
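We applied the protection directly to the underlying scale-set instance, along these lines (again with placeholder names; the `MC_*` group is the cluster's auto-generated node resource group):

```shell
# Protect the new instance (vmss000002, instance ID 2) from
# scale-set actions such as scale-in.
az vmss update \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 2 \
  --protect-from-scale-set-actions true
```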
However, the control plane unexpectedly scaled down the new, protected node (vmss000002) rather than the problematic one (vmss000000). This left the deployments stuck in a Pending state, since vmss000000 was still cordoned from the drain.
This strongly implies that Azure Kubernetes Service does not apply its changes to the underlying scale sets in a way that respects the instance protection policies you can configure. Am I missing something in the documentation, is this expected behaviour, or is this a bug?
In either case, given the scenario above, what is the best way to replace a node without downtime?