AKS Scale Down Node Protection

Jamie 0 Reputation points
2023-12-29T08:58:39.4+00:00

To resolve an issue on a node (its node-problem-detector was leaking memory and wasting CPU), we attempted to replace the node: we scaled the node pool up to 2 nodes, drained the original node, then scaled back down.

Before starting the scale-down operation, we marked the newly-created node to be protected from scale set actions.
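For reference, we applied the protection roughly like the following (the resource group, scale set name, and instance ID below are placeholders for our environment):

```bash
# Mark the new VMSS instance as protected before scaling the pool down.
# MC_* node resource group, scale set name, and instance ID are placeholders.
az vmss update \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 2 \
  --protect-from-scale-in true \
  --protect-from-scale-set-actions true
```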

However, the AKS control plane unexpectedly scaled down the new/protected node (vmss000002) rather than the problematic one (vmss000000). This left our deployments in a Pending state, since vmss000000 was still cordoned from the drain.

This strongly implies that Azure Kubernetes Service does not apply its changes to the underlying scale sets in a way that respects the instance protection policies you can configure. Am I missing something in the documentation, is this expected behaviour, or is this a bug?

Either way, given the scenario above, what is the best method to replace a node without downtime?


3 answers

  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.



  2. Jamie 0 Reputation points
    2023-12-29T15:22:28.6533333+00:00

    I can see an answer was posted and deleted - but it was exactly what I was looking for!

If anyone else comes across this: scale the node pool up to N+1, drain the faulty node via kubectl, re-image the faulty node through the Portal, then uncordon it and scale back down. A rough CLI equivalent is sketched below.
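    A sketch of that flow with the Azure CLI and kubectl (all resource, scale set, and node names below are placeholders; adjust them to your cluster, node pool, and MC_* node resource group):

    ```bash
    # 1. Scale the node pool up to N+1 so workloads have somewhere to go.
    az aks nodepool scale \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --node-count 2

    # 2. Drain the faulty node so its pods reschedule onto the new node.
    kubectl drain aks-nodepool1-12345678-vmss000000 \
      --ignore-daemonsets --delete-emptydir-data

    # 3. Re-image the faulty VMSS instance (via the Portal, or this CLI equivalent).
    az vmss reimage \
      --resource-group MC_myResourceGroup_myAKSCluster_eastus \
      --name aks-nodepool1-12345678-vmss \
      --instance-ids 0

    # 4. Bring the re-imaged node back into service.
    kubectl uncordon aks-nodepool1-12345678-vmss000000

    # 5. Scale the node pool back down to N.
    az aks nodepool scale \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --node-count 1
    ```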


  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.

