PoluriVenudhar-7486 asked KaplingatNikhil-7766 answered

AKS automatic Node repair with VMSS

In AKS, when we shut down a VM, the node is reported as NotReady, but it does not come back up even after 30 minutes. We are using availability zones, so Virtual Machine Scale Sets are enabled automatically. We therefore added a health extension (ApplicationHealthLinux) to the VMSS created by AKS. However, enabling automatic repairs on that VMSS fails with the following error:
"Automatic repairs not supported for this Virtual Machine Scale Set because a health probe or health extension was not provided".

Are automatic node repairs supported in AKS with VMSS? And are there any alternatives or workarounds?

azure-kubernetes-service azure-virtual-machines-scale-set azure-virtual-machines-extension

srbose-msft answered

@PoluriVenudhar-7486, thank you for the question.

AKS has resilience mechanisms that let it withstand a stopped or deallocated node VM and recover from it, but this isn't a supported configuration. Stop your cluster instead.
AKS node auto-repair does apply here, but it works differently from automatic instance repairs for Azure Virtual Machine Scale Sets.
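
For reference, stopping the whole cluster rather than individual node VMs might look like the sketch below. It assumes the Azure CLI is installed and logged in; the function names are only illustrative, and the resource group and cluster names you pass in are your own.

```shell
# Stop the entire AKS cluster (control plane and node pools) instead of
# stopping or deallocating individual node VMs.
# $1 = resource group of the cluster, $2 = cluster name
stop_cluster() {
  az aks stop --resource-group "$1" --name "$2"
}

# Start the cluster again later.
# $1 = resource group of the cluster, $2 = cluster name
start_cluster() {
  az aks start --resource-group "$1" --name "$2"
}
```

Example usage, with placeholder names: stop_cluster myResourceGroup myAKSCluster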

If the node stays in a NotReady state for a long time after the node VM has started, please try the following steps:

  1. SSH to the node. How-to

  2. Collect kubelet logs. How-to

  3. Check whether the Docker daemon is running with sudo systemctl status docker (for containerd, use sudo systemctl status containerd). For Windows nodes, use the Get-Service command.

  4. If it is inactive, start it with sudo systemctl start docker (for containerd, use sudo systemctl start containerd). For Windows nodes, use the Start-Service command.

  5. Check whether the kubelet service is running with sudo systemctl status kubelet. For Windows nodes, use Get-Service.

  6. If it is inactive, start it with sudo systemctl start kubelet. For Windows nodes, use Start-Service.

  7. If the node is still in a NotReady state, try restarting the VM/VMSS instance.
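
Steps 3 to 6 can be combined into a small helper for Linux nodes. This is only a sketch, meant to be run on the node itself (for example over SSH); the function name ensure_active is made up for illustration, and it uses systemctl is-active instead of systemctl status so the result can be checked in a script.

```shell
# Check a systemd service and start it if it is not active.
# $1 = service name, for example containerd, docker, or kubelet
ensure_active() {
  service="$1"
  if systemctl is-active --quiet "$service"; then
    echo "$service is active"
  else
    echo "$service is not active, starting it"
    sudo systemctl start "$service"
  fi
}
```

Example usage: ensure_active containerd, then ensure_active kubelet. If the node remains NotReady afterwards, step 7 still applies.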

If you are still facing the issue, please let us know.



Hope this helps.

Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.





PoluriVenudhar-7486 answered srbose-msft edited

Thanks @srbose-msft for your inputs.

We are actually trying to simulate a VM failure to check how and when AKS brings the node back.
We tried stopping the VMSS instance and also logging in to a node and shutting it down. In both cases AKS did not bring the VM back up automatically, and we are not sure why.
Is there any way to simulate a VM failure and verify automatic repair in AKS?

@PoluriVenudhar-7486, currently there is no standard way to check how the AKS remediator kicks in when nodes are NotReady. In many cases AKS can determine that a node is unhealthy and attempt to repair it, but there are cases where AKS either can't repair the issue or can't detect that there is an issue. For example, AKS can't detect issues if the node status is not being reported because of an error in the network configuration. [Reference]

Alternative remediations are investigated by AKS engineers if auto-repair is unsuccessful.
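
While there is no standard way to trigger the remediator, you can at least observe how long a node stays NotReady. The sketch below polls the node's Ready condition with kubectl; the function names and the 30-second default interval are assumptions for illustration, and it only reports what the API server sees, not what AKS is doing internally.

```shell
# Print the status of a node's Ready condition: True, False, or Unknown.
# $1 = node name as shown by kubectl get nodes
node_ready_status() {
  kubectl get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Poll until the node reports Ready=True, printing a timestamped line each time.
# $1 = node name, $2 = polling interval in seconds (assumed default: 30)
wait_for_ready() {
  node="$1"
  interval="${2:-30}"
  while [ "$(node_ready_status "$node")" != "True" ]; do
    echo "$(date -u +%H:%M:%S) $node Ready=$(node_ready_status "$node")"
    sleep "$interval"
  done
  echo "$(date -u +%H:%M:%S) $node Ready=True"
}
```

The timestamps give a rough measure of how long the node was unavailable. Note that the loop never exits if the node is never repaired.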


KaplingatNikhil-7766 answered

I too was looking for this info. I just forced a node out of my AKS cluster by running the command "az vmss deallocate". As expected, kubectl shows the node as "NotReady". But the node has not come back even after 30 minutes, so it looks like AKS node auto-repair did not kick in in this case.
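
A rough sketch of what I ran is below. The exact arguments are not recorded here, so the function names and parameters are illustrative only. It assumes the scale set lives in the cluster's node resource group, and, as noted earlier in this thread, deallocating node VMs directly is not a supported configuration.

```shell
# Look up the node resource group that holds the AKS-managed scale set.
# $1 = resource group of the cluster, $2 = cluster name
node_resource_group() {
  az aks show --resource-group "$1" --name "$2" --query nodeResourceGroup --output tsv
}

# Deallocate a single instance of the scale set to take one node offline.
# $1 = node resource group, $2 = scale set name, $3 = instance ID
deallocate_instance() {
  az vmss deallocate --resource-group "$1" --name "$2" --instance-ids "$3"
}
```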

Please let me know once the AKS team finds a predictable way to simulate a scenario that triggers the node repair feature.
