Troubleshoot node not ready failures caused by CSE errors

04/10/2024

This article helps you troubleshoot scenarios in which a Microsoft Azure Kubernetes Service (AKS) cluster isn't in the Succeeded state and an AKS node isn't ready within a node pool because of custom script extension (CSE) errors.

Prerequisites

Azure CLI

Symptoms

Because of CSE errors, an AKS cluster node isn't ready within a node pool, and the AKS cluster isn't in the Succeeded state.

Cause

The node extension deployment fails and returns more than one error code when you provision the kubelet and other components. This is the most common cause of errors. To verify that the node extension deployment is failing when you provision the kubelet, follow these steps:

To better understand the current failure on the cluster, run the az aks show and az resource update commands to set up debugging:

clusterResourceId=$(az aks show \
    --resource-group <resource-group-name> --name <cluster-name> --output tsv --query id)
az resource update --debug --verbose --ids "$clusterResourceId"

Check the debugging output and the error messages that you received from the az resource update command against the error list in the CSE helper executable file on GitHub.

If any of the errors involve the CSE deployment of the kubelet, then you've verified that the scenario that's described here is the cause of the Node Not Ready failure.

In general, exit codes identify the specific issue that's causing the failure. For example, you'll see messages such as "Unable to communicate with API server" or "Unable to connect to internet." The exit codes might also indicate API network time-outs or a node fault that requires replacing the node.
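
As a rough illustration of how an exit code maps to an issue, the following sketch pulls the exit status out of an error message and prints a hint. The function names are hypothetical. The "exit status=<code>" message format and the meanings of the three codes are assumptions that are based on the error list in the CSE helper file, so verify them against the current version of that file.

```shell
# Extract the numeric code from text such as "... exit status=50".
# Assumption: the CSE error message contains "exit status=<code>".
extract_cse_exit_code() {
    printf '%s\n' "$1" | sed -n 's/.*exit status=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Map a few exit codes to hints.
# Assumption: these meanings match the CSE helper file; check it to confirm.
describe_cse_exit_code() {
    case "$1" in
        50) echo "Outbound connectivity failure: check egress rules and firewall" ;;
        51) echo "Cannot connect to the API server: check port 443 egress" ;;
        52) echo "Cannot resolve the API server name: check DNS settings" ;;
        "") echo "No exit code found in the message" ;;
        *)  echo "Look up exit code $1 in the CSE helper file" ;;
    esac
}

# Usage (hypothetical message text):
# code=$(extract_cse_exit_code "command terminated with exit status=50")
# describe_cse_exit_code "$code"
```

Any code that isn't listed falls through to the default branch, which points back to the CSE helper file instead of guessing at a meaning.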

Solution 1: Make sure your custom DNS server is configured correctly

Set up your custom Domain Name System (DNS) server so that it can do name resolution correctly. Configure the server to meet the following requirements:

If you're using custom DNS servers, make sure that the servers are healthy and reachable over the network.
Make sure that custom DNS servers have the required conditional forwarders to the Azure DNS IP address (or the forwarder to that address).
Make sure that your private AKS DNS zone is linked to your custom DNS virtual networks if they're hosted on Azure.
Don't configure the Azure DNS IP address together with the IP addresses of your custom DNS servers in the same DNS settings. Mixing them isn't recommended.
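
To illustrate the last requirement, here's a minimal sketch that flags a DNS server list in which the Azure DNS IP address appears alongside custom servers. The function name is hypothetical. The sketch assumes that the Azure DNS IP address is 168.63.129.16 and that the servers are passed as separate arguments.

```shell
AZURE_DNS_IP="168.63.129.16"  # Assumption: the Azure DNS virtual IP address.

# Print a warning and return 1 if the list mixes Azure DNS with other servers.
check_dns_server_mix() {
    has_azure=0
    has_custom=0
    for ip in "$@"; do
        if [ "$ip" = "$AZURE_DNS_IP" ]; then
            has_azure=1
        else
            has_custom=1
        fi
    done
    if [ "$has_azure" -eq 1 ] && [ "$has_custom" -eq 1 ]; then
        echo "Warning: Azure DNS is mixed with custom DNS servers"
        return 1
    fi
    echo "OK"
}

# Usage: pass the DNS servers that are configured on the virtual network.
# servers=$(az network vnet show --resource-group <resource-group-name> \
#     --name <virtual-network-name> --query "dhcpOptions.dnsServers" --output tsv)
# check_dns_server_mix $servers
```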

Avoid using IP addresses instead of the DNS server in DNS settings. To check DNS connectivity and name resolution from a node in a virtual machine (VM) scale set or availability set, use the following Azure CLI commands.

For VM scale set nodes, use the az vmss run-command invoke command:

# Check that the DNS server is reachable on port 53.
az vmss run-command invoke \
    --resource-group <resource-group-name> \
    --name <vm-scale-set-name> \
    --instance-id 0 \
    --command-id RunShellScript \
    --output tsv \
    --query "value[0].message" \
    --scripts "telnet <dns-ip-address> 53"

# Check that the DNS server resolves the API server FQDN.
az vmss run-command invoke \
    --resource-group <resource-group-name> \
    --name <vm-scale-set-name> \
    --instance-id 0 \
    --command-id RunShellScript \
    --output tsv \
    --query "value[0].message" \
    --scripts "nslookup <api-fqdn> <dns-ip-address>"

For VM availability set nodes, use the az vm run-command invoke command:

# The --name parameter takes the name of a VM in the availability set.
# Check that the DNS server is reachable on port 53.
az vm run-command invoke \
    --resource-group <resource-group-name> \
    --name <vm-name> \
    --command-id RunShellScript \
    --output tsv \
    --query "value[0].message" \
    --scripts "telnet <dns-ip-address> 53"

# Check that the DNS server resolves the API server FQDN.
az vm run-command invoke \
    --resource-group <resource-group-name> \
    --name <vm-name> \
    --command-id RunShellScript \
    --output tsv \
    --query "value[0].message" \
    --scripts "nslookup <api-fqdn> <dns-ip-address>"
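
The scale set commands above target instance 0 only. If that instance doesn't exist, or if you want to test every node, a loop such as the following sketch might help. The function name is hypothetical, and the sketch assumes that the scale set is small enough to check one instance at a time.

```shell
# Run a shell script on every instance of a VM scale set.
# Arguments: resource group, scale set name, script to run.
run_on_all_instances() {
    rg=$1; vmss=$2; script=$3
    # List the instance IDs, one per line, and then invoke the script on each.
    for id in $(az vmss list-instances --resource-group "$rg" --name "$vmss" \
            --query "[].instanceId" --output tsv); do
        echo "=== instance $id ==="
        az vmss run-command invoke \
            --resource-group "$rg" \
            --name "$vmss" \
            --instance-id "$id" \
            --command-id RunShellScript \
            --output tsv \
            --query "value[0].message" \
            --scripts "$script"
    done
}

# Usage:
# run_on_all_instances <resource-group-name> <vm-scale-set-name> \
#     "nslookup <api-fqdn> <dns-ip-address>"
```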

For more information, see Name resolution for resources in Azure virtual networks and Hub and spoke with custom DNS.

Solution 2: Fix API network time-outs

Make sure that the API server can be reached and isn't subject to delays. To do this, follow these steps:

Check the AKS subnet to see whether the assigned network security group (NSG) is blocking egress traffic on port 443 to the API server.
Check the node itself to see whether the node has another NSG that's blocking the traffic.
Check the AKS subnet for any assigned route table. If the route table sends traffic through a network virtual appliance (NVA) or firewall, make sure that the appliance allows egress traffic on port 443. For more information, see Control egress traffic for cluster nodes in AKS.
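
To confirm from the node's side whether port 443 is reachable, you might run a connection test through the same run-command mechanism that's used for the DNS checks. The following is a minimal sketch for scale set nodes. The function name is hypothetical, and the sketch assumes that the nc utility is available on the node.

```shell
# Test TCP connectivity from a scale set node to the API server on port 443.
# Arguments: resource group, scale set name, instance ID, API server FQDN.
# Assumption: nc is installed on the node image.
probe_api_server() {
    az vmss run-command invoke \
        --resource-group "$1" \
        --name "$2" \
        --instance-id "$3" \
        --command-id RunShellScript \
        --output tsv \
        --query "value[0].message" \
        --scripts "nc -vz -w 5 $4 443"
}

# Usage:
# probe_api_server <resource-group-name> <vm-scale-set-name> 0 <api-fqdn>
#
# To review the rules of an NSG that's assigned to the subnet or the node:
# az network nsg rule list --resource-group <resource-group-name> \
#     --nsg-name <nsg-name> --output table
```

The five-second time-out (-w 5) keeps the test from hanging if a firewall silently drops the traffic.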

If the DNS resolves names successfully and the API is reachable, but the node CSE failed because of an API time-out, take the appropriate action as shown in the following table.

Set type	Action
VM availability set	Delete the node from the Azure portal, remove it from the AKS API by using the kubectl delete node command, and then scale up the cluster again.
VM scale set	Either reimage the node, or delete the node, and then scale up the cluster again.
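
Before you delete a node, you need its name. The following sketch filters kubectl output for nodes that aren't ready. The function name is hypothetical, and the sketch assumes the default column layout of kubectl get nodes, in which the node name is the first field and the status is the second field.

```shell
# Read "kubectl get nodes --no-headers" output on standard input and print
# the names of nodes whose STATUS column (second field) contains NotReady.
list_not_ready_nodes() {
    awk '$2 ~ /NotReady/ { print $1 }'
}

# Usage:
# kubectl get nodes --no-headers | list_not_ready_nodes
# kubectl delete node <node-name>
# az aks scale --resource-group <resource-group-name> --name <cluster-name> \
#     --nodepool-name <node-pool-name> --node-count <node-count>
```

The pattern also matches combined statuses such as NotReady,SchedulingDisabled, so cordoned nodes that aren't ready are included.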

If the requests are being throttled by the AKS API server, upgrade to a higher service tier. For more information, see AKS Uptime SLA.

More information

For general troubleshooting steps, see Basic troubleshooting of Node Not Ready failures.