Troubleshoot memory saturation in AKS clusters

04/10/2024

This article discusses methods for troubleshooting memory saturation issues. Memory saturation occurs if at least one application or process needs more memory than a container host can provide, or if the host exhausts its available memory.

Prerequisites

The Kubernetes kubectl command-line tool. To install kubectl by using Azure CLI, run the az aks install-cli command.

Symptoms

The following table outlines the common symptoms of memory saturation.

Symptom	Description
Unschedulable pods	Additional pods can't be scheduled if the node is close to its set memory limit.
Pod eviction	If a node is running out of memory, the kubelet can evict pods. Although the control plane tries to reschedule the evicted pods on other nodes that have resources, there's no guarantee that other nodes have sufficient memory to run these pods.
Node not ready	Memory saturation can cause `kubelet` and `containerd` to become unresponsive, eventually causing node readiness issues.
Out-of-memory (OOM) kill	If pod eviction can't free memory quickly enough to relieve the node, the Linux OOM killer terminates processes to reclaim memory.
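
Some of these symptoms can be spotted from the pod list. The following sketch is an illustration, not part of the documented procedure. It filters the output of kubectl get pods --all-namespaces for pods whose STATUS column is Evicted or OOMKilled. The function name find_memory_symptom_pods is hypothetical, and the sketch assumes the default column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print "namespace/pod status" for pods whose STATUS column suggests a
# memory-related failure. Reads `kubectl get pods --all-namespaces` output
# on stdin. Assumes the default column order, where STATUS is field 4.
find_memory_symptom_pods() {
    awk 'NR > 1 && ($4 == "Evicted" || $4 == "OOMKilled") {
        print $1 "/" $2 " " $4
    }'
}

# Example (requires cluster access):
#   kubectl get pods --all-namespaces | find_memory_symptom_pods
```

A container that's repeatedly OOM-killed might show a different status, such as CrashLoopBackOff, so an empty result doesn't rule out memory saturation.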

Troubleshooting checklist

To reduce memory saturation, use effective monitoring tools and apply best practices.

Step 1: Identify nodes that have memory saturation

Use either of the following methods to identify nodes that have memory saturation:

In a web browser, use the Container Insights feature of AKS in the Azure portal.
In a console, use the Kubernetes command-line tool (kubectl).

Browser

Container Insights is a feature within AKS that monitors container workload performance. For more information, see Enable Container insights for Azure Kubernetes Service (AKS) cluster.

On the Azure portal, search for and select Kubernetes services.
In the list of Kubernetes services, select the name of your cluster.
In the navigation pane of your cluster, find the Monitoring heading, and then select Insights.
Set the appropriate Time Range value.
Select the Nodes tab.
In the Metric list, select Memory working set (computed from Allocatable).
In the percentiles selector, set the sample to Max, and then select the Max % column label two times. This action sorts the nodes in the table by the maximum percentage of memory used, from highest to lowest.

Screenshot: The table contains four rows that represent four nodes in an AKS agent pool virtual machine scale set. The statuses are all **Ok**, the maximum percentage of memory used ranges from 58 to 64 percent, the maximum memory used ranges from 2.6 GB to 2.86 GB, the number of containers ranges from 20 to 24, and the uptime ranges from 6 to 15 days. No controllers are listed.
Because the first node has the highest memory usage, select that node to investigate the memory usage of the pods that are running on the node.

Screenshot: Nine processes are listed under the node. The statuses are all **Ok**, the maximum percentage of memory used for the processes ranges from 0.3 to 16 percent, the maximum memory used is from 0.7 mc to 22 mc, the number of containers ranges from 1 to 3, and the uptime is 3 to 4 days. Unlike the node, each process has a corresponding controller listed. The controller names are prefixes of the process names, and they're hyperlinked.

Note

The percentage of CPU or memory usage for pods is based on the CPU or memory request that's specified for the container. It doesn't represent the percentage of the CPU or memory usage for the node. Therefore, look at the actual CPU or memory usage rather than the percentage of CPU or memory usage for pods.

Command Line

This procedure uses kubectl commands in a console. It displays only the current state of the nodes.

Get the memory usage of the nodes by running the kubectl top node command:

kubectl top node

The output of this command resembles the following text:

NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-agentpool-30486455-vmss000003    239m         12%    3148Mi          69%
aks-agentpool-30486455-vmss000005    326m         17%    2143Mi          46%
aks-testmemory-30616462-vmss000000   66m          3%     1532Mi          28%
aks-testmemory-30616462-vmss000001   90m          4%     1689Mi          31%
aks-testmemory-30616462-vmss000002   74m          3%     1715Mi          31%
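
On clusters that have many nodes, it can help to filter this output instead of reading it by eye. The following sketch prints only the nodes whose MEMORY% value meets a threshold. The function name nodes_above_memory_threshold and the default threshold of 80 are assumptions for illustration, and the sketch assumes the five-column layout that's shown in the sample output.

```shell
# Print nodes whose MEMORY% column (field 5 of `kubectl top node` output)
# is greater than or equal to a threshold. Reads the output on stdin.
# The threshold defaults to 80, which is an arbitrary example value.
nodes_above_memory_threshold() {
    local threshold="${1:-80}"
    awk -v limit="$threshold" '
        NR == 1 { next }              # skip the header row
        {
            pct = $5
            sub(/%$/, "", pct)        # "69%" becomes "69"
            if (pct + 0 >= limit + 0) print $1, pct "%"
        }'
}

# Example (requires cluster access and a working metrics pipeline):
#   kubectl top node | nodes_above_memory_threshold 60
```

For the sample output, a threshold of 60 would print only aks-agentpool-30486455-vmss000003, which is at 69%.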

Get the list of pods that are running on the node and their memory usage by running the kubectl get pods and kubectl top pods commands:

kubectl get pods --all-namespaces --output wide \
    | grep <node-name> \
    | awk '{print $1" "$2}' \
    | xargs -n2 kubectl top pods --namespace \
    | awk 'NR==1 || NR%2==0' \
    | sort -k3n \
    | column -t

Note

In this code snippet, replace <node-name> with the actual node name.

The output of the code snippet resembles the following text:

NAME                                 CPU(cores)   MEMORY(bytes)
coredns-autoscaler-5655d66f64-9fp2k  1m           7Mi
shippingservice-7946db7679-qzplg     6m           15Mi
azure-ip-masq-agent-tb8xv            1m           16Mi
cloud-node-manager-wggqd             1m           16Mi
kube-proxy-c244z                     1m           22Mi
coredns-59b6bf8b4f-5zg5s             3m           24Mi
coredns-59b6bf8b4f-5x62d             3m           25Mi
currencyservice-7977f668dc-rvbwm     12m          32Mi
csi-azurefile-node-9fcx8             2m           38Mi
metrics-server-5f8d84558d-frsq4      4m           42Mi
metrics-server-5f8d84558d-rc5nj      4m           43Mi
csi-azuredisk-node-9fh7h             2m           46Mi
adservice-795589cf6f-xs66r           4m           87Mi
ama-metrics-node-54sfj               16m          249Mi
ama-logs-rs-6db98d6dff-vj4xw         13m          259Mi
ama-logs-w5bmd                       12m          403Mi
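
To compare the pods' combined usage against the node's usage from kubectl top node, you can total the MEMORY(bytes) column. The following sketch is one way to do that. The function name sum_pod_memory_mi is hypothetical, and the sketch assumes that every value is reported in Mi, as in the sample output. Rows that use other units are skipped.

```shell
# Sum the MEMORY(bytes) column (field 3) of `kubectl top pods` style output.
# Reads the output on stdin and prints the total in Mi.
# Only values with an "Mi" suffix are counted; other rows are ignored.
sum_pod_memory_mi() {
    awk '
        NR == 1 { next }                      # skip the header row
        $3 ~ /^[0-9]+Mi$/ {
            mem = $3
            sub(/Mi$/, "", mem)               # "403Mi" becomes "403"
            total += mem
        }
        END { printf "%dMi\n", total }'
}

# Example: append the function to the pipeline shown earlier, after `sort`:
#   ... | sort -k3n | sum_pod_memory_mi
```

A large gap between this total and the node's reported memory usage suggests that memory is being consumed outside the listed pods, such as by system daemons on the host.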

Review the requests and limits for each pod on the node by running the kubectl describe node command:

kubectl describe node <node-name>

The output of this command resembles the following text:

  Namespace    Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------    ----                                 ------------  ----------  ---------------  -------------  ---
  default      adservice-795589cf6f-dgrx7           200m (10%)    300m (15%)  180Mi (3%)       300Mi (6%)     49m
  default      cartservice-6d994d9676-tcr6m         200m (10%)    300m (15%)  64Mi (1%)        128Mi (2%)     49m
  default      frontend-848d9f9dc9-x712b            100m (5%)     200m (10%)  64Mi (1%)        128Mi (2%)     49m
  default      loadgenerator-5c9656f8d6-7vmjr       300m (15%)    500m (26%)  256Mi (5%)       512Mi (11%)    38m
  default      redis-cart-799c85c644-vzpjl          70m (3%)      125m (6%)   200Mi (4%)       256Mi (5%)     49m
  kube-system  ama-logs-zs4qf                       150m (7%)     1 (52%)     550Mi (12%)      1774Mi (38%)   16h
  kube-system  azure-ip-masq-agent-rqqpn            100m (5%)     500m (26%)  50Mi (1%)        250Mi (5%)     16h
  kube-system  cloud-node-manager-nbnrq             50m (2%)      0 (0%)      50Mi (1%)        512Mi (11%)    16h
  kube-system  coredns-59b6bf8b4f-m2prf             100m (5%)     3 (157%)    70Mi (1%)        500Mi (10%)    16h
  kube-system  csi-azuredisk-node-h445m             30m (1%)      0 (0%)      60Mi (1%)        400Mi (8%)     16h
  kube-system  csi-azurefile-node-489cp             30m (1%)      0 (0%)      60Mi (1%)        600Mi (13%)    16h
  kube-system  konnectivity-agent-665c7dfdb8-25p2f  20m (1%)      1 (52%)     20Mi (1%)        1Gi (22%)      15h
  kube-system  kube-proxy-v9gp4                     100m (5%)     0 (0%)      0 (0%)           0 (0%)         16h
Allocated resources:
  ...
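
The pod table in this output can also be checked mechanically for two conditions: pods that have no memory limit, and memory limits that add up to more than 100 percent of the node's allocatable memory. The following sketch assumes the column layout that's shown in the sample output (11 whitespace-separated fields per pod row). The function name memory_limit_report and its output labels are hypothetical.

```shell
# Read `kubectl describe node` output on stdin and report:
#   - each pod whose Memory Limits value is 0 (no limit set)
#   - the sum of the rounded per-pod Memory Limits percentages
memory_limit_report() {
    awk '
        # Pod rows have 11 fields, and field 10 is the "(N%)" token that
        # follows the memory limit. Header and separator rows do not match.
        NF == 11 && $10 ~ /^\([0-9]+%\)$/ {
            pct = $10
            gsub(/[()%]/, "", pct)            # "(38%)" becomes "38"
            total += pct
            if ($9 == "0") print "no-memory-limit " $1 "/" $2
        }
        END { printf "total-memory-limits %d%%\n", total }'
}

# Example (requires cluster access):
#   kubectl describe node <node-name> | memory_limit_report
```

For the sample rows, the rounded memory limit percentages add up to 133%, and kube-proxy-v9gp4 is the only pod that has no memory limit. The Allocated resources section of the same output reports the unrounded totals.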

Note

The percentage of CPU or memory usage for the node is based on the allocatable resources on the node rather than the actual node capacity.
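
As a rough illustration of that calculation, the following sketch computes a usage percentage from a memory value in Mi and an allocatable value in Ki. The function name memory_pct_of_allocatable is hypothetical, and the sketch assumes that the node reports allocatable memory with a Ki suffix. Check the actual unit on your cluster before you rely on it.

```shell
# Compute used memory as a whole-number percentage of allocatable memory.
#   $1: used memory, for example "3148Mi" (as printed by kubectl top node)
#   $2: allocatable memory, for example "4194304Ki" (assumed unit)
# Integer arithmetic truncates, so the result is rounded down.
memory_pct_of_allocatable() {
    local used_mi="${1%Mi}" allocatable_ki="${2%Ki}"
    # 1 Mi = 1024 Ki
    echo $(( used_mi * 1024 * 100 / allocatable_ki ))
}

# Example (requires cluster access):
#   alloc="$(kubectl get node <node-name> --output jsonpath='{.status.allocatable.memory}')"
#   memory_pct_of_allocatable 3148Mi "$alloc"
```

Replacing allocatable with capacity in the jsonpath expression returns the node's total memory, which makes the difference between the two values visible.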

Now that you've identified the pods that are using the most memory, you can identify the applications that are running in those pods.

Step 2: Review best practices to avoid memory saturation

Review the following table to learn how to implement best practices for avoiding memory saturation.

Best practice	Description
Use memory requests and limits	Kubernetes provides options to specify the minimum memory size (request) and the maximum memory size (limit) for a container. By configuring limits on pods, you can avoid memory pressure on the node. Make sure that the aggregate limits for all running pods don't exceed the node's available memory. When the aggregate limits do exceed it, the node is overcommitted. The Kubernetes scheduler places pods based on their requests, and the requests and limits together determine each pod's Quality of Service (QoS) class. Without appropriate requests and limits, the scheduler might place too many pods on a single node, which might eventually bring down the node. Additionally, when the kubelet evicts pods, it prioritizes pods whose memory usage exceeds their defined requests. We recommend that you set the memory request close to the actual usage.
Enable the horizontal pod autoscaler	By scaling the number of pod replicas, you can balance the requests across many pods to prevent memory saturation. This technique can reduce the memory footprint on a specific node.
Use anti-affinity tags	For scenarios in which memory is unbounded by design, you can use node selectors and affinity or anti-affinity tags, which can isolate the workload to specific nodes. By using anti-affinity tags, you can prevent other workloads from scheduling pods on these nodes. This reduces the memory saturation problem.
Choose higher SKU VMs	Virtual machines (VMs) that have more random-access memory (RAM) are better suited to handle high memory usage. To use this option, you must create a new node pool, cordon the nodes (make them unschedulable), and drain the existing node pool.
Isolate system and user workloads	We recommend that you run your applications on a user node pool. This configuration makes sure that you can isolate the Kubernetes-specific pods to the system node pool and maintain the cluster performance.
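
Two of these practices map directly to kubectl commands. The following sketch wraps them in helper functions: one sets a memory request and limit on a deployment, and the other cordons and drains every node in an existing node pool after a replacement pool is ready. The function names are hypothetical, the sketch assumes that nodes carry the agentpool=<pool-name> label, and the drain flags are a common choice rather than a requirement. Adjust them for your workloads.

```shell
# Set a memory request and limit on all containers of a deployment.
#   $1: namespace  $2: deployment name  $3: request  $4: limit
set_memory_bounds() {
    local namespace="$1" deployment="$2" request="$3" limit="$4"
    kubectl set resources deployment "$deployment" \
        --namespace "$namespace" \
        --requests="memory=${request}" \
        --limits="memory=${limit}"
}

# Cordon every node in a node pool, and then drain the nodes one by one.
# Cordoning all nodes first keeps evicted pods from landing on another
# node of the same pool. Assumes the nodes are labeled agentpool=<pool>.
#   $1: node pool name
drain_node_pool() {
    local pool="$1" node nodes
    nodes="$(kubectl get nodes --selector "agentpool=${pool}" --output name)" || return 1
    for node in $nodes; do
        # `--output name` prints "node/<name>"; strip the prefix.
        kubectl cordon "${node#node/}" || return 1
    done
    for node in $nodes; do
        kubectl drain "${node#node/}" \
            --ignore-daemonsets --delete-emptydir-data || return 1
    done
}

# Examples (require cluster access; all names are placeholders):
#   set_memory_bounds default <deployment-name> 180Mi 300Mi
#   drain_node_pool <old-pool-name>
```

Before you drain the old pool, make sure that the new node pool has enough capacity for the rescheduled pods. Otherwise, the pods remain in the Pending state.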

More information

Third-party information disclaimer

The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.