Configure GPU monitoring with Container insights
Starting with agent version ciprod03022019, the Container insights integrated agent supports monitoring GPU (graphics processing unit) usage on GPU-aware Kubernetes cluster nodes, and monitors pods and containers that request and use GPU resources.
Supported GPU vendors
Container insights supports monitoring GPU clusters from the following GPU vendors:
Container insights automatically starts monitoring GPU usage on nodes, as well as pods and workloads that request GPUs, by collecting the following metrics at 60-second intervals and storing them in the InsightsMetrics table.
After provisioning a cluster with GPU nodes, ensure that the GPU driver is installed as required by AKS to run GPU workloads. Container insights collects GPU metrics through GPU driver pods running on the node.
| Metric name | Metric dimension (tags) | Description |
|---|---|---|
| containerGpuDutyCycle | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpuRequests | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU Memory in bytes available to use for a specific container. |
| containerGpumemoryUsedBytes | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName, gpuId, gpuModel, gpuVendor | Amount of GPU Memory in bytes used by a specific container. |
| nodeGpuAllocatable | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Number of GPUs in a node that can be used by Kubernetes. |
| nodeGpuCapacity | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Total Number of GPUs in a node. |
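The two GPU memory metrics are most useful when combined into a utilization percentage per container and GPU. The following is a minimal sketch of that arithmetic, not part of Container insights itself. It assumes the metric samples have already been exported into Python dictionaries; the keys `name`, `value`, and `tags`, and the function name `gpu_memory_utilization`, are hypothetical and chosen only for illustration.

```python
def gpu_memory_utilization(samples):
    """Return {(containerName, gpuId): percent of GPU memory used}.

    Each sample is a dict with the hypothetical keys 'name' (metric name),
    'value' (numeric value), and 'tags' (dict of metric dimensions).
    """
    used = {}
    total = {}
    for sample in samples:
        tags = sample["tags"]
        key = (tags["containerName"], tags["gpuId"])
        if sample["name"] == "containerGpumemoryUsedBytes":
            used[key] = sample["value"]
        elif sample["name"] == "containerGpumemoryTotalBytes":
            total[key] = sample["value"]

    utilization = {}
    for key, total_bytes in total.items():
        # Skip pairs with no matching "used" sample or a zero total.
        if key in used and total_bytes > 0:
            utilization[key] = 100.0 * used[key] / total_bytes
    return utilization


# Illustrative data only: a container using 4 GiB of a 16 GiB GPU.
example_samples = [
    {"name": "containerGpumemoryUsedBytes", "value": 4 * 1024**3,
     "tags": {"containerName": "trainer", "gpuId": "0"}},
    {"name": "containerGpumemoryTotalBytes", "value": 16 * 1024**3,
     "tags": {"containerName": "trainer", "gpuId": "0"}},
]
print(gpu_memory_utilization(example_samples))
```

If several samples exist for the same container and GPU, this sketch keeps only the last one seen; a real report would probably pair samples by timestamp instead.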
GPU performance charts
Container insights includes pre-configured charts for the metrics listed earlier in the table as a GPU workbook for every cluster. See Workbooks in Container insights for a description of the workbooks available for Container insights.
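For ad hoc analysis outside the workbook, the same metrics can be read from the InsightsMetrics table with a log query. The sketch below only builds the query text for one container-level metric. It assumes the table exposes the columns `Name`, `Val`, `Tags`, and `TimeGenerated`, and that `Tags` holds the dimensions as JSON; the helper name `build_gpu_metric_query` is hypothetical, and the query has not been validated against a live workspace.

```python
# Container-level GPU metrics that carry a containerName dimension.
CONTAINER_GPU_METRICS = {
    "containerGpuDutyCycle",
    "containerGpuLimits",
    "containerGpuRequests",
    "containerGpumemoryTotalBytes",
    "containerGpumemoryUsedBytes",
}


def build_gpu_metric_query(metric_name, bin_minutes=5):
    """Build a log query that averages one GPU metric per container.

    Column names (Name, Val, Tags, TimeGenerated) are assumptions about
    the InsightsMetrics schema, not confirmed by this article.
    """
    if metric_name not in CONTAINER_GPU_METRICS:
        raise ValueError(f"unknown container GPU metric: {metric_name}")
    if bin_minutes < 1:
        raise ValueError("bin_minutes must be at least 1")
    return "\n".join([
        "InsightsMetrics",
        f'| where Name == "{metric_name}"',
        "| extend tags = parse_json(Tags)",
        "| summarize AvgVal = avg(Val) by "
        f"bin(TimeGenerated, {bin_minutes}m), "
        "containerName = tostring(tags.containerName)",
    ])


print(build_gpu_metric_query("containerGpuDutyCycle", bin_minutes=10))
```

The resulting text can be pasted into the Logs blade of the cluster's Log Analytics workspace. Restricting the helper to a fixed set of metric names avoids interpolating arbitrary input into the query.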
See Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS) to learn how to deploy an AKS cluster that includes GPU-enabled nodes.
Learn more about GPU Optimized VM SKUs in Microsoft Azure.
Review GPU support in Kubernetes to learn more about Kubernetes experimental support for managing GPUs across one or more nodes in a cluster.