Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

Article
01/30/2024

Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.

This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.

Supported GPU-enabled VMs

To view supported GPU-enabled VMs, see GPU-optimized VM sizes in Azure. For AKS node pools, we recommend a minimum size of Standard_NC6s_v3. The NVv4 series (based on AMD GPUs) aren't supported on AKS.

Note

GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the pricing tool and region availability.

Limitations

If you're using an Azure Linux GPU-enabled node pool, automatic security patches aren't applied, and the default behavior for the cluster is Unmanaged. For more information, see auto-upgrade.
NVadsA10 v5-series are not a recommended SKU for GPU VHD.
Updating an existing node pool to add GPU isn't supported.

Before you begin

This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
You need the Azure CLI version 2.0.64 or later installed and configured. Run az --version to find the version. If you need to install or upgrade, see Install Azure CLI.

Get the credentials for your cluster

Get the credentials for your AKS cluster using the az aks get-credentials command. The following example command gets the credentials for the myAKSCluster in the myResourceGroup resource group:
```
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```

Options for using NVIDIA GPUs

Using NVIDIA GPUs involves the installation of various NVIDIA software components such as the NVIDIA device plugin for Kubernetes, GPU driver installation, and more.

Skip GPU driver installation (preview)

AKS has automatic GPU driver installation enabled by default. In some cases, such as installing your own drivers or using the NVIDIA GPU Operator, you may want to skip GPU driver installation.

Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:

# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview

Create a node pool using the az aks nodepool add command with the --skip-gpu-driver-install flag to skip automatic GPU driver installation.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --skip-gpu-driver-install \
    --node-vm-size Standard_NC6s_v3 \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```
Adding the --skip-gpu-driver-install flag during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.

NVIDIA device plugin installation

NVIDIA device plugin installation is required when using GPUs on AKS. In some cases, the installation is handled automatically, such as when using the NVIDIA GPU Operator or the AKS GPU image (preview). Alternatively, you can manually install the NVIDIA device plugin.

Manually install the NVIDIA device plugin

You can deploy a DaemonSet for the NVIDIA device plugin, which runs a pod on each node to provide the required drivers for the GPUs. This is the recommended approach when using GPU-enabled node pools for Azure Linux.

Ubuntu Linux node pool (default SKU)
Azure Linux node pool

To use the default OS SKU, you create the node pool without specifying an OS SKU. The node pool is configured for the default operating system based on the Kubernetes version of the cluster.

Add a node pool to your cluster using the az aks nodepool add command.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```
This command adds a node pool named gpunp to myAKSCluster in myResourceGroup and uses parameters to configure the following node pool settings:
- --node-vm-size: Sets the VM size for the node in the node pool to Standard_NC6s_v3.
- --node-taints: Specifies a sku=gpu:NoSchedule taint on the node pool.
- --enable-cluster-autoscaler: Enables the cluster autoscaler.
- --min-count: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
- --max-count: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
Note

Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

To use Azure Linux, you specify the OS SKU by setting os-sku to AzureLinux during node pool creation. The os-type is set to Linux by default.

Add a node pool to your cluster using the az aks nodepool add command with the --os-sku flag set to AzureLinux.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-sku AzureLinux \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```
This command adds a node pool named gpunp to myAKSCluster in myResourceGroup and uses parameters to configure the following node pool settings:
- --node-vm-size: Sets the VM size for the node in the node pool to Standard_NC6s_v3.
- --node-taints: Specifies a sku=gpu:NoSchedule taint on the node pool.
- --enable-cluster-autoscaler: Enables the cluster autoscaler.
- --min-count: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
- --max-count: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
Note

Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time. Certain SKUs, including A100 and H100 VM SKUs, aren't available for Azure Linux. For more information, see GPU-optimized VM sizes in Azure.

Create a namespace using the kubectl create namespace command.
```
kubectl create namespace gpu-resources
```

Create a file named nvidia-device-plugin-ds.yaml and paste the following YAML manifest provided as part of the NVIDIA device plugin for Kubernetes project:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the kubectl apply command.
```
kubectl apply -f nvidia-device-plugin-ds.yaml
```
Now that you successfully installed the NVIDIA device plugin, you can check that your GPUs are schedulable and run a GPU workload.

Use NVIDIA GPU Operator with AKS

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPU including driver installation, the NVIDIA device plugin for Kubernetes, the NVIDIA container runtime, and more. Since the GPU Operator handles these components, it's not necessary to manually install the NVIDIA device plugin. This also means that the automatic GPU driver installation on AKS is no longer required.

Skip automatic GPU driver installation by creating a node pool using the az aks nodepool add command with --skip-gpu-driver-install. Adding the --skip-gpu-driver-install flag during node pool creation skips the automatic GPU driver installation. Any existing nodes aren't changed. You can scale the node pool to zero and then back up to make the change take effect.
Follow the NVIDIA documentation to Install the GPU Operator.
Now that you successfully installed the GPU Operator, you can check that your GPUs are schedulable and run a GPU workload.

Warning

We don't recommend manually installing the NVIDIA device plugin daemon set with clusters using the AKS GPU image.

Use the AKS GPU image (preview)

AKS provides a fully configured AKS image containing the NVIDIA device plugin for Kubernetes. The AKS GPU image is currently only supported for Ubuntu 18.04.

Important

Install the aks-preview Azure CLI extension using the az extension add command.
```
az extension add --name aks-preview
```
Update to the latest version of the extension using the az extension update command.
```
az extension update --name aks-preview
```
Register the GPUDedicatedVHDPreview feature flag using the az feature register command.
```
az feature register --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"
```
It takes a few minutes for the status to show Registered.

Verify the registration status using the az feature show command.

az feature show --namespace "Microsoft.ContainerService" --name "GPUDedicatedVHDPreview"

When the status reflects Registered, refresh the registration of the Microsoft.ContainerService resource provider using the az provider register command.
```
az provider register --namespace Microsoft.ContainerService
```
Now that you updated your cluster to use the AKS GPU image, you can add a node pool for GPU nodes to your cluster.
Add a node pool using the az aks nodepool add command.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
```
The previous example command adds a node pool named gpunp to myAKSCluster in myResourceGroup and uses parameters to configure the following node pool settings:
- --node-vm-size: Sets the VM size for the node in the node pool to Standard_NC6s_v3.
- --node-taints: Specifies a sku=gpu:NoSchedule taint on the node pool.
- --aks-custom-headers: Specifies a specialized AKS GPU image, UseGPUDedicatedVHD=true. If your GPU sku requires generation 2 VMs, use --aks-custom-headers UseGPUDedicatedVHD=true,usegen2vm=true instead.
- --enable-cluster-autoscaler: Enables the cluster autoscaler.
- --min-count: Configures the cluster autoscaler to maintain a minimum of one node in the node pool.
- --max-count: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.
Note

Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
Now that you successfully created a node pool using the GPU image, you can check that your GPUs are schedulable and run a GPU workload.

Confirm that GPUs are schedulable

After creating your cluster, confirm that GPUs are schedulable in Kubernetes.

List the nodes in your cluster using the kubectl get nodes command.
```
kubectl get nodes
```
Your output should look similar to the following example output:
```
NAME                   STATUS   ROLES   AGE   VERSION
aks-gpunp-28993262-0   Ready    agent   13m   v1.20.7
```

Confirm the GPUs are schedulable using the kubectl describe node command.

kubectl describe node aks-gpunp-28993262-0

Under the Capacity section, the GPU should list as nvidia.com/gpu: 1. Your output should look similar to the following condensed example output:

Name:               aks-gpunp-28993262-0
Roles:              agent
Labels:             accelerator=nvidia

[...]

Capacity:
[...]
 nvidia.com/gpu:                 1
[...]

Run a GPU-enabled workload

To see the GPU in action, you can schedule a GPU-enabled workload with the appropriate resource request. In this example, we'll run a Tensorflow job against the MNIST dataset.

Create a file named samples-tf-mnist-demo.yaml and paste the following YAML manifest, which includes a resource limit of nvidia.com/gpu: 1:

Note

If you receive a version mismatch error when calling into drivers, such as "CUDA driver version is insufficient for CUDA runtime version", review the NVIDIA driver matrix compatibility chart.

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: samples-tf-mnist-demo
  name: samples-tf-mnist-demo
spec:
  template:
    metadata:
      labels:
        app: samples-tf-mnist-demo
    spec:
      containers:
      - name: samples-tf-mnist-demo
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
           nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

Run the job using the kubectl apply command, which parses the manifest file and creates the defined Kubernetes objects.
```
kubectl apply -f samples-tf-mnist-demo.yaml
```

View the status of the GPU-enabled workload

Monitor the progress of the job using the kubectl get jobs command with the --watch flag. It may take a few minutes to first pull the image and process the dataset.
```
kubectl get jobs samples-tf-mnist-demo --watch
```
When the COMPLETIONS column shows 1/1, the job has successfully finished, as shown in the following example output:
```
NAME                    COMPLETIONS   DURATION   AGE

samples-tf-mnist-demo   0/1           3m29s      3m29s
samples-tf-mnist-demo   1/1   3m10s   3m36s
```
Exit the kubectl --watch process with Ctrl-C.
Get the name of the pod using the kubectl get pods command.
```
kubectl get pods --selector app=samples-tf-mnist-demo
```

View the output of the GPU-enabled workload using the kubectl logs command.

kubectl logs samples-tf-mnist-demo-smnr6

The following condensed example output of the pod logs confirms that the appropriate GPU device, Tesla K80, has been discovered:

2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 2fd7:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-05-16 16:08:31.396886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 2fd7:00:00.0, compute capability: 3.7)
2019-05-16 16:08:36.076962: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1081
Accuracy at step 10: 0.7457
Accuracy at step 20: 0.8233
Accuracy at step 30: 0.8644
Accuracy at step 40: 0.8848
Accuracy at step 50: 0.8889
Accuracy at step 60: 0.8898
Accuracy at step 70: 0.8979
Accuracy at step 80: 0.9087
Accuracy at step 90: 0.9099
Adding run metadata for 99
Accuracy at step 100: 0.9125
Accuracy at step 110: 0.9184
Accuracy at step 120: 0.922
Accuracy at step 130: 0.9161
Accuracy at step 140: 0.9219
Accuracy at step 150: 0.9151
Accuracy at step 160: 0.9199
Accuracy at step 170: 0.9305
Accuracy at step 180: 0.9251
Accuracy at step 190: 0.9258
Adding run metadata for 199
[...]
Adding run metadata for 499

Use Container Insights to monitor GPU usage

Container Insights with AKS monitors the following GPU usage metrics:

Metric name	Metric dimension (tags)	Description
containerGpuDutyCycle	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`	Percentage of time over the past sample period (60 seconds) during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100.
containerGpuLimits	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`	Each container can specify limits as one or more GPUs. It is not possible to request or limit a fraction of a GPU.
containerGpuRequests	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`	Each container can request one or more GPUs. It is not possible to request or limit a fraction of a GPU.
containerGpumemoryTotalBytes	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`	Amount of GPU Memory in bytes available to use for a specific container.
containerGpumemoryUsedBytes	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor`	Amount of GPU Memory in bytes used by a specific container.
nodeGpuAllocatable	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor`	Number of GPUs in a node that can be used by Kubernetes.
nodeGpuCapacity	`container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor`	Total Number of GPUs in a node.

Clean up resources

Remove the associated Kubernetes objects you created in this article using the kubectl delete job command.
```
kubectl delete jobs samples-tf-mnist-demo
```

Next steps

To run Apache Spark jobs, see Run Apache Spark jobs on AKS.
For more information on features of the Kubernetes scheduler, see Best practices for advanced scheduler features in AKS.
For more information on Azure Kubernetes Service and Azure Machine Learning, see: