Deploy container instances that use GPU resources

Article
03/20/2024

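To run certain compute-intensive workloads on Azure Container Instances, deploy your container groups with GPU resources. The container instances in the group can access one or more NVIDIA Tesla GPUs while running container workloads such as CUDA and deep learning applications.

This article shows how to add GPU resources when you deploy a container group by using a YAML file or a Resource Manager template. You can also specify GPU resources when you deploy a container instance by using the Azure portal.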

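Important

The K80 and P100 GPU SKUs were scheduled to retire by August 31, 2023, because the underlying VMs they run on (the NC series and NCv2 series) are being retired. Although V100 SKUs remain available, using Azure Kubernetes Service (AKS) instead is recommended. GPU resources are not fully supported and should not be used for production workloads. To migrate to AKS today, see: How to Migrate to AKS.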

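Important

This feature is currently in preview, and some limitations apply. Previews are made available to you on the condition that you agree to the supplemental terms of use. Some aspects of this feature may change before general availability (GA).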

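Prerequisites

Note

Due to some current limitations, not all limit increase requests are guaranteed to be approved.

If you want to use this SKU for production container deployments, create an Azure Support request to increase the limit.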

Preview limitations

In preview, the following limitations apply when you use GPU resources in container groups.

Region availability

Regions	OS	Available GPU SKUs
East US, West Europe, West US 2, Southeast Asia, Central India	Linux	V100

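Support for additional regions will be added over time.

Supported OS types: Linux only

Additional limitations: GPU resources can't be used when you deploy a container group into a virtual network.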

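About GPU resources

Count and SKU

To use GPUs in a container instance, specify a GPU resource with the following information:

Count - The number of GPUs: 1, 2, or 4.
SKU - The GPU SKU: V100. Each SKU maps to the NVIDIA Tesla GPU in one of the following Azure GPU-enabled VM families: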

SKU	VM family
V100	NCv3

Maximum resources per SKU

OS	GPU SKU	GPU count	Max CPU	Max memory (GB)	Storage (GB)
Linux	V100	1	6	112	50
Linux	V100	2	12	224	50
Linux	V100	4	24	448	50

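When you deploy GPU resources, set CPU and memory resources appropriate for the workload, up to the maximum values shown in the preceding table. These values are currently larger than the CPU and memory resources available in container groups without GPU resources.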
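
The per-SKU maximums in the table lend themselves to a quick check before you deploy. The following Python sketch is illustrative only: `check_gpu_request` and `V100_LIMITS` are hypothetical helper names, not part of any Azure SDK, and the numbers are simply copied from the table, so they may change while the feature is in preview.

```python
# Per-SKU maximums copied from the table above (Linux, V100 only).
# Keys are GPU counts; values are (max CPU cores, max memory in GB).
V100_LIMITS = {1: (6, 112), 2: (12, 224), 4: (24, 448)}


def check_gpu_request(gpu_count, cpu, memory_gb):
    """Return a list of problems with a V100 resource request (empty if none)."""
    if gpu_count not in V100_LIMITS:
        return [f"GPU count must be 1, 2, or 4, got {gpu_count}"]
    max_cpu, max_memory_gb = V100_LIMITS[gpu_count]
    problems = []
    if cpu > max_cpu:
        problems.append(f"cpu {cpu} exceeds maximum {max_cpu} for {gpu_count} GPU(s)")
    if memory_gb > max_memory_gb:
        problems.append(
            f"memoryInGB {memory_gb} exceeds maximum {max_memory_gb} for {gpu_count} GPU(s)"
        )
    return problems


if __name__ == "__main__":
    print(check_gpu_request(1, 4.0, 12.0))  # a request within the 1-GPU limits
    print(check_gpu_request(1, 8.0, 12.0))  # more CPU cores than 1 GPU allows
```

A check like this only catches requests that exceed the documented maximums; it says nothing about your subscription quota, which is a separate limit.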

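Important

Default subscription limits (quotas) for GPU resources differ by SKU. The default CPU limit for the V100 SKU is initially set to 0. To request an increase in an available region, submit an Azure support request.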

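Things to know

Deployment time - Creating a container group that contains GPU resources takes up to 8-10 minutes, because of the additional time needed to provision and configure a GPU VM in Azure.
Pricing - As with container groups without GPU resources, Azure bills for the resources consumed over the duration of a container group with GPU resources. The duration is calculated from the time the first container's image starts to be pulled until the container group terminates. It doesn't include the time to deploy the container group.

See pricing details.
CUDA drivers - Container instances with GPU resources are pre-provisioned with NVIDIA CUDA drivers and container runtimes, so you can use container images developed for CUDA workloads.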

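We support CUDA 11 at this stage. For example, you can use the following base images for your Dockerfile:
- nvidia/cuda:11.4.2-base-ubuntu20.04
- tensorflow/tensorflow:devel-gpu

Note

To improve reliability when you use a public container image from Docker Hub, import and manage the image in a private Azure container registry, and update your Dockerfile to use your privately managed base image. Learn more about working with public images.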
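
The billing window described under Pricing above can be sketched as a simple time difference. This is a minimal illustration, not Azure's billing implementation: `billable_seconds` is a hypothetical helper, the timestamps are invented for the example, and no rates are applied because this article doesn't state them.

```python
from datetime import datetime, timezone


def billable_seconds(first_image_pull_start, group_terminated):
    """Seconds between the start of the first image pull and group termination."""
    if group_terminated < first_image_pull_start:
        raise ValueError("termination time precedes the first image pull")
    return (group_terminated - first_image_pull_start).total_seconds()


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    deploy_requested = datetime(2024, 3, 20, 9, 0, tzinfo=timezone.utc)  # not part of the window
    pull_started = datetime(2024, 3, 20, 9, 8, tzinfo=timezone.utc)
    terminated = datetime(2024, 3, 20, 9, 38, tzinfo=timezone.utc)
    print(billable_seconds(pull_started, terminated))
```

Note that the time between `deploy_requested` and `pull_started` is deliberately left out of the calculation, mirroring the statement that deployment time isn't included.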

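YAML example

One way to add GPU resources is to deploy a container group by using a YAML file. Copy the following YAML into a new file named gpu-deploy-aci.yaml, then save the file. This YAML creates a container group named gpucontainergroup that specifies a container instance with a V100 GPU. The instance runs a sample CUDA vector addition application. The resource requests are sufficient to run the workload.

Note

The following example uses a public container image. To improve reliability, import and manage the image in a private Azure container registry, and update your YAML to use your privately managed base image. Learn more about working with public images.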

additional_properties: {}
apiVersion: '2021-09-01'
name: gpucontainergroup
properties:
  containers:
  - name: gpucontainer
    properties:
      image: k8s-gcrio.azureedge.net/cuda-vector-add:v0.1
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
          gpu:
            count: 1
            sku: V100
  osType: Linux
  restartPolicy: OnFailure

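Deploy the container group with the az container create command, specifying the YAML file name for the --file parameter. You need to supply the name of a resource group and a location for the container group, such as eastus, that supports GPU resources.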

az container create --resource-group myResourceGroup --file gpu-deploy-aci.yaml --location eastus

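The deployment takes several minutes to complete. Then the container starts and runs a CUDA vector addition operation. Run the az container logs command to view the log output: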

az container logs --resource-group myResourceGroup --name gpucontainergroup --container-name gpucontainer

Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
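
The block count in the log follows from the element count. The sample's source isn't shown in this article, so the sketch below assumes it uses the common CUDA ceiling-division launch pattern; `blocks_needed` is a hypothetical helper written for this illustration.

```python
def blocks_needed(num_elements, threads_per_block):
    """Smallest number of blocks whose threads cover every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    blocks = blocks_needed(50000, 256)
    print(blocks)        # 196, matching "196 blocks of 256 threads" in the log
    print(blocks * 256)  # 50176 threads in total; the last 176 have no element to process
```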

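Resource Manager template example

Another way to deploy a container group with GPU resources is to use a Resource Manager template. Start by creating a file named gpudeploy.json, then copy the following JSON into it. This example deploys a container instance with a V100 GPU that runs a TensorFlow training job against the MNIST dataset. The resource requests are sufficient to run the workload.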

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "containerGroupName": {
        "type": "string",
        "defaultValue": "gpucontainergrouprm",
        "metadata": {
          "description": "Container Group name."
        }
      }
    },
    "variables": {
      "containername": "gpucontainer",
      "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"
    },
    "resources": [
      {
        "name": "[parameters('containerGroupName')]",
        "type": "Microsoft.ContainerInstance/containerGroups",
        "apiVersion": "2021-09-01",
        "location": "[resourceGroup().location]",
        "properties": {
          "containers": [
            {
              "name": "[variables('containername')]",
              "properties": {
                "image": "[variables('containerimage')]",
                "resources": {
                  "requests": {
                    "cpu": 4.0,
                    "memoryInGb": 12.0,
                    "gpu": {
                      "count": 1,
                      "sku": "V100"
                    }
                  }
                }
              }
            }
          ],
          "osType": "Linux",
          "restartPolicy": "OnFailure"
        }
      }
    ]
}
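
Because the GPU request sits several levels deep in the template, it can be useful to confirm what a template asks for before deploying it. The following Python sketch walks a parsed template and lists each container's GPU request. `gpu_requests` is a hypothetical helper, and the embedded fragment is a trimmed copy of the template above; to check the real file, you would load gpudeploy.json with `json.load` instead. Template expressions such as `[variables('containername')]` are not evaluated, so names appear exactly as written in the file.

```python
import json

# Trimmed copy of the template above, embedded so the sketch is self-contained.
FRAGMENT = """
{
  "resources": [
    {
      "type": "Microsoft.ContainerInstance/containerGroups",
      "properties": {
        "containers": [
          {
            "name": "[variables('containername')]",
            "properties": {
              "resources": {
                "requests": {
                  "cpu": 4.0,
                  "memoryInGb": 12.0,
                  "gpu": {"count": 1, "sku": "V100"}
                }
              }
            }
          }
        ]
      }
    }
  ]
}
"""


def gpu_requests(template):
    """Yield (container name, GPU request dict) for each container that requests a GPU."""
    for resource in template.get("resources", []):
        if resource.get("type") != "Microsoft.ContainerInstance/containerGroups":
            continue
        for container in resource.get("properties", {}).get("containers", []):
            requests = container.get("properties", {}).get("resources", {}).get("requests", {})
            if "gpu" in requests:
                yield container.get("name"), requests["gpu"]


if __name__ == "__main__":
    for name, gpu in gpu_requests(json.loads(FRAGMENT)):
        print(name, gpu["count"], gpu["sku"])
```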

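Deploy the template with the az deployment group create command. You need to supply the name of a resource group that was created in a region, such as eastus, that supports GPU resources.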

az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json

The deployment takes several minutes to complete. Then the container starts and runs the TensorFlow job. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergrouprm --container-name gpucontainer

Output:

2018-10-25 18:31:10.155010: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-25 18:31:10.305937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: ccb6:00:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-10-25 18:31:10.305981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100, pci bus id: ccb6:00:00.0, compute capability: 3.7)
2018-10-25 18:31:14.941723: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999

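Clean up resources

Because using GPU resources can be expensive, make sure that your containers don't run unexpectedly for long periods. Monitor your containers in the Azure portal, or check the status of a container group with the az container show command. For example: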

az container show --resource-group myResourceGroup --name gpucontainergroup --output table

When you're done working with the container instances you created, delete them with the following commands:

az container delete --resource-group myResourceGroup --name gpucontainergroup -y
az container delete --resource-group myResourceGroup --name gpucontainergrouprm -y

Next steps

Learn more about deploying a container group by using a YAML file or a Resource Manager template.
Learn more about GPU-optimized VM sizes in Azure.
