Use RDMA or GPU instances in Batch pools

To run certain Batch jobs, you can take advantage of Azure VM sizes designed for large-scale computation. For example:

  • To run multi-instance MPI workloads, choose H-series or other sizes that have a network interface for Remote Direct Memory Access (RDMA). These sizes connect to an InfiniBand network for inter-node communication, which can accelerate MPI applications.

  • For CUDA applications, choose N-series sizes that include NVIDIA Tesla graphics processing unit (GPU) cards.

This article provides guidance and examples for using some of Azure's specialized sizes in Batch pools. For specs and background, see the Azure documentation on high performance compute VM sizes and GPU-optimized VM sizes.

Note

Certain VM sizes might not be available in the regions where you create your Batch accounts. To check that a size is available, see Products available by region and Choose a VM size for a Batch pool.

Dependencies

The RDMA or GPU capabilities of compute-intensive sizes in Batch are supported only in certain operating systems. (The list of supported operating systems is a subset of those supported for virtual machines created in these sizes.) Depending on how you create your Batch pool, you might need to install or configure additional drivers or other software on the nodes. The following tables summarize these dependencies. See the linked articles for details. Options for configuring Batch pools are described later in this article.

Linux pools - Virtual machine configuration

| Size | Capability | Operating systems | Required software | Pool settings |
|---|---|---|---|---|
| H16r, H16mr, A8, A9, NC24r, NC24rs_v2, NC24rs_v3, ND24rs* | RDMA | Ubuntu 16.04 LTS, or CentOS-based HPC (Azure Marketplace) | Intel MPI 5; Linux RDMA drivers | Enable inter-node communication, disable concurrent task execution |
| NC, NCv2, NCv3, NDv2 series | NVIDIA Tesla GPU (varies by series) | Ubuntu 16.04 LTS, or CentOS 7.3 or 7.4 (Azure Marketplace) | NVIDIA CUDA or CUDA Toolkit drivers | N/A |
| NV, NVv2 series | NVIDIA Tesla M60 GPU | Ubuntu 16.04 LTS, or CentOS 7.3 (Azure Marketplace) | NVIDIA GRID drivers | N/A |

*RDMA-capable N-series sizes also include NVIDIA Tesla GPUs

Windows pools - Virtual machine configuration

| Size | Capability | Operating systems | Required software | Pool settings |
|---|---|---|---|---|
| H16r, H16mr, A8, A9, NC24r, NC24rs_v2, NC24rs_v3, ND24rs* | RDMA | Windows Server 2016, 2012 R2, or 2012 (Azure Marketplace) | Microsoft MPI 2012 R2 or later, or Intel MPI 5; Windows RDMA drivers | Enable inter-node communication, disable concurrent task execution |
| NC, NCv2, NCv3, ND, NDv2 series | NVIDIA Tesla GPU (varies by series) | Windows Server 2016 or 2012 R2 (Azure Marketplace) | NVIDIA CUDA or CUDA Toolkit drivers | N/A |
| NV, NVv2 series | NVIDIA Tesla M60 GPU | Windows Server 2016 or 2012 R2 (Azure Marketplace) | NVIDIA GRID drivers | N/A |

*RDMA-capable N-series sizes also include NVIDIA Tesla GPUs
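
As a rough illustration of how the RDMA rows in the virtual machine configuration tables translate into pool settings, the following Python sketch maps a VM size to the two settings those rows call for. The function name is hypothetical, and the property names `enableInterNodeCommunication` and `maxTasksPerNode` are assumptions modeled on the Batch REST API; newer API versions may name the second property differently (for example `taskSlotsPerNode`), so check the version you target.

```python
# Sizes listed in the RDMA rows of the virtual machine configuration tables.
RDMA_SIZES = {
    "H16r", "H16mr", "A8", "A9",
    "NC24r", "NC24rs_v2", "NC24rs_v3", "ND24rs",
}
_RDMA_LOWER = {s.lower() for s in RDMA_SIZES}


def rdma_pool_settings(size):
    """Return the pool settings the tables list for an RDMA-capable size.

    Property names are assumed REST-style names, not confirmed by this article.
    Returns an empty dict for sizes that are not in the RDMA rows.
    """
    name = size.lower()
    # Accept names with or without the "Standard_" prefix.
    if name.startswith("standard_"):
        name = name[len("standard_"):]
    if name in _RDMA_LOWER:
        return {
            "enableInterNodeCommunication": True,  # enable inter-node communication
            "maxTasksPerNode": 1,                  # disable concurrent task execution
        }
    return {}
```

A caller could merge the returned dictionary into a pool definition, for example `pool_body.update(rdma_pool_settings("Standard_H16r"))`. GPU-only sizes such as NC6 return an empty dictionary because the tables list no pool settings for them.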

Windows pools - Cloud services configuration

Note

N-series sizes are not supported in Batch pools with the Cloud Service configuration.

| Size | Capability | Operating systems | Required software | Pool settings |
|---|---|---|---|---|
| H16r, H16mr, A8, A9 | RDMA | Windows Server 2016, 2012 R2, 2012, or 2008 R2 (Guest OS family) | Microsoft MPI 2012 R2 or later, or Intel MPI 5; Windows RDMA drivers | Enable inter-node communication, disable concurrent task execution |

Pool configuration options

To configure a specialized VM size for your Batch pool, you have several options to install required software or drivers:

  • For pools in the virtual machine configuration, choose a preconfigured Azure Marketplace VM image that has drivers and software preinstalled, such as the CentOS-based HPC image used in the Intel MPI example later in this article.

  • Create a custom Windows or Linux VM image on which you have installed drivers, software, or other settings required for the VM size.

  • Create a Batch application package from a zipped driver or application installer, and configure Batch to deploy the package to pool nodes and install once when each node is created. For example, if the application package is an installer, create a start task command line to silently install the app on all pool nodes. Consider using an application package and a pool start task if your workload depends on a particular driver version.

    Note

    The start task must run with elevated (admin) permissions, and it must wait for success. Long-running tasks will increase the time to provision a Batch pool.

  • Batch Shipyard automatically configures the GPU and RDMA drivers to work transparently with containerized workloads on Azure Batch. Batch Shipyard is driven entirely by configuration files. Many sample recipe configurations are available that enable GPU and RDMA workloads, such as the CNTK GPU Recipe, which preconfigures GPU drivers on N-series VMs and loads Microsoft Cognitive Toolkit software as a Docker image.

Example: NVIDIA GPU drivers on Windows NC VM pool

To run CUDA applications on a pool of Windows NC nodes, you need to install NVIDIA GPU drivers. The following sample steps use an application package to install the NVIDIA GPU drivers. You might choose this option if your workload depends on a specific GPU driver version.

  1. Download a setup package for the GPU drivers on Windows Server 2016 from the NVIDIA website - for example, version 411.82. Save the file locally using a short name like GPUDriverSetup.exe.
  2. Create a zip file of the package.
  3. Upload the package to your Batch account. For steps, see the application packages guidance. Specify an application id such as GPUDriver, and a version such as 411.82.
  4. Using the Batch APIs or Azure portal, create a pool in the virtual machine configuration with the desired number of nodes and scale. The following table shows sample settings to install the NVIDIA GPU drivers silently using a start task:
| Setting | Value |
|---|---|
| Image Type | Marketplace (Linux/Windows) |
| Publisher | MicrosoftWindowsServer |
| Offer | WindowsServer |
| Sku | 2016-Datacenter |
| Node size | NC6 Standard |
| Application package references | GPUDriver, version 411.82 |
| Start task enabled | True |
| Start task: Command line | cmd /c "%AZ_BATCH_APP_PACKAGE_GPUDriver#411.82%\\GPUDriverSetup.exe /s" |
| Start task: User identity | Pool autouser, admin |
| Start task: Wait for success | True |
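
The sample settings above can be sketched as a pool definition in Python. This is a minimal sketch, not an official client: it only builds a dictionary shaped like a Batch REST pool request body. The function names are hypothetical, the property names are assumed to match the REST API version you target, and the `"version": "latest"` image value and the `batch.node.windows amd64` node agent SKU are assumptions that do not appear in the settings table for this example.

```python
import json


def app_package_env_var(app_id, version):
    """Windows-style reference to the variable holding the package install path."""
    return "%AZ_BATCH_APP_PACKAGE_{}#{}%".format(app_id, version)


def windows_nc_pool_body(pool_id, node_count, app_id="GPUDriver", version="411.82"):
    """Build a pool definition that installs the GPU driver in a start task."""
    # Path separator is written exactly as in the sample settings table.
    installer = app_package_env_var(app_id, version) + r"\\GPUDriverSetup.exe"
    command_line = 'cmd /c "{} /s"'.format(installer)  # /s requests a silent install
    return {
        "id": pool_id,
        "vmSize": "STANDARD_NC6",
        "targetDedicatedNodes": node_count,
        "virtualMachineConfiguration": {
            "imageReference": {
                "publisher": "MicrosoftWindowsServer",
                "offer": "WindowsServer",
                "sku": "2016-Datacenter",
                "version": "latest",  # assumption: not stated in the table
            },
            "nodeAgentSKUId": "batch.node.windows amd64",  # assumption
        },
        "applicationPackageReferences": [
            {"applicationId": app_id, "version": version}
        ],
        "startTask": {
            "commandLine": command_line,
            # Pool-scoped auto-user with admin elevation, as in the table.
            "userIdentity": {
                "autoUser": {"scope": "pool", "elevationLevel": "admin"}
            },
            "waitForSuccess": True,
        },
    }


if __name__ == "__main__":
    print(json.dumps(windows_nc_pool_body("gpu-pool", 2), indent=2))
```

Building the command line from the application id and version keeps the start task in step with the application package reference, which matters when the workload is pinned to one driver version.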

Example: NVIDIA GPU drivers on a Linux NC VM pool

To run CUDA applications on a pool of Linux NC nodes, you need to install necessary NVIDIA Tesla GPU drivers from the CUDA Toolkit. The following sample steps create and deploy a custom Ubuntu 16.04 LTS image with the GPU drivers:

  1. Deploy an Azure NC-series VM running Ubuntu 16.04 LTS. For example, create the VM in the South Central US region.
  2. Add the NVIDIA GPU Drivers extension to the VM by using the Azure portal, a client computer that connects to the Azure subscription, or Azure Cloud Shell. Alternatively, follow the steps to connect to the VM and install CUDA drivers manually.
  3. Follow the steps to create a Shared Image Gallery image for Batch.
  4. Create a Batch account in a region that supports NC VMs.
  5. Using the Batch APIs or Azure portal, create a pool that uses the custom image, with the desired number of nodes and scale. The following table shows sample pool settings for the image:
| Setting | Value |
|---|---|
| Image Type | Custom Image |
| Custom Image | Name of the image |
| Node agent SKU | batch.node.ubuntu 16.04 |
| Node size | NC6 Standard |
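
As a sketch of how these settings might look in a pool definition, the following Python builds a dictionary shaped like a Batch REST pool request body that points at a Shared Image Gallery image version. The function names and all argument values are hypothetical placeholders, and the `virtualMachineImageId` property name is an assumption to verify against the API version you use.

```python
def shared_image_id(subscription_id, resource_group, gallery, image, version):
    """Compose the Azure resource ID of a Shared Image Gallery image version."""
    return (
        "/subscriptions/{}/resourceGroups/{}/providers/Microsoft.Compute"
        "/galleries/{}/images/{}/versions/{}"
    ).format(subscription_id, resource_group, gallery, image, version)


def linux_nc_custom_image_pool_body(pool_id, node_count, image_id):
    """Build a pool definition that uses a custom Ubuntu 16.04 LTS image."""
    return {
        "id": pool_id,
        "vmSize": "STANDARD_NC6",
        "targetDedicatedNodes": node_count,
        "virtualMachineConfiguration": {
            # Custom image: reference it by resource ID instead of
            # publisher/offer/sku.
            "imageReference": {"virtualMachineImageId": image_id},
            # The node agent SKU must match the OS of the custom image.
            "nodeAgentSKUId": "batch.node.ubuntu 16.04",
        },
    }
```

For example, `linux_nc_custom_image_pool_body("cuda-pool", 2, shared_image_id("<subscription-id>", "my-rg", "myGallery", "ubuntuCuda", "1.0.0"))` returns a definition for a two-node pool. The resource group, gallery, and image names here are placeholders, not values from this article.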

Example: Microsoft MPI on a Windows H16r VM pool

To run Windows MPI applications on a pool of Azure H16r VM nodes, you need to configure the HpcVmDrivers extension and install Microsoft MPI. Here are sample steps to deploy a custom Windows Server 2016 image with the necessary drivers and software:

  1. Deploy an Azure H16r VM running Windows Server 2016. For example, create the VM in the West US region.
  2. Add the HpcVmDrivers extension to the VM by running an Azure PowerShell command from a client computer that connects to your Azure subscription, or using Azure Cloud Shell.
  3. Make a Remote Desktop connection to the VM.
  4. Download the setup package (MSMpiSetup.exe) for the latest version of Microsoft MPI, and install Microsoft MPI.
  5. Follow the steps to create a Shared Image Gallery image for Batch.
  6. Using the Batch APIs or Azure portal, create a pool that uses the Shared Image Gallery image, with the desired number of nodes and scale. The following table shows sample pool settings for the image:
| Setting | Value |
|---|---|
| Image Type | Custom Image |
| Custom Image | Name of the image |
| Node agent SKU | batch.node.windows amd64 |
| Node size | H16r Standard |
| Internode communication enabled | True |
| Max tasks per node | 1 |

Example: Intel MPI on a Linux H16r VM pool

To run MPI applications on a pool of Linux H-series nodes, one option is to use the CentOS-based 7.4 HPC image from the Azure Marketplace. Linux RDMA drivers and Intel MPI are preinstalled. This image also supports Docker container workloads.

Using the Batch APIs or Azure portal, create a pool that uses this image, with the desired number of nodes and scale. The following table shows sample pool settings:

| Setting | Value |
|---|---|
| Image Type | Marketplace (Linux/Windows) |
| Publisher | OpenLogic |
| Offer | CentOS-HPC |
| Sku | 7.4 |
| Node size | H16r Standard |
| Internode communication enabled | True |
| Max tasks per node | 1 |
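
The two MPI-specific settings are easy to overlook, so the following Python sketch builds a pool definition from the sample settings and adds a small check for them. It is an illustration under assumptions: the function names are hypothetical, the dictionary only mirrors the shape of a Batch REST pool request body, the `batch.node.centos 7` node agent SKU and `"version": "latest"` are not stated in the table, and `maxTasksPerNode` may be named differently in newer API versions.

```python
def centos_hpc_mpi_pool_body(pool_id, node_count):
    """Build a pool definition for H16r nodes on the CentOS-HPC 7.4 image."""
    return {
        "id": pool_id,
        "vmSize": "STANDARD_H16R",
        "targetDedicatedNodes": node_count,
        "virtualMachineConfiguration": {
            "imageReference": {
                "publisher": "OpenLogic",
                "offer": "CentOS-HPC",
                "sku": "7.4",
                "version": "latest",  # assumption: not stated in the table
            },
            "nodeAgentSKUId": "batch.node.centos 7",  # assumption
        },
        "enableInterNodeCommunication": True,  # internode communication enabled
        "maxTasksPerNode": 1,                  # no concurrent tasks on a node
    }


def mpi_setting_problems(body):
    """Return a list of MPI-related settings that differ from the sample."""
    problems = []
    if body.get("enableInterNodeCommunication") is not True:
        problems.append("inter-node communication is not enabled")
    if body.get("maxTasksPerNode") != 1:
        problems.append("max tasks per node is not 1")
    return problems
```

Calling `mpi_setting_problems(centos_hpc_mpi_pool_body("mpi-pool", 4))` returns an empty list. Running the same check on a hand-edited definition before submitting it can catch a pool that would start but could not run RDMA-based MPI jobs as intended.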

Next steps

  • To run MPI jobs on an Azure Batch pool, see the Windows or Linux examples.

  • For examples of GPU workloads on Batch, see the Batch Shipyard recipes.