Install NVIDIA GPU drivers on N-series VMs running Linux

To take advantage of the GPU capabilities of Azure N-series VMs running Linux, NVIDIA graphics drivers must be installed. This article provides driver setup steps after you deploy an N-series VM. Driver setup information is also available for Windows VMs.

For N-series VM specs, storage capacities, and disk details, see GPU Linux VM sizes.

Supported distributions and drivers

NC, NCv2, NCv3, and ND-series - NVIDIA CUDA drivers

CUDA driver information in the following table is current at time of publication. For the latest CUDA drivers, visit the NVIDIA website. Ensure that you install or upgrade to the latest CUDA drivers for your distribution.

Tip

As an alternative to manual CUDA driver installation on a Linux VM, you can deploy an Azure Data Science Virtual Machine image. The DSVM editions for Ubuntu 16.04 LTS or CentOS 7.4 pre-install NVIDIA CUDA drivers, the CUDA Deep Neural Network Library, and other tools.

Supported distributions:

  • Ubuntu 16.04 LTS
  • Red Hat Enterprise Linux 7.3 or 7.4
  • CentOS-based 7.3 or 7.4, CentOS-based 7.4 HPC

Driver: NVIDIA CUDA 9.1, driver branch R390

NV-series - NVIDIA GRID drivers

Microsoft redistributes NVIDIA GRID driver installers for NV VMs. Install only these GRID drivers on Azure NV VMs. These drivers include licensing for GRID Virtual GPU Software in Azure.

Supported distributions:

  • Ubuntu 16.04 LTS
  • Red Hat Enterprise Linux 7.3 or 7.4
  • CentOS-based 7.3 or 7.4

Driver: NVIDIA GRID 6.0, driver branch R390

Warning

Installation of third-party software on Red Hat products can affect the Red Hat support terms. See the Red Hat Knowledgebase article.

Install CUDA drivers for NC, NCv2, NCv3, and ND-series VMs

Here are steps to install CUDA drivers from the NVIDIA CUDA Toolkit on N-series VMs.

C and C++ developers can optionally install the full Toolkit to build GPU-accelerated applications. For more information, see the CUDA Installation Guide.

To install CUDA drivers, make an SSH connection to each VM. To verify that the system has a CUDA-capable GPU, run the following command:

lspci | grep -i NVIDIA

You will see output similar to the following example (showing an NVIDIA Tesla K80 card):

lspci command output
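
The matching line typically looks something like the following (the PCI address at the start varies per VM and is shown here only as an illustration):

0000:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)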

Then run installation commands specific for your distribution.

Ubuntu 16.04 LTS

  1. Download and install the CUDA drivers.

    CUDA_REPO_PKG=cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
    
    wget -O /tmp/${CUDA_REPO_PKG} http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/${CUDA_REPO_PKG} 
    
    sudo dpkg -i /tmp/${CUDA_REPO_PKG}
    
    sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub 
    
    rm -f /tmp/${CUDA_REPO_PKG}
    
    sudo apt-get update
    
    sudo apt-get install cuda-drivers
    

    The installation can take several minutes.

  2. To optionally install the complete CUDA toolkit, type:

    sudo apt-get install cuda
    
  3. Reboot the VM and proceed to verify the installation.

CUDA driver updates

We recommend that you periodically update CUDA drivers after deployment.

sudo apt-get update

sudo apt-get upgrade -y

sudo apt-get dist-upgrade -y

sudo apt-get install cuda-drivers

sudo reboot
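
After the VM restarts, you can confirm which driver version is now active. For example:

nvidia-smi --query-gpu=driver_version --format=csv,noheader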

CentOS or Red Hat Enterprise Linux 7.3 or 7.4

  1. Update the kernel.

    sudo yum install kernel kernel-tools kernel-headers kernel-devel
    
    sudo reboot
    
  2. Install the latest Linux Integration Services for Hyper-V and Azure.

    wget https://aka.ms/lis
    
    tar xvzf lis
    
    cd LISISO
    
    sudo ./install.sh
    
    sudo reboot
    
  3. Reconnect to the VM and continue installation with the following commands:

    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum install dkms
    
    CUDA_REPO_PKG=cuda-repo-rhel7-9.1.85-1.x86_64.rpm
    
    wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/${CUDA_REPO_PKG} -O /tmp/${CUDA_REPO_PKG}
    
    sudo rpm -ivh /tmp/${CUDA_REPO_PKG}
    
    rm -f /tmp/${CUDA_REPO_PKG}
    
    sudo yum install cuda-drivers
    

    The installation can take several minutes.

  4. To optionally install the complete CUDA toolkit, type:

    sudo yum install cuda
    
  5. Reboot the VM and proceed to verify the installation.

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.
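
For example:

nvidia-smi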

If the driver is installed, you will see output similar to the following. Note that GPU-Util shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

NVIDIA device status

RDMA network connectivity

RDMA network connectivity can be enabled on RDMA-capable N-series VMs such as NC24r deployed in the same availability set or VM scale set. The RDMA network supports Message Passing Interface (MPI) traffic for applications running with Intel MPI 5.x or a later version. Additional requirements follow:

Distributions

Deploy RDMA-capable N-series VMs from one of the images in the Azure Marketplace that supports RDMA connectivity on N-series VMs:

  • Ubuntu 16.04 LTS - Configure RDMA drivers on the VM and register with Intel to download Intel MPI:

    1. Install dapl, rdmacm, ibverbs, and mlx4.

      sudo apt-get update
      
      sudo apt-get install libdapl2 libmlx4-1
      
    2. In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines. You need root access to edit this file.

      OS.EnableRDMA=y
      
      OS.UpdateRdmaDriver=y
      
    3. Add or change the following memory settings in KB in the /etc/security/limits.conf file. You need root access to edit this file. For testing purposes you can set memlock to unlimited. For example: <User or group name> hard memlock unlimited.

      <User or group name> hard    memlock <memory required for your application in KB>
      
      <User or group name> soft    memlock <memory required for your application in KB>
      
    4. Install Intel MPI Library. Either purchase and download the library from Intel or download the free evaluation version.

      wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/9278/l_mpi_p_5.1.3.223.tgz
      

      Only Intel MPI 5.x runtimes are supported.

      For installation steps, see the Intel MPI Library Installation Guide.

    5. Enable ptrace for non-root non-debugger processes (needed for the most recent versions of Intel MPI).

      echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
      
  • CentOS-based 7.4 HPC - RDMA drivers and Intel MPI 5.1 are installed on the VM.
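
After setup, one way to exercise the RDMA path is the pingpong test from the Intel MPI Benchmarks, run across two RDMA-capable VMs in the same availability set or scale set. The following is only a sketch: the Intel MPI install path, the hostnames vm0 and vm1, and the fabric setting are assumptions you should adjust for your deployment.

# Load the Intel MPI 5.1 environment (path depends on the version you installed)
source /opt/intel/impi/5.1.3.223/bin64/mpivars.sh

# Prefer the RDMA (DAPL) fabric for inter-node traffic
export I_MPI_FABRICS=shm:dapl

# Two-rank latency/bandwidth test, one rank on each VM
mpirun -hosts vm0,vm1 -ppn 1 -n 2 IMB-MPI1 pingpong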

Install GRID drivers for NV-series VMs

To install NVIDIA GRID drivers on NV-series VMs, make an SSH connection to each VM and follow the steps for your Linux distribution.

Ubuntu 16.04 LTS

  1. Run the lspci command. Verify that the NVIDIA M60 card or cards are visible as PCI devices.

  2. Install updates.

    sudo apt-get update
    
    sudo apt-get upgrade -y
    
    sudo apt-get dist-upgrade -y
    
    sudo apt-get install build-essential ubuntu-desktop -y
    
  3. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV VMs.) To do this, create a file named nouveau.conf in /etc/modprobe.d with the following contents:

    blacklist nouveau
    
    blacklist lbm-nouveau
    
  4. Reboot the VM and reconnect. Exit X server:

    sudo systemctl stop lightdm.service
    
  5. Download and install the GRID driver:

    wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=849941  
    
    chmod +x NVIDIA-Linux-x86_64-grid.run
    
    sudo ./NVIDIA-Linux-x86_64-grid.run
    
  6. When you're asked whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.

  7. After installation completes, copy /etc/nvidia/gridd.conf.template to create the new file /etc/nvidia/gridd.conf:

    sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
    
  8. Add the following to /etc/nvidia/gridd.conf:

    IgnoreSP=TRUE
    
  9. Reboot the VM and proceed to verify the installation.

CentOS or Red Hat Enterprise Linux

  1. Update the kernel and DKMS.

    sudo yum update
    
    sudo yum install kernel-devel
    
    sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum install dkms
    
  2. Disable the Nouveau kernel driver, which is incompatible with the NVIDIA driver. (Only use the NVIDIA driver on NV VMs.) To do this, create a file named nouveau.conf in /etc/modprobe.d with the following contents:

    blacklist nouveau
    
    blacklist lbm-nouveau
    
  3. Reboot the VM, reconnect, and install the latest Linux Integration Services for Hyper-V and Azure.

    wget https://aka.ms/lis
    
    tar xvzf lis
    
    cd LISISO
    
    sudo ./install.sh
    
    sudo reboot
    
  4. Reconnect to the VM and run the lspci command. Verify that the NVIDIA M60 card or cards are visible as PCI devices.

  5. Download and install the GRID driver:

    wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=849941  
    
    chmod +x NVIDIA-Linux-x86_64-grid.run
    
    sudo ./NVIDIA-Linux-x86_64-grid.run
    
  6. When you're asked whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.

  7. After installation completes, copy /etc/nvidia/gridd.conf.template to create the new file /etc/nvidia/gridd.conf:

    sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
    
  8. Add the following to /etc/nvidia/gridd.conf:

    IgnoreSP=TRUE
    
  9. Reboot the VM and proceed to verify the installation.

Verify driver installation

To query the GPU device state, SSH to the VM and run the nvidia-smi command-line utility installed with the driver.

If the driver is installed, you will see output similar to the following. Note that GPU-Util shows 0% unless you are currently running a GPU workload on the VM. Your driver version and GPU details may be different from the ones shown.

NVIDIA device status

X11 server

If you need an X11 server for remote connections to an NV VM, x11vnc is recommended because it allows hardware acceleration of graphics. The BusID of the M60 device must be manually added to the X11 configuration file (/etc/X11/xorg.conf on Ubuntu 16.04 LTS, /etc/X11/XF86Config on CentOS 7.3 or Red Hat Enterprise Linux 7.3). Add a "Device" section similar to the following:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla M60"
    BusID          "your-BusID:0:0:0"
EndSection

Additionally, update your "Screen" section to use this device.
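
For example, a minimal "Screen" section that references the device above might look like the following (the Identifier values are placeholders; keep whatever identifiers your existing configuration already uses):

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    DefaultDepth   24
EndSection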

The decimal BusID can be found by running the following command:

echo $((16#`/usr/bin/nvidia-smi --query-gpu=pci.bus_id --format=csv | tail -1 | cut -d ':' -f 1`))

The BusID can change when a VM gets reallocated or rebooted. Therefore, you may want to create a script to update the BusID in the X11 configuration when a VM is rebooted. For example, create a script named busidupdate.sh (or another name you choose) with the following contents:

#!/bin/bash
BUSID=$((16#`/usr/bin/nvidia-smi --query-gpu=pci.bus_id --format=csv | tail -1 | cut -d ':' -f 1`))

if grep -Fq "PCI:0@${BUSID}:0:0:0" /etc/X11/XF86Config; then
    echo "BUSID is matching"
else
    echo "BUSID changed to ${BUSID}" && sed -i '/BusID/c\    BusID          \"PCI:0@'${BUSID}':0:0:0\"' /etc/X11/XF86Config
fi

Then, create an entry for your update script in /etc/rc.d/rc3.d so the script is invoked as root on boot.
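
For example, assuming you saved the script as /usr/local/bin/busidupdate.sh (a hypothetical path and name), one way to do this is:

sudo chmod +x /usr/local/bin/busidupdate.sh
sudo ln -s /usr/local/bin/busidupdate.sh /etc/rc.d/rc3.d/S99busidupdate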

Troubleshooting

  • You can set persistence mode using nvidia-smi so the output of the command is faster when you need to query cards. To set persistence mode, execute nvidia-smi -pm 1. Note that if the VM is restarted, the mode setting goes away. You can always script the mode setting to execute upon startup.
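
    For example, one simple way to reapply the setting at startup (a sketch; an init script or other startup mechanism would also work) is a root crontab entry:

      # Run 'sudo crontab -e' and add the following line:
      @reboot /usr/bin/nvidia-smi -pm 1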

Next steps