Question

Asked by ClaudiaVanea-8710 · edited by ramr-msft

Pytorch cannot detect GPU when using an AML Compute Cluster with a GPU

Hi,

I've been trying to train a PyTorch model on an Azure ML compute cluster (STANDARD_NV6), but I cannot get the code to detect and use the GPU device: torch.cuda.is_available() always returns False.
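
Concretely, a minimal check like this at the top of the training script always reports no GPU (a quick sketch; the print labels are just illustrative):

 import torch

 # Quick sanity check at the start of the training script.
 print("cuda available:", torch.cuda.is_available())  # always False on the cluster
 print("device count:", torch.cuda.device_count())    # reports 0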

I'm using a custom environment and have tried a few different Docker base images from the Microsoft container registry. For example, I've tried the "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04" base image.
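
Roughly, the environment is set up like this (a sketch using the v1 azureml-sdk; the environment name and the "environment.yml" path are placeholders for my actual values):

 from azureml.core import Environment

 # Build the environment from the conda YAML, then swap in the GPU base image.
 env = Environment.from_conda_specification(
     name="pytorch-gpu-env",       # placeholder name
     file_path="environment.yml",  # placeholder path; the YAML is shown later in this thread
 )
 env.docker.base_image = (
     "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04"
 )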

In the build log, I can see that the correct dependencies are installed each time, but the code still doesn't detect a GPU. I tried forcing Docker to use the GPU with docker_arguments = ["--gpus", "all"] (see the sketch after the log below), but this causes the job to fail with this error:

 AzureMLCompute job failed.
 FailedStartingContainer: Unable to start docker container
     FailedContainerStart: Unable to start docker container
     err: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
 docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
    
     Reason: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
 docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
    
     Info: Failed to prepare an environment for the job execution: Job environment preparation failed on 10.0.0.5 with err exit status 1.
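
For completeness, this is roughly how the flag was passed; a sketch of the run configuration (the script name is a placeholder, and I've since removed the arguments because of the failure above):

 from azureml.core import ScriptRunConfig
 from azureml.core.runconfig import DockerConfiguration

 # Sketch: passing --gpus all through the docker arguments
 # (since removed, as it triggers the container-start failure above).
 docker_config = DockerConfiguration(use_docker=True, arguments=["--gpus", "all"])
 src = ScriptRunConfig(
     source_directory=".",
     script="train.py",  # placeholder script name
     docker_runtime_config=docker_config,
 )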

It feels like I've missed some obvious step somewhere...

Thanks for any help!

azure-machine-learning

1 Answer

Answered by ramr-msft (edited)

@ClaudiaVanea-8710 Thanks for the question. This usually points to a driver issue. Can you please add more details about the PyTorch version that you are using? We have seen cases, especially with PyTorch, where the package does not install correctly against the latest CUDA drivers. Can you please try installing the latest NVIDIA drivers?
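
One way to narrow this down is to print the driver and CUDA build info from inside the run; a rough sketch (nvidia-smi will simply fail if no driver is visible to the container):

 import subprocess
 import torch

 # Compare the CUDA version PyTorch was built against with what the node exposes.
 print("torch:", torch.__version__)
 print("built with CUDA:", torch.version.cuda)  # None means a CPU-only build was installed
 print("cuda available:", torch.cuda.is_available())

 # nvidia-smi reports the driver version; it errors out if the container
 # cannot see the GPU at all.
 try:
     print(subprocess.check_output(["nvidia-smi"]).decode())
 except (OSError, subprocess.CalledProcessError) as exc:
     print("nvidia-smi failed:", exc)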

ClaudiaVanea-8710 commented:

Thanks for getting back to me!

I am using this YAML file to create the environment, which installs pytorch 1.7.1 and cudatoolkit 10.2.89:

 name: pytorch-env
 channels:
     - defaults
     - pytorch
 dependencies:
     - python=3.8
     - pytorch
     - torchvision
     - cudatoolkit=10.2
     - pandas
     - scikit-learn
     - pip
     - pip:
         - azureml-sdk

I don't know how to install or update NVIDIA drivers on an AML compute cluster; perhaps this is the problem? How would I go about adding this step to the build?
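
If I understand the docs correctly, the NVIDIA driver ships with the GPU VM image itself rather than with the environment, so perhaps only the pytorch/cudatoolkit pins need to match it. Would something like this be the right direction? (An untested sketch; the environment name is a placeholder, and I may be wrong on the driver point.)

 from azureml.core import Environment
 from azureml.core.conda_dependencies import CondaDependencies

 # Untested sketch: pin the pytorch version explicitly and add the official
 # pytorch channel, so conda resolves a CUDA-enabled build.
 cd = CondaDependencies.create(
     python_version="3.8",
     conda_packages=[
         "pytorch=1.7.1",
         "torchvision",
         "cudatoolkit=10.2",
         "pandas",
         "scikit-learn",
     ],
     pip_packages=["azureml-sdk"],
 )
 cd.add_channel("pytorch")

 env = Environment(name="pytorch-env")  # placeholder name
 env.python.conda_dependencies = cd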
