Hi,
I've been trying to train a PyTorch model on an Azure ML compute cluster (STANDARD_NV6), but I cannot get the code to detect and use the GPU: torch.cuda.is_available() always returns False.
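For reference, this is roughly the check I run at the start of the training script (a minimal sketch, nothing workspace-specific):

```python
# Plain PyTorch diagnostic, no Azure-specific code.
import torch

def describe_cuda():
    """Return a short string describing CUDA availability."""
    if not torch.cuda.is_available():
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        return "CUDA not available (torch %s, built with CUDA: %s)" % (
            torch.__version__, torch.version.cuda)
    return "CUDA available: %d device(s), first is %s" % (
        torch.cuda.device_count(), torch.cuda.get_device_name(0))

print(describe_cuda())
```

On the cluster this always prints the "CUDA not available" branch, and it also shows whether the installed torch wheel was even built with CUDA support.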
I'm using a custom environment and have tried a few different base images from the Microsoft Container Registry, for example mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn7-ubuntu18.04.
In the build log I can see that the correct dependencies are installed each time, but the code still doesn't detect a GPU. I also tried forcing Docker to use the GPU with docker_arguments = ["--gpus", "all"], but that causes the job to fail with this error:
AzureMLCompute job failed.
FailedStartingContainer: Unable to start docker container
FailedContainerStart: Unable to start docker container
err: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Reason: warning: your kernel does not support swap limit capabilities or the cgroup is not mounted. memory limited without swap.
docker: error response from daemon: oci runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /mnt/docker/overlay2/66b78fe178db5d08ca4db26528f1a6de00aba65b528a6568649b1abcbea22348/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
Info: Failed to prepare an environment for the job execution: Job environment preparation failed on 10.0.0.5 with err exit status 1.
It feels like I've missed some obvious step somewhere...
Thanks for any help!