KevinOliver-8143 asked romungi-MSFT commented

Docker container fails to run due to invalid --gpus switch

I have been testing Azure ML experiments that run locally on my machine with Docker. So far I have hit the same issue with several curated environments as well as with an environment built from a conda dependencies file.

  • The run job is submitted successfully

  • Docker container builds successfully

  • The docker run command fails due to the --gpus all switch

This switch gets added to every Docker container I try to launch locally, regardless of the container type.
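One thing that can be ruled out quickly: the --gpus option exists only in Docker 19.03 and later, so an older installation rejects it as an unknown flag. The following is a minimal sketch of that version check. It is not part of Azure ML, the function names are hypothetical, and it may not be the cause of the failure described here.

```python
import re


def parse_docker_version(text):
    """Return (major, minor) from a version string such as '19.03.13'."""
    match = re.match(r"\s*(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Docker version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def supports_gpus_flag(version_text):
    """True if the version is at least 19.03, the release that added --gpus."""
    return parse_docker_version(version_text) >= (19, 3)
```

The version string can be read with docker version --format "{{.Server.Version}}". A passing check only means the flag is recognized; it says nothing about whether a GPU runtime is installed on the machine.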

I have tested this a few different ways:
- Using VS Code and VS Code Insiders
- Running the experiment from code
- Running the experiment through the Azure ML extension in VS Code


All attempts end with the docker container failing to run.

Any thoughts on how to fix this would be appreciated.



azure-machine-learning

@KevinOliver-8143 It would help if you could share a link to the documentation you are following to run your container, so we can check and replicate the issue. There are AzureML base images for GPU and CPU on Docker Hub; I wonder whether you are using a GPU image that might be causing the failure.



@romungi-MSFT I have used several guides, but this one specifically has instructions for running a container locally:

https://github.com/Azure/MachineLearningNotebooks/blob/3adebd11278686a23c13434b42340acb248b3133/configuration.ipynb


The steps in the configuration notebook create a compute cluster, either CPU or GPU, for the attached workspace on Azure. You can use this sample to train locally for a simple experiment. With respect to the error, though, it would help if you could detail the steps so the issue can be replicated and we can check what might be incorrect. Thanks!



@romungi-MSFT


Using the sample you suggested, here are the steps I go through to get the error:
- Run the notebook steps successfully until you reach Section 6.C.c
- Either the Run cell or the run.wait_for_completion(show_output=True) cell will fail with the same error
The results are attached in the docker_output file.
55931-docker-output.txt

Docker Error
The error received during the docker run is attached as well.
Sample
[2021-01-12T21:03:34.495116] Logging experiment running status in history service.
Running: 'docker', 'run', '--name', 'train-on-local_1610485398_90bae222', '--rm', '-v', 'C:\\Users\\kevba\\AppData\\Local\\Temp\\azureml_runs\\train-on-local_1610485398_90bae222:/azureml-run', '--shm-size', '2g', '--gpus', 'all', '-e', ...
55932-run-local-docker-fail.txt
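To confirm that the flag alone is what makes the command fail, the logged argument list can be replayed by hand without it. The helper below is a hypothetical sketch, not something the Azure ML SDK provides; it simply drops --gpus and its value from a list of docker arguments. The list in the test is illustrative, since the logged command above is truncated.

```python
def strip_gpus_flag(args):
    """Return a copy of a docker argument list without the --gpus option."""
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This element is the value that followed a bare --gpus flag.
            skip_next = False
            continue
        if arg == "--gpus":
            skip_next = True  # drop the flag and the value after it
            continue
        if arg.startswith("--gpus="):
            continue  # drop the combined --gpus=VALUE form
        cleaned.append(arg)
    return cleaned
```

If the cleaned command starts the container, the failure is specific to GPU support in the local Docker setup rather than to the image or the training script.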



0 Answers