Hi, I am training my models via Azure Machine Learning.
Until recently my training ran with GPU support, but today I found that it is running on the CPU.
I have not modified the training environment; only the training script was changed.
My compute cluster is NC6v3, which has a GPU.
I investigated and found that the training script is running on PyTorch 1.6.0.
Previously, it ran on PyTorch 1.8.1.
I think the GPU is not being used because the CUDA toolkit version is not compatible with this PyTorch version.
So I printed the list of installed packages to the log.
The log shows that PyTorch 1.8.1 is installed, yet the script imports 1.6.0.
I am confused by this strange situation.
Can someone tell me the solution?
<My code snippet>
<<conda_dependencies.yaml>>
channels:
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - python=3.8.10
  - mesa-libgl-cos6-x86_64
  - cudatoolkit=11.1
  - pytorch==1.8.1
  - torchvision==0.9.1
  - tqdm
  - scikit-learn
  - matplotlib
  - pandas
  - pip < 20.3
  - pip:
      - azureml-defaults
      - opencv-python-headless
      - pillow==8.2.0
<<Environment definition>>
from azureml.core import Environment, ScriptRunConfig
from azureml.core.runconfig import DockerConfiguration

environment_definition_file = experiment_dir / 'conda_dependencies.yaml'
environment_name = 'pytorch-1.8.1-gpu'
base_image_name = 'mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04'
environment = Environment.from_docker_image(
    environment_name,
    base_image_name,
    conda_specification=environment_definition_file)
docker_run_config = DockerConfiguration(use_docker=True)
script_run_config = ScriptRunConfig(
    source_directory=experiment_dir,
    script=SCRIPT_FILE_NAME,
    arguments=arguments,
    compute_target=compute_target,
    docker_runtime_config=docker_run_config,
    environment=environment)
<<Output a log in the training script>>
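import torch
import pip

# Print every package that pip sees in this environment.
pip.main(['list'])
# Print the version of the torch module that was actually imported.
print(f'PyTorch version: {torch.__version__}')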
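Since the suspected cause is a mismatch between the CUDA toolkit and PyTorch, the logging script could also report what the imported torch build says about CUDA. The following is only a minimal sketch, not a confirmed fix. `gpu_report` is a hypothetical helper name, and it takes the module as an argument only so that it can be exercised on a machine without torch or a GPU.

```python
def gpu_report(torch_module):
    """Collect CUDA-related facts from an already imported torch module."""
    cuda = torch_module.cuda
    available = cuda.is_available()
    return {
        "torch_version": torch_module.__version__,
        # CUDA version the torch binary was built against (None for CPU-only builds).
        "built_with_cuda": torch_module.version.cuda,
        "cuda_available": available,
        # Only query the device name when a GPU is actually visible.
        "device_name": cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this interpreter")
    else:
        for key, value in gpu_report(torch).items():
            print(f"{key}: {value}")
```

If `built_with_cuda` comes back as `None`, the imported torch is a CPU-only build, which would point to the wrong package being picked up rather than to a driver problem.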
<My logs>
Package              Version
adal                 1.2.7
applicationinsights  0.11.10
(omitted)
torch                1.8.1
torchvision          0.9.0a0
(omitted)
PyTorch version: 1.6.0
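Because the package list and the imported module disagree, one further check would be to log where the module is loaded from and which interpreter is running. This is a sketch under the assumption that two copies of torch exist on the import path; I have not confirmed that this is the cause. `describe_package` is a hypothetical helper name, and `importlib.metadata` requires Python 3.8 or later.

```python
import importlib
import importlib.metadata
import sys


def describe_package(module_name, dist_name=None):
    """Compare the imported module with the installed distribution metadata."""
    dist_name = dist_name or module_name
    info = {"interpreter": sys.executable}
    try:
        module = importlib.import_module(module_name)
        # File path of the copy that Python actually imported.
        info["imported_from"] = getattr(module, "__file__", None)
        info["imported_version"] = getattr(module, "__version__", None)
    except ImportError:
        info["imported_from"] = None
        info["imported_version"] = None
    try:
        # Version recorded in the installed package metadata.
        info["metadata_version"] = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        info["metadata_version"] = None
    return info


if __name__ == "__main__":
    for key, value in describe_package("torch").items():
        print(f"{key}: {value}")
    # The search order shows which site-packages directory wins.
    print("sys.path:", sys.path)
```

If `imported_from` points outside the conda environment's site-packages directory, for example to a user-level or base-image install, that might explain why the listed version and the imported version differ.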