I want to train an AI model. On the VM instance, running the commands below worked well:
pip install -r requirement.txt
python ~
Then, to train the AI model in the same environment on the VM compute cluster, I executed the code below in a Python 3.8 - AzureML notebook (I'm sorry I couldn't attach a screenshot):
import azureml.core
from azureml.core import Workspace
import os
from azureml.core import ScriptRunConfig
from azureml.core import Datastore
from azureml.core import Experiment
from azureml.core import Dataset
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core import Environment
import datetime
cluster_name = 'high-2x-v100-1'
gpu_name = 'Standard_NC12s_v3'
experiment_name = 'training_agent_print'
hyperparameters = [
    '--max_train_time', '172800'
]
script_folder = './script_folder'
# workspace
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')
# compute cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", cluster_name)
compute_min_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
compute_max_nodes = os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", gpu_name)
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)
# environment
env = Environment.from_pip_requirements(name = "m8-pip-training", file_path = "./requirements.txt")
exp = Experiment(workspace=ws,name=experiment_name)
# run
src = ScriptRunConfig(source_directory=script_folder,
                      script='main.py',
                      arguments=hyperparameters,
                      compute_target=compute_target,
                      environment=env
                      )
run = exp.submit(config=src)
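A side note on the cluster settings in the code above: `os.environ.get` returns a string whenever the variable is actually set, so the node counts would only be integers when the defaults are used. The helper below is a minimal sketch of reading them as integers; `env_int` is a hypothetical name I made up for illustration, not part of the AzureML SDK, and I have not confirmed that this is related to the failure.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment (hypothetical helper).

    Falls back to `default` when the variable is unset or empty.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Same variable names and defaults as in the notebook code above.
compute_min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
compute_max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
```

With this, `min_nodes` and `max_nodes` would be passed as integers whether or not the environment variables are set.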
As a result, I got the log below in the 20_image_build_log.txt file:
==> WARNING: A newer version of conda exists. <==
current version: 4.9.2
latest version: 4.10.3
Please update conda by running
$ conda update -n base -c defaults conda
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement parlai==1.3.0 (from -r /azureml-environment-setup/condaenv.5svatkzc.requirements.txt (line 55)) (from versions: 0.1.20200409, 0.1.20200416, 0.1.20200610, 0.1.20200713, 0.1.20200716, 0.8.0, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4)
ERROR: No matching distribution found for parlai==1.3.0 (from -r /azureml-environment-setup/condaenv.5svatkzc.requirements.txt (line 55))
CondaEnvException: Pip failed
The command '/bin/sh -c ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_ba289e67ead35c3dbaac125150111737 -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig' returned a non-zero code: 1
2021/08/10 15:13:41 Container failed during run: acb_step_0. No retries remaining.
failed to run step ID: acb_step_0: exit status 1
Run ID: caj failed after 2m24s. Error: failed during run, err: exit status 1
And the experiment failed. I have 3 questions:
1. Why is the compute cluster using conda to build the image even though I exported the requirements file from pip?
2. Can I build the environment using pip?
3. Since there is a WARNING, the experiment might not fail if I could update conda to the latest version. Can I update conda on the compute cluster?
Thank you so much