I am using Azure Machine Learning and have been running Python scripts on VMs to train on MNIST and output some summary statistics on the trained networks. It worked fine for the first few jobs, but when I submitted a few more, all of them failed with a UserScriptFilledDisk error:
"UserError: AzureMLCompute job failed. UserScriptFilledDisk: User script filled the disk. Consider using VM SKU with larger disk size. If the issue persists contact Azure Support."
I am using nodes with only 7 GB of disk space, but it still does not make sense to me that I could have exceeded that just by mounting MNIST and writing less than 1 MB of numpy arrays to './outputs/'. The problem does not seem to be specific to any particular nodes on my cluster: I created a new cluster and ran my scripts on it, and it throws the same error. So how can I find out which disk I have filled up, and how do I fix it and keep it from happening again?
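One thing I am considering, to at least see where the space goes, is logging disk usage from inside the training script before it dies. This is just a sketch using only the standard library (the function name `report_disk` is mine, not anything from the Azure SDK):

```python
import os
import shutil

def report_disk(path=".", top_n=5):
    """Print overall disk usage for the filesystem containing `path`,
    then the largest immediate subdirectories of `path`."""
    total, used, free = shutil.disk_usage(path)
    print(f"disk at {path}: used {used / 2**30:.2f} GiB of {total / 2**30:.2f} GiB "
          f"({free / 2**30:.2f} GiB free)")

    # Size each immediate subdirectory by walking it (can be slow on huge trees).
    sizes = []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            size = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path,
                                                         onerror=lambda e: None):
                for name in filenames:
                    try:
                        size += os.path.getsize(os.path.join(dirpath, name))
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            sizes.append((size, entry.path))
    for size, p in sorted(sizes, reverse=True)[:top_n]:
        print(f"{size / 2**20:10.1f} MiB  {p}")

report_disk(".")
```

But even with that I would like to understand why such small outputs fill a 7 GB disk in the first place.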
Thanks in advance!
More details:
I created an Azure machine learning compute cluster
import os
from azureml.core.compute import AmlCompute, ComputeTarget

compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpu-main1")
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 100))
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_DS1_V2")

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
else:
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None,
                                       timeout_in_minutes=20)
I added a data set
from azureml.core import Workspace, Dataset

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='mnist_zip')
dataset.download(target_path='.', overwrite=True)
dataset = dataset.register(workspace=workspace,
                           name='mnist_zip',
                           description='zip file with preprocessed mnist data set',
                           create_new_version=False)
I submitted jobs to the cluster
from azureml.core import ScriptRunConfig

runs = []
for i in range(30):
    args = ['--dataset', dataset.as_mount(), '--id', i]
    # also tried '.as_download()' - did not seem to make a difference
    src = ScriptRunConfig(source_directory=script_folder,
                          script='script.py',
                          arguments=args,
                          compute_target=compute_target,
                          environment=env)
    runs.append(exp.submit(config=src))
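In case it matters: the dataset is a zip, so one thing I plan to try is extracting it into a throwaway temp directory inside script.py and deleting it before the job ends, in case extracted copies piling up across runs are what fills the disk. Rough sketch (the helper name `with_extracted` is just my placeholder):

```python
import shutil
import tempfile
import zipfile

def with_extracted(zip_path):
    """Extract `zip_path` into a fresh temp directory and return its path.
    The caller deletes it with shutil.rmtree when training is done."""
    extract_dir = tempfile.mkdtemp(prefix="mnist_")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    return extract_dir

# intended usage inside script.py:
# data_dir = with_extracted(args.dataset)
# ... train ...
# shutil.rmtree(data_dir, ignore_errors=True)  # free the node's disk before exit
```

Would that kind of cleanup even help here, or is the filled disk something other than the job's working directory?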