Run a CNTK training job using the Azure CLI

The Azure CLI 2.0 allows you to create and manage Batch AI resources: you can create and delete Batch AI file servers and clusters, and submit, monitor, terminate, and delete training jobs.

This quickstart shows how to create a GPU cluster and run a training job using Microsoft Cognitive Toolkit (CNTK).

The training script ConvNet_MNIST.py is available on the Batch AI GitHub page. This script trains a convolutional neural network on the MNIST database of handwritten digits.

The official CNTK example has been modified to accept the location of the training dataset and the output directory location via command-line arguments.

Quickstart overview

  • Create a single-node GPU cluster (Standard_NC6 VM size) named nc6
  • Create a storage account to store job input and output
  • Create an Azure file share with two folders, logs and scripts, to store job output and training scripts
  • Create an Azure blob container named data to store training data
  • Deploy the training script and the training data to the created file share and container
  • Configure the job to mount the Azure file share and blob container on the cluster's node and make them available as a regular file system at $AZ_BATCHAI_JOB_MOUNT_ROOT/logs, $AZ_BATCHAI_JOB_MOUNT_ROOT/scripts, and $AZ_BATCHAI_JOB_MOUNT_ROOT/data, where AZ_BATCHAI_JOB_MOUNT_ROOT is an environment variable set by Batch AI for the job
  • Monitor the job execution by streaming its standard output
  • Inspect the job's output and generated models after it completes
  • Delete all allocated resources at the end

Prerequisites

  • Azure subscription - If you don't have an Azure subscription, create a free account before you begin.
  • Access to the Azure CLI 2.0 with version 0.3 or higher of the batchai module. You can either use the Azure CLI 2.0 available in Azure Cloud Shell or install it locally by following these instructions.

    If you are using Cloud Shell, change the working directory to /usr/$USER/clouddrive, because your home directory has limited free space:

    cd /usr/$USER/clouddrive
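
    To check the batchai module version, you can list the installed Azure CLI components; the output includes each command module and its version:

    az --version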
    

Create a resource group

An Azure resource group is a logical container for deploying and managing Azure resources. The following command creates a new resource group batchai.quickstart in the East US location:

az group create -n batchai.quickstart -l eastus

Create Batch AI workspace

The following command creates a Batch AI workspace in the resource group. A Batch AI workspace is a top-level collection of all types of Batch AI resources:

az batchai workspace create -g batchai.quickstart -n quickstart

Create GPU cluster

The following command creates a single-node GPU cluster (VM size is Standard_NC6) in the workspace, using the Ubuntu Data Science Virtual Machine (DSVM) as the operating system image:

az batchai cluster create -n nc6 -g batchai.quickstart -w quickstart -s Standard_NC6 -i UbuntuDSVM -t 1 --generate-ssh-keys

The Ubuntu DSVM allows you to run training jobs in Docker containers or to run the most popular deep learning frameworks directly on the VM.

The --generate-ssh-keys option tells the Azure CLI to generate a private and public SSH key pair if you do not already have one. You can access the cluster nodes using your current user name and the generated SSH key.

If you use Cloud Shell, it's recommended that you back up the ~/.ssh folder to permanent storage.

Example output:

{
  "allocationState": "steady",
  "allocationStateTransitionTime": "2018-04-11T21:17:26.345000+00:00",
  "creationTime": "2018-04-11T20:12:10.758000+00:00",
  "currentNodeCount": 0,
  "errors": null,
  "id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/batchai.quickstart/providers/Microsoft.BatchAI/workspaces/quickstart/clusters/nc6",
  "location": "eastus",
  "name": "nc6",
  "nodeSetup": null,
  "nodeStateCounts": {
    "additionalProperties": {},
    "idleNodeCount": 0,
    "leavingNodeCount": 0,
    "preparingNodeCount": 0,
    "runningNodeCount": 0,
    "unusableNodeCount": 0
  },
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-04-11T20:12:11.445000+00:00",
  "resourceGroup": "batchai.quickstart",
  "scaleSettings": {
    "additionalProperties": {},
    "autoScale": null,
    "manual": {
      "nodeDeallocationOption": "requeue",
      "targetNodeCount": 1
    }
  },
  "subnet": null,
  "tags": null,
  "type": "Microsoft.BatchAI/Clusters",
  "userAccountSettings": {
    "additionalProperties": {},
    "adminUserName": "myuser",
    "adminUserPassword": null,
    "adminUserSshPublicKey": "<YOUR SSH PUBLIC KEY HERE>"
  },
  "virtualMachineConfiguration": {
    "additionalProperties": {},
    "imageReference": {
      "additionalProperties": {},
      "offer": "linux-data-science-vm-ubuntu",
      "publisher": "microsoft-ads",
      "sku": "linuxdsvmubuntu",
      "version": "latest",
      "virtualMachineImageId": null
    }
  },
  "vmPriority": "dedicated",
  "vmSize": "STANDARD_NC6"
}
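
Cluster node allocation can take several minutes, and the currentNodeCount in the output above is still 0. To check whether the node is ready, you can query the cluster's allocation state and current node count:

az batchai cluster show -n nc6 -g batchai.quickstart -w quickstart --query "[allocationState, currentNodeCount]"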

Create a storage account

The following command creates a storage account in the same resource group used to create the Batch AI cluster. You use the storage account to store job input and output. Update the command with a unique storage account name.

az storage account create -n <storage account name> --sku Standard_LRS -g batchai.quickstart
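
Storage account names must be globally unique. If you want to check whether a name is available before creating the account, you can run:

az storage account check-name -n <storage account name>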

Deploy data

Download the training script and training data

  • Download and extract a preprocessed MNIST database from this location into the current folder.

For GNU/Linux or Cloud Shell:

wget "https://batchaisamples.blob.core.windows.net/samples/mnist_dataset.zip?st=2017-09-29T18%3A29%3A00Z&se=2099-12-31T08%3A00%3A00Z&sp=rl&sv=2016-05-31&sr=c&sig=PmhL%2BYnYAyNTZr1DM2JySvrI12e%2F4wZNIwCtf7TRI%2BM%3D" -O mnist_dataset.zip
unzip mnist_dataset.zip

Note: you may need to install unzip if your GNU/Linux distribution doesn't include it.
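
For example, on Debian-based distributions such as Ubuntu, unzip can typically be installed with:

sudo apt-get update && sudo apt-get install -y unzip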

  • Download the training script ConvNet_MNIST.py into the current folder.

For GNU/Linux or Cloud Shell:

wget https://raw.githubusercontent.com/Azure/BatchAI/master/recipes/CNTK/CNTK-GPU-Python/ConvNet_MNIST.py

Create Azure file share and deploy the training script

The following commands create Azure file shares named scripts and logs, and copy the training script into the cntk folder of the scripts share:

az storage share create -n scripts --account-name <storage account name>
az storage share create -n logs --account-name <storage account name>
az storage directory create -n cntk -s scripts --account-name <storage account name>
az storage file upload -s scripts --source ConvNet_MNIST.py --path cntk --account-name <storage account name> 

Create a blob container and deploy training data

The following commands create an Azure blob container named data and copy training data into the mnist_cntk folder:

az storage container create -n data --account-name <storage account name>
az storage blob upload-batch -s . --pattern '*28x28_cntk*' --destination data --destination-path mnist_cntk --account-name <storage account name>
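
To verify the upload, you can list the blobs under the mnist_cntk prefix; the dataset files should appear in the output:

az storage blob list -c data --prefix mnist_cntk --account-name <storage account name> -o table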

Submit training job

Create a Batch AI experiment

An experiment is a logical container for related Batch AI jobs. Use the following command to create an experiment in your workspace:

az batchai experiment create -g batchai.quickstart -w quickstart -n quickstart

Prepare job configuration file

Create a training job configuration file job.json with the following content, replacing <YOUR_STORAGE_ACCOUNT> with the name of the storage account created earlier.

{
    "$schema": "https://raw.githubusercontent.com/Azure/BatchAI/master/schemas/2018-03-01/cntk.json",
    "properties": {
        "nodeCount": 1,
        "cntkSettings": {
            "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/cntk/ConvNet_MNIST.py",
            "commandLineArgs": "$AZ_BATCHAI_JOB_MOUNT_ROOT/data/mnist_cntk $AZ_BATCHAI_OUTPUT_MODEL"
        },
        "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
        "outputDirectories": [{
            "id": "MODEL",
            "pathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs"
        }],
        "mountVolumes": {
            "azureFileShares": [
                {
                    "azureFileUrl": "https://<YOUR_STORAGE_ACCOUNT>.file.core.windows.net/logs",
                    "relativeMountPath": "logs"
                },
                {
                    "azureFileUrl": "https://<YOUR_STORAGE_ACCOUNT>.file.core.windows.net/scripts",
                    "relativeMountPath": "scripts"
                }
            ],
            "azureBlobFileSystems": [
                {
                    "accountName": "<YOUR_STORAGE_ACCOUNT>",
                    "containerName": "data",
                    "relativeMountPath": "data"
                }
            ]
        }
    }
}

This configuration file specifies:

  • nodeCount - number of nodes required by the job (1 for this quickstart)
  • cntkSettings - specifies the path of the training script and command-line arguments. Command-line arguments include the path to training data and the destination path for storing generated models. AZ_BATCHAI_OUTPUT_MODEL is an environment variable set by Batch AI based on output directory configuration (see below)
  • stdOutErrPathPrefix - path where Batch AI creates directories containing job output and logs
  • outputDirectories - collection of output directories to be created by Batch AI. For each directory, Batch AI creates an environment variable with name AZ_BATCHAI_OUTPUT_<id>, where <id> is the directory identifier
  • mountVolumes - list of filesystems to be mounted during the job execution. The filesystems are mounted under AZ_BATCHAI_JOB_MOUNT_ROOT/<relativeMountPath>. AZ_BATCHAI_JOB_MOUNT_ROOT is an environment variable set by Batch AI
  • The storage account name is provided at job submission time, either via the --storage-account-name parameter or via the AZURE_BATCHAI_STORAGE_ACCOUNT environment variable on your computer, so that the CLI can supply the credentials needed to mount the file shares and blob container (see the example after this list).
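
For example, instead of passing --storage-account-name with every job submission, you can set the environment variable once in your shell session, using the storage account name created earlier:

export AZURE_BATCHAI_STORAGE_ACCOUNT=<storage account name>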

Submit the job

Use the following command to submit the job on the cluster:

az batchai job create -n cntk_python_1 -c nc6 -g batchai.quickstart -w quickstart -e quickstart -f job.json --storage-account-name <storage account name>

Example output:

{
  "caffe2Settings": null,
  "caffeSettings": null,
  "chainerSettings": null,
  "cluster": {
    "id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/batchai.quickstart/providers/Microsoft.BatchAI/workspaces/quickstart/clusters/nc6",
    "resourceGroup": "batchai.quickstart"
  },
  "cntkSettings": {
    "commandLineArgs": "$AZ_BATCHAI_JOB_MOUNT_ROOT/data/mnist_cntk $AZ_BATCHAI_OUTPUT_MODEL",
    "configFilePath": null,
    "languageType": "Python",
    "processCount": 1,
    "pythonInterpreterPath": null,
    "pythonScriptFilePath": "$AZ_BATCHAI_JOB_MOUNT_ROOT/scripts/cntk/ConvNet_MNIST.py"
  },
  "constraints": {
    "maxWallClockTime": "7 days, 0:00:00"
  },
  "containerSettings": null,
  "creationTime": "2018-06-14T22:22:57.543000+00:00",
  "customMpiSettings": null,
  "customToolkitSettings": null,
  "environmentVariables": null,
  "executionInfo": {
    "endTime": null,
    "errors": null,
    "exitCode": null,
    "startTime": "2018-06-14T22:22:59.838000+00:00"
  },
  "executionState": "running",
  "executionStateTransitionTime": "2018-06-14T22:22:59.838000+00:00",
  "horovodSettings": null,
  "id": "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/batchai.quickstart/providers/Microsoft.BatchAI/workspaces/quickstart/experiments/quickstart/jobs/cntk_python_1",
  "inputDirectories": null,
  "jobOutputDirectoryPathSegment": "00000000-0000-0000-0000-000000000000/batchai.quickstart/workspaces/quickstart/experiments/quickstart/jobs/cntk_python_1/f2d6ff09-7549-4e1a-8cd8-ec839f042a61",
  "jobPreparation": null,
  "mountVolumes": {
    "azureBlobFileSystems": [
      {
        "accountName": "<YOUR STORAGE ACCOUNT NAME>",
        "containerName": "data",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "mountOptions": null,
        "relativeMountPath": "data"
      }
    ],
    "azureFileShares": [
      {
        "accountName": "<YOUR STORAGE ACCOUNT NAME>",
        "azureFileUrl": "https://<YOUR STORAGE ACCOUNT NAME>.file.core.windows.net/logs",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "logs"
      },
      {
        "accountName": "<YOUR STORAGE ACCOUNT NAME>",
        "azureFileUrl": "https://<YOUR STORAGE ACCOUNT NAME>.file.core.windows.net/scripts",
        "credentials": {
          "accountKey": null,
          "accountKeySecretReference": null
        },
        "directoryMode": "0777",
        "fileMode": "0777",
        "relativeMountPath": "scripts"
      }
    ],
    "fileServers": null,
    "unmanagedFileSystems": null
  },
  "name": "cntk_python_1",
  "nodeCount": 1,
  "outputDirectories": [
    {
      "id": "MODEL",
      "pathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
      "pathSuffix": null
    }
  ],
  "provisioningState": "succeeded",
  "provisioningStateTransitionTime": "2018-06-14T22:22:58.625000+00:00",
  "pyTorchSettings": null,
  "resourceGroup": "danlep0614b",
  "schedulingPriority": "normal",
  "secrets": null,
  "stdOutErrPathPrefix": "$AZ_BATCHAI_JOB_MOUNT_ROOT/logs",
  "tensorFlowSettings": null,
  "toolType": "cntk",
  "type": "Microsoft.BatchAI/workspaces/experiments/jobs"
}

Monitor job execution

The training script reports the training progress in the stderr.txt file in the standard output directory. Monitor the progress using the following command:

az batchai job file stream -j cntk_python_1 -g batchai.quickstart -w quickstart -e quickstart -f stderr.txt

Example output:

File found with URL "https://<YOUR STORAGE ACCOUNT>.file.core.windows.net/logs/00000000-0000-0000-0000-000000000000/batchai.quickstart/jobs/cntk_python_1/<JOB's UUID>/stdouterr/stderr.txt?sv=2016-05-31&sr=f&sig=n86JK9YowV%2BPQ%2BkBzmqr0eud%2FlpRB%2FVu%2FFlcKZx192k%3D&se=2018-04-11T23%3A05%3A54Z&sp=rl". Start streaming
Selected GPU[0] Tesla K80 as the process wide default device.
-------------------------------------------------------------------
Build info:

        Built time: Jan 31 2018 15:03:41
        Last modified date: Tue Jan 30 03:26:13 2018
        Build type: release
        Build target: GPU
        With 1bit-SGD: no
        With ASGD: yes
        Math lib: mkl
        CUDA version: 9.0.0
        CUDNN version: 7.0.4
        Build Branch: HEAD
        Build SHA1: a70455c7abe76596853f8e6a77a4d6de1e3ba76e
        MPI distribution: Open MPI
        MPI version: 1.10.7
-------------------------------------------------------------------
Training 98778 parameters in 10 parameter tensors.

Learning rate per 1 samples: 0.001
Momentum per 1 samples: 0.0
Finished Epoch[1 of 40]: [Training] loss = 0.405960 * 60000, metric = 13.01% * 60000 21.741s (2759.8 samples/s);
Finished Epoch[2 of 40]: [Training] loss = 0.106030 * 60000, metric = 3.09% * 60000 3.638s (16492.6 samples/s);
Finished Epoch[3 of 40]: [Training] loss = 0.078542 * 60000, metric = 2.32% * 60000 3.477s (17256.3 samples/s);
...
Final Results: Minibatch[1-11]: errs = 0.62% * 10000

The streaming stops when the job is completed (succeeded or failed).
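
You can also check whether the job has finished at any time by querying its execution state:

az batchai job show -n cntk_python_1 -g batchai.quickstart -w quickstart -e quickstart --query executionState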

Inspect generated model files

The job stores the generated model files in the output directory whose id attribute equals MODEL. List the model files and get their download URLs using the following command:

az batchai job file list -j cntk_python_1 -w quickstart -e quickstart -g batchai.quickstart -d MODEL

Example output:

[
  {
    "additionalProperties": {},
    "contentLength": 409456,
    "downloadUrl": "https://<YOUR STORAGE ACCOUNT>.file.core.windows.net/...",
    "isDirectory": false,
    "lastModified": "2018-04-11T22:05:51+00:00",
    "name": "ConvNet_MNIST_0.dnn"
  },
  {
    "additionalProperties": {},
    "contentLength": 409456,
    "downloadUrl": "https://<YOUR STORAGE ACCOUNT>.file.core.windows.net/...",
    "isDirectory": false,
    "lastModified": "2018-04-11T22:05:55+00:00",
    "name": "ConvNet_MNIST_1.dnn"
  },
...

Alternatively, use the Azure portal or Azure Storage Explorer to inspect the generated files. To distinguish output from the different jobs, Batch AI creates a unique folder structure for each of them. Find the path to the folder containing the output using the jobOutputDirectoryPathSegment attribute of the submitted job:

az batchai job show -n cntk_python_1 -g batchai.quickstart -w quickstart -e quickstart --query jobOutputDirectoryPathSegment

Example output:

"00000000-0000-0000-0000-000000000000/batchai.quickstart/workspaces/quickstart/experiments/quickstart/jobs/cntk_python_1/<JOB's UUID>"

Clean up resources

When no longer needed, delete the resource group and all allocated resources with the following command:

az group delete -n batchai.quickstart -y
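
Group deletion can take a few minutes. To confirm that the group is gone, you can run:

az group exists -n batchai.quickstart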