Tutorial: Train your first machine learning model (part 3 of 4)

This tutorial shows you how to train a machine learning model in Azure Machine Learning.

This tutorial is part 3 of a four-part tutorial series in which you learn the fundamentals of Azure Machine Learning and complete jobs-based machine learning tasks in Azure. This tutorial builds on the work that you completed in Part 1: Set up and Part 2: Run "Hello world!" of the series.

In this tutorial, you take the next step by submitting a script that trains a machine learning model. This example helps you understand how Azure Machine Learning enables consistent behavior between local debugging and remote runs.

In this tutorial, you:

  • Create a training script.
  • Use Conda to define an Azure Machine Learning environment.
  • Create a control script.
  • Understand Azure Machine Learning classes (Environment, Run, Metrics).
  • Submit and run your training script.
  • View your code output in the cloud.
  • Log metrics to Azure Machine Learning.
  • View your metrics in the cloud.

Prerequisites

  • Completion of Part 1: Set up and Part 2: Run "Hello world!" of this tutorial series.

Create training scripts

First, you define the neural network architecture in a model.py file. All your training code goes into the src subdirectory, including model.py.

The following code is adapted from an introductory example in the PyTorch tutorials. Note that the Azure Machine Learning concepts apply to any machine learning code, not just PyTorch.

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
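
To sanity-check the architecture, you can push a random CIFAR10-sized batch through the network. A minimal sketch (this snippet is illustrative and isn't part of the tutorial files):

import torch

from model import Net

net = Net()
dummy_batch = torch.randn(4, 3, 32, 32)  # a batch of 4 RGB images, 32 x 32 pixels (CIFAR10 size)
logits = net(dummy_batch)
print(logits.shape)  # torch.Size([4, 10]): one score per class for each image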

Next, you define the training script. This script downloads the CIFAR10 dataset by using the PyTorch torchvision.datasets API, sets up the network defined in model.py, and trains it for two epochs by using standard SGD and cross-entropy loss.

Create a train.py script in the src subdirectory:

import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor(),
)
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=4, shuffle=True, num_workers=2
)

if __name__ == "__main__":

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
                running_loss = 0.0

    print("Finished Training")

You now have the following directory structure:

tutorial
├──.azureml
|  └──config.json
├──src
|  ├──hello.py
|  ├──model.py
|  └──train.py
├──01-create-workspace.py
├──02-create-compute.py
└──03-run-hello.py

Create a new Python environment

Create a file called pytorch-env.yml in the .azureml hidden directory:


name: pytorch-env
channels:
    - defaults
    - pytorch
dependencies:
    - python=3.6.2
    - pytorch
    - torchvision

This environment has all the dependencies that your model and training script require. Notice there's no dependency on the Azure Machine Learning SDK for Python.

Test locally

In a terminal or Anaconda Prompt window, run the following commands to test your script locally in the new environment.

conda deactivate                                # If you are still using the tutorial environment, exit it
conda env create -f .azureml/pytorch-env.yml    # create the new Conda environment
conda activate pytorch-env                      # activate new Conda environment
python src/train.py                             # train model

After you run this script, you'll see the data downloaded into a directory called tutorial/data.

Create the control script

The difference between the following control script and the one that you used to submit "Hello world!" is that you add a couple of extra lines to set the environment.

Create a new Python file in the tutorial directory called 04-run-pytorch.py:

# 04-run-pytorch.py
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import ScriptRunConfig

if __name__ == "__main__":
    ws = Workspace.from_config()
    experiment = Experiment(workspace=ws, name='day1-experiment-train')
    config = ScriptRunConfig(source_directory='./src',
                             script='train.py',
                             compute_target='cpu-cluster')

    # set up pytorch environment
    env = Environment.from_conda_specification(
        name='pytorch-env',
        file_path='./.azureml/pytorch-env.yml'
    )
    config.run_config.environment = env

    run = experiment.submit(config)

    aml_url = run.get_portal_url()
    print(aml_url)
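
If you'd rather have the control script block until the run finishes and stream the logs to your terminal, the Run object supports that too. A one-line addition (not part of the tutorial script):

run.wait_for_completion(show_output=True)  # stream logs and block until the run completes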

Understand the code changes

env = ...

Azure Machine Learning provides the concept of an environment to represent a reproducible, versioned Python environment for running experiments. It's easy to create an environment from a local Conda or pip environment.
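
For example, in addition to from_conda_specification (used in the control script above), azureml-core offers other factory methods. A minimal sketch, where the requirements.txt path is hypothetical:

from azureml.core import Environment

# From a pip requirements.txt file:
env = Environment.from_pip_requirements(
    name='pytorch-env',
    file_path='./.azureml/requirements.txt'
)

# From a Conda environment that already exists on your machine:
env = Environment.from_existing_conda_environment(
    name='pytorch-env',
    conda_environment_name='pytorch-env'
)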

config.run_config.environment = env

This line attaches the environment to the ScriptRunConfig, so the remote run executes inside it.
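
Recent versions of azureml-core also accept the environment directly in the ScriptRunConfig constructor, so the following sketch is equivalent:

config = ScriptRunConfig(source_directory='./src',
                         script='train.py',
                         compute_target='cpu-cluster',
                         environment=env)  # same effect as setting config.run_config.environment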

Submit the run to Azure Machine Learning

Switch back to the tutorial environment that has the Azure Machine Learning SDK for Python installed. Since the training code isn't running on your computer, you don't need to have PyTorch installed. But you do need the azureml-sdk, which is in the tutorial environment.

conda deactivate
conda activate tutorial
python 04-run-pytorch.py

Note

The first time you run this script, Azure Machine Learning will build a new Docker image from your PyTorch environment. The whole run might take 5 to 10 minutes to complete.

You can see the Docker build logs in the Azure Machine Learning studio. Follow the link to the studio, select the Outputs + logs tab, and then select 20_image_build_log.txt.

This image is reused in future runs, which makes them much quicker.

After your image is built, select 70_driver_log.txt to see the output of your training script.

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
...
Files already downloaded and verified
epoch=1, batch= 2000: loss 2.19
epoch=1, batch= 4000: loss 1.82
epoch=1, batch= 6000: loss 1.66
epoch=1, batch= 8000: loss 1.58
epoch=1, batch=10000: loss 1.52
epoch=1, batch=12000: loss 1.47
epoch=2, batch= 2000: loss 1.39
epoch=2, batch= 4000: loss 1.38
epoch=2, batch= 6000: loss 1.37
epoch=2, batch= 8000: loss 1.33
epoch=2, batch=10000: loss 1.31
epoch=2, batch=12000: loss 1.27
Finished Training

Warning

If you see the error Your total snapshot size exceeds the limit, it means the data directory is located inside the source_directory value used in ScriptRunConfig. Move the data directory outside src.

Environments can be registered to a workspace with env.register(ws). They can then be easily shared, reused, and versioned. Environments make it easy to reproduce previous results and to collaborate with your team.
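
As a minimal sketch (assuming ws is the Workspace object from your control script), registering and later retrieving an environment looks like this:

from azureml.core import Environment

env.register(workspace=ws)  # register the environment; the workspace tracks its versions
restored_env = Environment.get(workspace=ws, name='pytorch-env')  # retrieve it by name later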

Azure Machine Learning also maintains a collection of curated environments. These environments cover common machine learning scenarios and are backed by cached Docker images. Cached Docker images make the first remote run faster.

In short, using registered environments can save you time! Read How to use environments for more information.

Log training metrics

Now that you have a model training in Azure Machine Learning, start tracking some performance metrics.

The current training script prints metrics to the terminal. Azure Machine Learning provides a mechanism for logging metrics with more functionality. By adding a few lines of code, you gain the ability to visualize metrics in the studio and to compare metrics between multiple runs.

Modify train.py to include logging

Modify your train.py script to include two more lines of code:

import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

from model import Net
from azureml.core import Run


# ADDITIONAL CODE: get AML run from the current context
run = Run.get_context()

# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=4,
    shuffle=True,
    num_workers=2
)

if __name__ == "__main__":

    # define convolutional network
    net = Net()

    # set up pytorch loss /  optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # train the network
    for epoch in range(2):

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # unpack the data
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:
                loss = running_loss / 2000
                # ADDITIONAL CODE: log loss metric to AML
                run.log('loss', loss)
                print(f'epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}')
                running_loss = 0.0

    print('Finished Training')

Understand the two additional lines of code

In train.py, you access the run object from within the training script itself by using the Run.get_context() method and use it to log metrics:

# in train.py
run = Run.get_context()

...

run.log('loss', loss)
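
The Run object can log more than single scalar values. A hedged sketch of two other logging methods in azureml-core (the metric names here are illustrative):

run.log_list('losses', [1.9, 1.5, 1.2])              # log a list of values in one call
run.log_row('sgd-settings', lr=0.001, momentum=0.9)  # log a row of name/value pairs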

Metrics in Azure Machine Learning are:

  • Organized by experiment and run, so it's easy to keep track of and compare metrics.
  • Equipped with a UI so you can visualize training performance in the studio.
  • Designed to scale, so you keep these benefits even as you run hundreds of experiments.
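
Because metrics are stored with the run, you can also read them back programmatically. A minimal sketch, assuming the experiment name from the control script:

from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-train')
run = next(experiment.get_runs())  # runs are returned newest first
metrics = run.get_metrics()
print(metrics.get('loss'))         # the list of values logged with run.log('loss', ...)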

Update the Conda environment file

The train.py script just took a new dependency on azureml.core. Update pytorch-env.yml to reflect this change:

name: pytorch-aml-env
channels:
    - defaults
    - pytorch
dependencies:
    - python=3.6.2
    - pytorch
    - torchvision
    - pip
    - pip:
        - azureml-sdk

Submit the run to Azure Machine Learning

Submit this script once more:

python 04-run-pytorch.py

This time when you visit the studio, go to the Metrics tab, where you can see live updates on the model training loss!

Training loss graph on the Metrics tab.

Next steps

In this session, you upgraded from a basic "Hello world!" script to a more realistic training script that required a specific Python environment to run. You saw how to take a local Conda environment to the cloud with Azure Machine Learning environments. Finally, you saw how in a few lines of code you can log metrics to Azure Machine Learning.

There are other ways to create Azure Machine Learning environments, including from a pip requirements.txt file or from an existing local Conda environment.

In the next session, you'll see how to work with data in Azure Machine Learning by uploading the CIFAR10 dataset to Azure.

Note

If you want to finish the tutorial series here and not progress to the next step, remember to clean up your resources.