Tutorial: Train your first machine learning model (part 3 of 4)
This tutorial shows you how to train a machine learning model in Azure Machine Learning.
This tutorial is part 3 of a four-part tutorial series in which you learn the fundamentals of Azure Machine Learning and complete jobs-based machine learning tasks in Azure. This tutorial builds on the work that you completed in Part 1: Set up and Part 2: Run "Hello world!" of the series.
In this tutorial, you take the next step by submitting a script that trains a machine learning model. This example will help you understand how Azure Machine Learning eases consistent behavior between local debugging and remote runs.
In this tutorial, you:
- Create a training script.
- Use Conda to define an Azure Machine Learning environment.
- Create a control script.
- Understand Azure Machine Learning classes (
Environment
,Run
,Metrics
). - Submit and run your training script.
- View your code output in the cloud.
- Log metrics to Azure Machine Learning.
- View your metrics in the cloud.
Prerequisites
- Anaconda or Miniconda to manage Python virtual environments and install packages.
- Completion of part1 and part 2 of the series.
Create training scripts
First you define the neural network architecture in a model.py
file. All your training code will go into the src
subdirectory, including model.py
.
The following code is taken from this introductory example from PyTorch. Note that the Azure Machine Learning concepts apply to any machine learning code, not just PyTorch.
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
def forward(self, x):
x = self.pool(F.relu(self.conv1(x)))
x = self.pool(F.relu(self.conv2(x)))
x = x.view(-1, 16 * 5 * 5)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
return x
Next you define the training script. This script downloads the CIFAR10 dataset by using PyTorch torchvision.dataset
APIs, sets up the network defined in model.py
, and trains it for two epochs by using standard SGD and cross-entropy loss.
Create a train.py
script in the src
subdirectory:
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from model import Net
# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
root="./data",
train=True,
download=True,
transform=torchvision.transforms.ToTensor(),
)
trainloader = torch.utils.data.DataLoader(
trainset, batch_size=4, shuffle=True, num_workers=2
)
if __name__ == "__main__":
# define convolutional network
net = Net()
# set up pytorch loss / optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
# train the network
for epoch in range(2):
running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# unpack the data
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
if i % 2000 == 1999:
loss = running_loss / 2000
print(f"epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}")
running_loss = 0.0
print("Finished Training")
You now have the following directory structure:
tutorial
└──.azureml
| └──config.json
└──src
| └──hello.py
| └──model.py
| └──train.py
└──01-create-workspace.py
└──02-create-compute.py
└──03-run-hello.py
Create a new Python environment
Create a file called pytorch-env.yml
in the .azureml
hidden directory:
name: pytorch-env
channels:
- defaults
- pytorch
dependencies:
- python=3.6.2
- pytorch
- torchvision
This environment has all the dependencies that your model and training script require. Notice there's no dependency on the Azure Machine Learning SDK for Python.
Test locally
In a terminal or Anaconda Prompt window, use the following code to test your script locally in the new environment.
conda deactivate # If you are still using the tutorial environment, exit it
conda env create -f .azureml/pytorch-env.yml # create the new Conda environment
conda activate pytorch-env # activate new Conda environment
python src/train.py # train model
After you run this script, you'll see the data downloaded into a directory called tutorial/data
.
Create the control script
The difference between the following control script and the one that you used to submit "Hello world!" is that you add a couple of extra lines to set the environment.
Create a new Python file in the tutorial
directory called 04-run-pytorch.py
:
# 04-run-pytorch.py
from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core import Environment
from azureml.core import ScriptRunConfig
if __name__ == "__main__":
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='day1-experiment-train')
config = ScriptRunConfig(source_directory='./src',
script='train.py',
compute_target='cpu-cluster')
# set up pytorch environment
env = Environment.from_conda_specification(
name='pytorch-env',
file_path='./.azureml/pytorch-env.yml'
)
config.run_config.environment = env
run = experiment.submit(config)
aml_url = run.get_portal_url()
print(aml_url)
Understand the code changes
env = ...
Azure Machine Learning provides the concept of an environment to represent a reproducible, versioned Python environment for running experiments. It's easy to create an environment from a local Conda or pip environment.
config.run_config.environment = env
Adds the environment to ScriptRunConfig.
Submit the run to Azure Machine Learning
Switch back to the tutorial environment that has the Azure Machine Learning SDK for Python installed. Since the training code isn't running on your computer, you don't need to have PyTorch installed. But you do need the azureml-sdk
, which is in the tutorial environment.
conda deactivate
conda activate tutorial
python 04-run-pytorch.py
Note
The first time you run this script, Azure Machine Learning will build a new Docker image from your PyTorch environment. The whole run might take 5 to 10 minutes to complete.
You can see the Docker build logs in the Azure Machine Learning studio. Follow the link to the studio, select the Outputs + logs tab, and then select 20_image_build_log.txt
.
This image will be reused in future runs to make them run much quicker.
After your image is built, select 70_driver_log.txt
to see the output of your training script.
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
...
Files already downloaded and verified
epoch=1, batch= 2000: loss 2.19
epoch=1, batch= 4000: loss 1.82
epoch=1, batch= 6000: loss 1.66
epoch=1, batch= 8000: loss 1.58
epoch=1, batch=10000: loss 1.52
epoch=1, batch=12000: loss 1.47
epoch=2, batch= 2000: loss 1.39
epoch=2, batch= 4000: loss 1.38
epoch=2, batch= 6000: loss 1.37
epoch=2, batch= 8000: loss 1.33
epoch=2, batch=10000: loss 1.31
epoch=2, batch=12000: loss 1.27
Finished Training
Warning
If you see an error Your total snapshot size exceeds the limit
, the data
directory is located in the source_directory
value used in ScriptRunConfig
.
Move data
outside src
.
Environments can be registered to a workspace with env.register(ws)
. They can then be easily shared, reused, and versioned. Environments make it easy to reproduce previous results and to collaborate with your team.
Azure Machine Learning also maintains a collection of curated environments. These environments cover common machine learning scenarios and are backed by cached Docker images. Cached Docker images make the first remote run faster.
In short, using registered environments can save you time! Read How to use environments for more information.
Log training metrics
Now that you have a model training in Azure Machine Learning, start tracking some performance metrics.
The current training script prints metrics to the terminal. Azure Machine Learning provides a mechanism for logging metrics with more functionality. By adding a few lines of code, you gain the ability to visualize metrics in the studio and to compare metrics between multiple runs.
Modify train.py
to include logging
Modify your train.py
script to include two more lines of code:
import torch
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from model import Net
from azureml.core import Run
# ADDITIONAL CODE: get AML run from the current context
run = Run.get_context()
# download CIFAR 10 data
trainset = torchvision.datasets.CIFAR10(
root='./data',
train=True,
download=True,
transform=torchvision.transforms.ToTensor()
)
trainloader = torch.utils.data.DataLoader(
trainset,
batch_size=4,
shuffle=True,
num_workers=2
)
if __name__ == "__main__":
# define convolutional network
net = Net()
# set up pytorch loss / optimizer
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
# train the network
for epoch in range(2):
running_loss = 0.0
for i, data in enumerate(trainloader, 0):
# unpack the data
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
# print statistics
running_loss += loss.item()
if i % 2000 == 1999:
loss = running_loss / 2000
# ADDITIONAL CODE: log loss metric to AML
run.log('loss', loss)
print(f'epoch={epoch + 1}, batch={i + 1:5}: loss {loss:.2f}')
running_loss = 0.0
print('Finished Training')
Understand the additional two lines of code
In train.py
, you access the run object from within the training script itself by using the Run.get_context()
method and use it to log metrics:
# in train.py
run = Run.get_context()
...
run.log('loss', loss)
Metrics in Azure Machine Learning are:
- Organized by experiment and run, so it's easy to keep track of and compare metrics.
- Equipped with a UI so you can visualize training performance in the studio.
- Designed to scale, so you keep these benefits even as you run hundreds of experiments.
Update the Conda environment file
The train.py
script just took a new dependency on azureml.core
. Update pytorch-env.yml
to reflect this change:
name: pytorch-aml-env
channels:
- defaults
- pytorch
dependencies:
- python=3.6.2
- pytorch
- torchvision
- pip
- pip:
- azureml-sdk
Submit the run to Azure Machine Learning
Submit this script once more:
python 04-run-pytorch.py
This time when you visit the studio, go to the Metrics tab where you can now see live updates on the model training loss!
Next steps
In this session, you upgraded from a basic "Hello world!" script to a more realistic training script that required a specific Python environment to run. You saw how to take a local Conda environment to the cloud with Azure Machine Learning environments. Finally, you saw how in a few lines of code you can log metrics to Azure Machine Learning.
There are other ways to create Azure Machine Learning environments, including from a pip requirements.txt file or from an existing local Conda environment.
In the next session, you'll see how to work with data in Azure Machine Learning by uploading the CIFAR10 dataset to Azure.
Note
If you want to finish the tutorial series here and not progress to the next step, remember to clean up your resources.