Tutorial: Train a classification model with automated machine learning in Azure Machine Learning service

In this tutorial, you'll learn how to generate a machine learning model using automated machine learning (automated ML). Azure Machine Learning service can perform data preprocessing, algorithm selection, and hyperparameter selection in an automated way for you. The final model can then be deployed following the workflow in the Deploy a model tutorial.

flow diagram

Similar to the train models tutorial, this tutorial classifies handwritten images of digits (0-9) from the MNIST dataset. But this time you don't need to specify an algorithm or tune hyperparameters. The automated ML technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

You'll learn how to:

  • Set up your development environment
  • Access and examine the data
  • Train using an automated classifier on your local computer
  • Explore the results
  • Review training results
  • Register the best model

If you don’t have an Azure subscription, create a free account before you begin.

Get the notebook

For your convenience, this tutorial is available as a Jupyter notebook. Run the 03.auto-train-models.ipynb notebook either in Azure Notebooks or in your own Jupyter notebook server.

Azure Notebooks - Free Jupyter based notebooks in the Azure cloud

The SDK is already installed and configured for you on Azure Notebooks.

  1. Complete the getting started quickstart to create a workspace and launch Azure Notebooks.
  2. Go to Azure Notebooks.
  3. In the Getting Started Library you created during the quickstart, go to the tutorials folder.
  4. Open the notebook.

Your own Jupyter notebook server

  1. Complete the getting started with Python SDK quickstart to install the SDK and create a workspace.
  2. Clone the GitHub repository.
  3. Copy the aml_config directory you created during the quickstart into your cloned directory.
  4. Start the notebook server from your cloned directory.
  5. Go to the tutorials folder.
  6. Open the notebook.

Set up your development environment

All the setup for your development work can be accomplished in the Python notebook. Setup includes:

  • Import Python packages
  • Configure a workspace to enable communication between your local computer and remote resources
  • Create a directory to store training scripts

Import packages

Import Python packages you need in this tutorial.

import azureml.core
import pandas as pd
from azureml.core.workspace import Workspace
from azureml.train.automl.run import AutoMLRun
import time
import logging
from sklearn import datasets
from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import random
import numpy as np

Configure workspace

Create a workspace object from the existing workspace. Workspace.from_config() reads the file aml_config/config.json and loads the details into an object named ws. ws is used throughout the rest of the code in this tutorial.

Once you have a workspace object, specify a name for the experiment and create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment.

ws = Workspace.from_config()
# project folder to save your local files
project_folder = './sample_projects/automl-local-classification'
# choose a name for the run history container in the workspace
experiment_name = 'automl-classifier'

import os

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
pd.set_option('display.max_colwidth', None)  # show full column contents
pd.DataFrame(data=output, index=['']).T

Explore data

The initial training tutorial used a high-resolution version of the MNIST dataset (28x28 pixels). Since automated ML training requires many iterations, this tutorial uses a smaller resolution version of the images (8x8 pixels) to demonstrate the concepts while speeding up the time needed for each iteration.

from sklearn import datasets

digits = datasets.load_digits()

# Exclude the first 100 rows from training so that they can be used for test.
X_train = digits.data[100:,:]
y_train = digits.target[100:]
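
As a quick sanity check on the split (a small optional addition using only scikit-learn, not part of the original notebook):

```python
from sklearn import datasets

digits = datasets.load_digits()

# Hold out the first 100 rows for testing, as above.
X_train = digits.data[100:, :]
y_train = digits.target[100:]

# load_digits ships 1,797 samples of 64 features (8x8 pixel images),
# so the training split should contain 1,697 rows.
print(X_train.shape, y_train.shape)  # (1697, 64) (1697,)
```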

Display some sample images

The data is already loaded into numpy arrays. Use matplotlib to plot 30 random images from the dataset with their labels above them.

count = 0
sample_size = 30
plt.figure(figsize = (16, 6))
for i in np.random.permutation(X_train.shape[0])[:sample_size]:
    count = count + 1
    plt.subplot(1, sample_size, count)
    plt.axhline('')
    plt.axvline('')
    plt.text(x = 2, y = -2, s = y_train[i], fontsize = 18)
    plt.imshow(X_train[i].reshape(8, 8), cmap = plt.cm.Greys)
plt.show()

A random sample of images displays:

digits

You now have the necessary packages and data ready to automatically train your model.

Train a model

To automatically train a model, first define configuration settings for the experiment and then run the experiment.

Define settings

Define the experiment settings and model settings.

| Property | Value in this tutorial | Description |
| --- | --- | --- |
| primary_metric | AUC_weighted | The metric that you want to optimize. |
| max_time_sec | 12,000 | Time limit in seconds for each iteration. |
| iterations | 20 | Number of iterations. In each iteration, the model trains with the data using a specific pipeline. |
| n_cross_validations | 3 | Number of cross-validation splits. |
| preprocess | False | True/False. Enables the experiment to perform preprocessing on the input. Preprocessing handles missing data and performs some common feature extraction. |
| exit_score | 0.9985 | Double value indicating the target for primary_metric. Once the target is surpassed, the run terminates. |
| blacklist_algos | ['kNN','LinearSVM'] | Array of strings indicating algorithms to ignore. |

from azureml.train.automl import AutoMLConfig

# Local compute
Automl_config = AutoMLConfig(task = 'classification',
                             primary_metric = 'AUC_weighted',
                             max_time_sec = 12000,
                             iterations = 20,
                             n_cross_validations = 3,
                             preprocess = False,
                             exit_score = 0.9985,
                             blacklist_algos = ['kNN','LinearSVM'],
                             X = X_train,
                             y = y_train,
                             path=project_folder)

Run the experiment

Start the experiment to run locally. Define the compute target as local and set show_output to True to view progress as the experiment runs.

from azureml.core.experiment import Experiment
experiment=Experiment(ws, experiment_name)
local_run = experiment.submit(Automl_config, show_output=True)

Output such as the following appears one line at a time as each iteration progresses. You will see a new line every 10-15 seconds.

Running locally
Parent Run ID: AutoML_ca0c807b-b7bf-4809-a963-61c6feb73ea1
***********************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE:  A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
***********************************************************************************************

 ITERATION     PIPELINE                               DURATION                METRIC      BEST
         0      Normalizer extra trees                0:00:15.955367           0.988     0.988
         1      Normalizer extra trees                0:00:14.203088           0.952     0.988
         2      Normalizer lgbm_classifier            0:00:15.089057           0.994     0.994
         3      Normalizer SGD classifier             0:00:14.866700           0.500     0.994
         4      Normalizer SGD classifier             0:00:13.740577           0.983     0.994
         5      Normalizer DT                         0:00:13.879204           0.937     0.994
         6      Normalizer SGD classifier             0:00:13.379975           0.980     0.994
         7      Normalizer lgbm_classifier            0:00:15.953293           0.997     0.997
Stopping criteria reached. Ending experiment.

Explore the results

Explore the results of the experiment with a Jupyter widget or by examining the experiment history.

Jupyter widget

Use the Jupyter notebook widget to see a graph and a table of all results.

from azureml.train.widgets import RunDetails
RunDetails(local_run).show()

Here is a static image of the widget. In the notebook, you can use the dropdown above the graph to view a graph of each available metric for each iteration.

widget table widget plot

Retrieve all iterations

View the experiment history and see individual metrics for each iteration run.

children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

import pandas as pd
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

This table shows the results:

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC_macro | 0.988094 | 0.951981 | 0.993606 | 0.5 | 0.982724 | 0.936998 | 0.979978 | 0.996639 |
| AUC_micro | 0.988104 | 0.948402 | 0.99413 | 0.463035 | 0.976078 | 0.945169 | 0.968913 | 0.997027 |
| AUC_weighted | 0.987943 | 0.952255 | 0.993513 | 0.5 | 0.982801 | 0.937292 | 0.979973 | 0.99656 |
| AUC_weighted_max | 0.987943 | 0.987943 | 0.993513 | 0.993513 | 0.993513 | 0.993513 | 0.993513 | 0.99656 |
| accuracy | 0.852093 | 0.666464 | 0.898057 | 0.0701284 | 0.832662 | 0.701827 | 0.83325 | 0.925752 |
| average_precision_score_macro | 0.929167 | 0.786258 | 0.961497 | 0.1 | 0.917486 | 0.685547 | 0.906611 | 0.977775 |
| average_precision_score_micro | 0.932596 | 0.728331 | 0.964138 | 0.0909031 | 0.880136 | 0.757538 | 0.859813 | 0.980408 |
| average_precision_score_weighted | 0.930681 | 0.788964 | 0.962007 | 0.102123 | 0.918785 | 0.692041 | 0.908293 | 0.977699 |
| balanced_accuracy | 0.917902 | 0.814509 | 0.94491 | 0.5 | 0.909248 | 0.833428 | 0.907412 | 0.959351 |
| f1_score_macro | 0.850511 | 0.643116 | 0.899262 | 0.013092 | 0.825054 | 0.691712 | 0.819627 | 0.926081 |
| f1_score_micro | 0.852093 | 0.666464 | 0.898057 | 0.0701284 | 0.832662 | 0.701827 | 0.83325 | 0.925752 |
| f1_score_weighted | 0.852134 | 0.646049 | 0.898705 | 0.00933691 | 0.830731 | 0.696538 | 0.824547 | 0.925778 |
| log_loss | 0.554364 | 1.15728 | 0.51741 | 2.30397 | 1.94009 | 1.57663 | 2.1848 | 0.250725 |
| norm_macro_recall | 0.835815 | 0.629003 | 0.890167 | 0 | 0.818755 | 0.666629 | 0.814739 | 0.918851 |
| precision_score_macro | 0.855892 | 0.707715 | 0.90195 | 0.00701284 | 0.84882 | 0.729611 | 0.855384 | 0.927881 |
| precision_score_micro | 0.852093 | 0.666464 | 0.898057 | 0.0701284 | 0.832662 | 0.701827 | 0.83325 | 0.925752 |
| precision_score_weighted | 0.859204 | 0.711918 | 0.903523 | 0.00500676 | 0.861209 | 0.737586 | 0.863524 | 0.928403 |
| recall_score_macro | 0.852234 | 0.666102 | 0.901151 | 0.1 | 0.83688 | 0.699966 | 0.833265 | 0.926966 |
| recall_score_micro | 0.852093 | 0.666464 | 0.898057 | 0.0701284 | 0.832662 | 0.701827 | 0.83325 | 0.925752 |
| recall_score_weighted | 0.852093 | 0.666464 | 0.898057 | 0.0701284 | 0.832662 | 0.701827 | 0.83325 | 0.925752 |
| weighted_accuracy | 0.851054 | 0.66639 | 0.895428 | 0.049121 | 0.829247 | 0.702754 | 0.833464 | 0.924723 |
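
With the metrics in a DataFrame like the one above (metrics as rows, iterations as columns), plain pandas can tell you which iteration won each metric. A minimal sketch using a small hypothetical subset of the values:

```python
import pandas as pd

# A hypothetical slice of the results table above:
# columns are iteration numbers, rows are metrics.
rundata = pd.DataFrame({
    0: {'AUC_weighted': 0.987943, 'log_loss': 0.554364},
    2: {'AUC_weighted': 0.993513, 'log_loss': 0.517410},
    7: {'AUC_weighted': 0.996560, 'log_loss': 0.250725},
})

# Higher is better for AUC_weighted, so idxmax picks the winner...
best_auc = rundata.loc['AUC_weighted'].idxmax()
# ...while lower is better for log_loss, so use idxmin.
best_loss = rundata.loc['log_loss'].idxmin()
print(best_auc, best_loss)  # 7 7
```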

Register the best model

Use the local_run object to get the best model and register it into the workspace.

# retrieve the best run and its fitted model, based on the primary metric (AUC_weighted).
best_run, fitted_model = local_run.get_output()

# register model in workspace
description = 'Automated Machine Learning Model'
tags = None
local_run.register_model(description=description, tags=tags)
local_run.model_id # Use this id to deploy the model as a web service in Azure

Test the best model

Use the model to predict a few random digits. Display the predicted value and the image. Red font and an inverted image (white on black) are used to highlight the misclassified samples.

Since the model accuracy is high, you might have to run the following code a few times before you can see a misclassified sample.

# find 30 random samples from test set
n = 30
X_test = digits.data[:100, :]
y_test = digits.target[:100]
sample_indices = np.random.permutation(X_test.shape[0])[0:n]
test_samples = X_test[sample_indices]


# predict using the  model
result = fitted_model.predict(test_samples)

# compare actual value vs. the predicted values:
i = 0
plt.figure(figsize = (20, 1))

for s in sample_indices:
    plt.subplot(1, n, i + 1)
    plt.axhline('')
    plt.axvline('')

    # use different color for misclassified sample
    font_color = 'red' if y_test[s] != result[i] else 'black'
    clr_map = plt.cm.gray if y_test[s] != result[i] else plt.cm.Greys

    plt.text(x = 2, y = -2, s = result[i], fontsize = 18, color = font_color)
    plt.imshow(X_test[s].reshape(8, 8), cmap = clr_map)

    i = i + 1
plt.show()

results
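
Beyond eyeballing individual digits, it can help to summarize holdout performance as a single accuracy number. A sketch of the idea, using a stand-in scikit-learn classifier in place of fitted_model (any estimator with the same fit/predict interface, including the AutoML model, works the same way):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_train, y_train = digits.data[100:, :], digits.target[100:]
X_test, y_test = digits.data[:100, :], digits.target[:100]

# Stand-in for the AutoML fitted_model from the run above.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Fraction of the 100 held-out samples classified correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {accuracy:.2f}")
```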

Clean up resources

Important

The resources you created can be used as prerequisites to other Azure Machine Learning service tutorials and how-to articles.

If you don't plan to use the resources you created here, delete them so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the far left.

    Delete in Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name, and then select Delete.

    If you see the error message "Cannot delete resource before nested resources are deleted," you must delete any nested resources first. For information on how to delete nested resources, see this troubleshooting section.

Next steps

In this Azure Machine Learning service tutorial, you used Python to:

  • Set up your development environment
  • Access and examine the data
  • Train using an automated classifier locally with custom parameters
  • Explore the results
  • Review training results
  • Register the best model

Learn more about how to configure settings for automatic training or how to use automatic training on a remote resource.