Using ROC plots and the AUC measure in Azure ML
This is the second of three articles about performance measures and graphs for binary learning models in Azure ML. Binary learning models are models which predict just one of two outcomes: positive or negative. These models are well suited to drive decisions, such as whether to administer a certain drug to a patient or to include a lead in a targeted marketing campaign.
The first article laid the foundation by covering several statistical measures: accuracy, precision, recall and F1 score. These measures require a solid understanding of the two types of prediction errors, which we also covered: false positives and false negatives. If you have not yet done so, I encourage you to read that article first:
In this second article we’ll discuss the ROC curve and the related AUC measure. We’ll also look at another graph in Azure ML called the Precision/Recall curve.
The final article will cover the threshold setting, and how to find the optimal value for it. As you will learn, this requires a good understanding of the cost of inaccurate predictions.
At the time of this writing, Azure ML is still in preview and most of the documentation you find on the internet appears to be written by data scientists for data scientists. This blog is focused on people who want to learn Azure ML, but do not have a PhD in data science.
If you’re new to Azure ML, you’ll probably soon wonder what the term “AUC” means and how to read ROC plots. Both are used to measure how well a binary learning model is performing, and you can use them to compare multiple learning models and decide which one works best for a given data set.
This is what a ROC plot in Azure ML looks like:
If you want to follow along yourself, you can get to the screen above as follows:
- Log into your Azure ML workspace.
- In the Experiments screen, click on the Samples link at the top of the screen and then click on the sample called “Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset”.
- In the design window, right-click on the output of the Evaluate item and select Visualize.
Before we dig deeper into the ROC plot, we need to take a closer look at the threshold setting.
Threshold and scoring values
When you scroll down in the screen with the ROC plot, you’ll see something like this:
The statistics you see here are for the blue curve which represents the left input of the Evaluate item in the experiment. Notice that when you drag the Threshold slider, all the numbers, except for AUC, change.
To understand why this is the case, you need to know that the learning model doesn’t just predict positive or negative as an outcome. Instead, it generates a score, a real number between 0 and 1, and then uses the threshold setting to decide whether the prediction counts as a “positive” or a “negative”. By default the threshold is set to 0.5, which means that all scores above that value are interpreted as positive and everything below it as negative. The threshold therefore affects the predictions the model makes, and thus the related statistics.
While 0.5 is a good default to start with, you will eventually want to tweak the threshold to get the best results. This is the topic of the third article in this series.
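To make the mechanics concrete, here is a minimal Python sketch of how a threshold turns scores into binary predictions. The scores and labels are made-up illustrative values, not actual Azure ML output:

```python
# Minimal sketch: a threshold converts model scores into binary predictions.
# Scores and actual labels below are invented for illustration.
scores = [0.95, 0.80, 0.62, 0.45, 0.30, 0.10]
actual = [1,    1,    0,    1,    0,    0]    # 1 = positive, 0 = negative

def predict(scores, threshold):
    """Interpret every score above the threshold as a positive prediction."""
    return [1 if s > threshold else 0 for s in scores]

# With the default threshold of 0.5, the third example becomes a false
# positive and the fourth a false negative.
print(predict(scores, 0.5))   # [1, 1, 1, 0, 0, 0]

# Lowering the threshold removes the false negative but keeps the
# false positive: one error type is traded against the other.
print(predict(scores, 0.4))   # [1, 1, 1, 1, 0, 0]
```

Dragging the Threshold slider in the Evaluate visualization does essentially this re-labeling, which is why all the counts and derived statistics change with it.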
The ROC plot
In the ROC plot shown before, two learning models are compared. The blue curve represents the model feeding the left input of the Evaluate item; the red one represents the right input pin. (In this context, performance does not refer to speed but to how well the model predicts the desired outcome.)
The blue and red lines in the plot connect the lower-left corner with the upper-right corner of the graph. Each position on a line represents how the model performs in two different dimensions for a certain threshold setting. Unfortunately the ROC plot does not clearly show which point on a line corresponds to the current threshold value (perhaps the Azure ML dev team can note this as a feature request :-)). However, the lower-left corner (0,0) represents a threshold of 1, whereas the upper-right corner (1,1) represents a threshold of 0. The two dimensions are the false positive rate on the horizontal axis and the true positive rate on the vertical axis.
In part 1 I explained how there are two different types of errors, false positives and false negatives. The two dimensions in the graph basically reflect this.
By changing the threshold, you decrease the frequency of one type of error at the expense of increasing the other. The position on the curve where the slope is 1 is where the two are in balance: if you moved the threshold slider a little bit, the number of errors of one type would increase by the same percentage as the other would decrease.
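The way each point on the curve is derived can be sketched in a few lines of Python. The scores and labels below are toy values, not output from the Adult dataset sample:

```python
# Sketch: each point on a ROC curve is a (false positive rate,
# true positive rate) pair for one particular threshold value.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1,   1,   0,   1,   1,   0,   0,   0]   # toy labels

def roc_point(scores, actual, threshold):
    """Return (FPR, TPR) for the given threshold."""
    tp = sum(1 for s, a in zip(scores, actual) if s > threshold and a == 1)
    fp = sum(1 for s, a in zip(scores, actual) if s > threshold and a == 0)
    pos = sum(actual)              # all actual positives
    neg = len(actual) - pos        # all actual negatives
    return fp / neg, tp / pos

# Threshold 1 predicts nothing positive, threshold 0 predicts
# everything positive: the two corners of the plot.
print(roc_point(scores, actual, 1.0))  # (0.0, 0.0)
print(roc_point(scores, actual, 0.0))  # (1.0, 1.0)
print(roc_point(scores, actual, 0.5))  # (0.25, 0.75)
```

Sweeping the threshold from 1 down to 0 and connecting the resulting points traces out the full curve you see in the Evaluate visualization.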
Usually, when two models are compared in a graph like the one above, the lines only intersect in the corners. The line which comes closest to the upper-left corner (in this case the blue line) provides the best predictions: for every threshold value it scores better than the other model on both false positives and false negatives. The closer a curve lies to the diagonal, the worse its performance. In practice, the worst model is one which produces random predictions with the same distribution as the class distribution. Such a model has a ROC curve which approximates a straight diagonal line from (0,0) to (1,1).
You might wonder whether curves can also occur under the diagonal line. The answer is that if a model’s ROC curve consistently ran under the diagonal, you could simply use the opposite of what the model predicts as your actual prediction, and the result would perform better than a random prediction.
The ROC plot was invented in WWII for the analysis of radar signals (see the article on Receiver operating characteristic on Wikipedia) and is commonly used by data scientists. However, I don’t find it extremely insightful, because the way I think of it, a ROC plot is about striking the right balance between the two types of errors. While the horizontal axis represents the rate of false positives, the vertical axis shows the rate of true positives, which only indirectly implies the rate of false negatives. In other words, on the horizontal axis a higher value implies worse performance, while on the vertical axis a higher value implies better performance.
The Precision/Recall curve

Azure ML also includes another graph, the Precision/Recall curve, which essentially displays the same information as the ROC plot, but in a more intuitive way.
To understand how this graph relates to false positives and false negatives, you need to remember that Precision and Recall are calculated as follows:
Recall = True Positives / (True Positives + False Negatives)
Precision = True Positives / (True Positives + False Positives)
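Translated directly into Python (the counts below are purely illustrative), the two formulas look like this:

```python
# The two formulas above, expressed directly in code.
def recall(tp, fn):
    """Fraction of all actual positives the model found."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp)

# Illustrative counts: 80 true positives, 20 false negatives,
# 40 false positives.
print(recall(80, 20))     # 0.8
print(precision(80, 40))  # ≈ 0.667
```

Note that false positives hurt precision only, while false negatives hurt recall only, which is exactly why this pair of axes maps more directly onto the two error types than the ROC axes do.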
In this curve, the “sweet spot” for the ideal model is in the upper right corner.
The AUC measure

AUC stands for “area under curve”, and as its name implies, it refers to the amount of area under the ROC curve, which theoretically is a value between 0 and 1. As explained above, the worst curve you will encounter in practice is a diagonal line, hence the AUC should never be lower than 0.5 (for large data sets).
Using the AUC metric you can quickly compare multiple learning models. Remember that the ROC curves of two models usually don’t cross each other, hence when comparing two models, the one with the higher AUC will be the better one regardless of the threshold setting. Compared to the statistical measures of accuracy, precision, recall and F1 score, AUC’s independence of the threshold makes it uniquely suited for model selection.
On the other hand, unlike accuracy, precision, recall and F1 score, AUC does not tell us what performance to expect from the model for a given threshold setting, nor can it be used to determine the optimal value for threshold. In that regard it doesn't take away the need for the other statistical measures.
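For the curious, the area itself can be approximated with the trapezoidal rule over the ROC points. The points below are made up for illustration; a real curve would come from sweeping the threshold over actual scores:

```python
# Sketch: AUC as the trapezoidal area under a list of ROC points.
def auc(points):
    """points: (fpr, tpr) pairs sorted by fpr, from (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between neighbors
    return area

# A random model hugs the diagonal and scores 0.5.
diagonal = [(0.0, 0.0), (1.0, 1.0)]
print(auc(diagonal))  # 0.5

# A curve bending toward the upper-left corner scores higher.
better = [(0.0, 0.0), (0.1, 0.7), (0.3, 0.9), (1.0, 1.0)]
print(auc(better))    # ≈ 0.86
```

This also makes it visible why AUC is threshold-independent: the area aggregates over all threshold values at once, instead of measuring performance at any single one.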
Putting it together
The ROC plot and the AUC are very useful for comparing and selecting the best machine learning model for a given data set. A model with an AUC score near 1, and where the ROC curve comes close to the upper left corner, has a very good performance. A model with a score near 0.5 will have a curve near to the diagonal and its performance is hardly better than a random predictor.
After selecting the best model, the next step is to configure the threshold and the other model configuration settings. The Precision/Recall plot can be helpful for understanding the trade-off between false positives and false negatives. In the third article we’ll look closer at the threshold value and learn how to calculate the optimal value for it.