Setting the threshold of a binary learning model in Azure ML

This is the last of three articles about performance measures and graphs for binary learning models in Azure ML. Binary learning models are models which predict just one of two outcomes: positive or negative. These models are very well suited to drive decisions, such as whether to administer a certain drug to a patient or to include a lead in a targeted marketing campaign.

This final article will cover the threshold setting, and how to find the optimal value for it. As you will learn, this requires a good understanding of error cost, that is, the cost of inaccurate predictions.

The first article laid the foundation by covering several statistical measures: accuracy, precision, recall and F1 score. These measures require a solid understanding of the two types of prediction errors which we also covered: false positives and false negatives.

Part 1: Performance measures in Azure ML: Accuracy, Precision, Recall and F1 Score

The second article discussed the ROC curve and the related AUC measure. We also looked at another graph in Azure ML called the Precision/Recall curve.

Part 2: Using ROC plots and the AUC measure in Azure ML

If you have not yet done so, I encourage you to read the other two articles first.

At the time of this writing, Azure ML is still in preview and most of the documentation you find on the internet appears to be written by data scientists for data scientists. This blog is focused on people who want to learn Azure ML, but do not have a PhD in data science.

Getting started

When you build a machine learning solution, you’ll normally first try and compare multiple models. As we discussed in part 2, you'd normally use the AUC measure to determine the best performing model. Once you have made your selection, you can remove all the other models and optimize the selected one. (In case you want to re-evaluate the models in the future with a new data set, it might be better to just create a new experiment.)

In binary models you can influence the behavior using the threshold slider. If you’re not sure how to do this and what the effect is of changing the threshold, please go back and review parts 1 and 2 of this series.

The threshold lets you change the balance of two types of prediction errors: false positives and false negatives. The cost associated with one type of error can be much higher than the cost of the other. If that’s the case, it makes sense to err on the side of safety which means you will accept a (much) higher number of cheap errors to reduce the number of expensive errors.

Therefore, in order to determine the optimal value for threshold, you first need to determine the cost of false positives and false negatives. I’ll use an example to explain this and provide a spreadsheet which helps you do the calculation.

An example

We’ll use the same data set as in the previous posts, but instead of predicting if someone is above 50 years old, we’ll act as if this is data from a quality test in a factory. The data is from a test device which tests whether a given product is thought to be faulty (positive) or not (negative). The testing data has been labeled with an additional piece of data: whether the product was in fact defective after further inspection. This allows us to train the model and to validate its performance. Based on the labeled data, we know that about 24% of the devices are in fact defective.

Now you need to determine the cost of false positives and false negatives. This will require some research and if possible you may want to include an accountant from your company to do this. Keep in mind that you need to include opportunity cost and that some of the cost, such as customer dissatisfaction and reputation damage, is difficult to quantify.

False Positive: these are devices which were tested as defective, when in fact they are not. This type of error leads to rework, and let’s assume that the cost associated with this unnecessary rework is $10 per wrong prediction.

False Negative: these are defective devices which were shipped to customers because the defects were not detected. The cost of this type of error is much higher due to extra handling, customer dissatisfaction and reputation damage. Let’s assume that the cost of this type of error is $100 per wrong prediction.

Based on the cost per error and the frequency that each type of error occurs, we can calculate the average error cost per prediction. The optimal threshold is the one with the lowest average error cost per prediction. The math is fairly straightforward, and the following spreadsheet shows the average error cost per prediction in the right-most column and the related threshold setting in the left-most column. The optimal threshold is 0.10 which has a cost of $2.70.
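
The same calculation can be sketched in a few lines of Python. This is only an illustration, not the formulas used in the spreadsheet: the function names and the toy data are made up, and only the $10 and $100 error costs come from the example above. It also assumes that a prediction counts as positive when its score is at or above the threshold.

```python
# Cost per error, taken from the example in this article.
COST_FALSE_POSITIVE = 10.0   # good device flagged as defective (rework)
COST_FALSE_NEGATIVE = 100.0  # defective device shipped to a customer


def average_error_cost(actual, scores, threshold):
    """Average error cost per prediction at a given threshold.

    actual: list of booleans, True when the device is really defective.
    scores: list of scored probabilities between 0 and 1.
    Assumption: a score >= threshold counts as a positive prediction.
    """
    false_pos = sum(1 for a, s in zip(actual, scores) if s >= threshold and not a)
    false_neg = sum(1 for a, s in zip(actual, scores) if s < threshold and a)
    total_cost = false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE
    return total_cost / len(actual)


def best_threshold(actual, scores, step=0.05):
    """Try thresholds 0.00, 0.05, ..., 1.00 and return the cheapest one.

    When several thresholds tie, the lowest one is returned.
    """
    candidates = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    return min(candidates, key=lambda t: average_error_cost(actual, scores, t))


# Made-up toy data, only to show the mechanics (not the article's data set).
toy_actual = [True, False, False, True, False, False, True, False]
toy_scores = [0.90, 0.40, 0.10, 0.30, 0.05, 0.60, 0.70, 0.20]

cheapest = best_threshold(toy_actual, toy_scores)
print(cheapest, average_error_cost(toy_actual, toy_scores, cheapest))
```

Because a false negative is ten times as expensive as a false positive here, the cheapest threshold on the toy data ends up well below 0.5, which is the same effect the spreadsheet shows on the real data.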


The table shows various performance measures for different threshold settings. There are a lot of things we can conclude from this table:

  • The default threshold of 0.5 would have resulted in an error cost of $6.50 per prediction, much higher than the optimal cost of $2.70.
  • Precision and recall are – when used independently – useless to determine the right threshold setting because they are conflicting forces. If you only focus on one of them, you’ll end up with a threshold of either 0 or 1.
  • The highest accuracy is obtained with a threshold value of 0.55. The accuracy at that point is 90.0%. As we concluded in part 1, accuracy is not a very good measure when the class distribution is uneven, and it does not take the error cost into account. This table confirms that. If we used this threshold, we’d incur 2.7x too much cost.
  • The F1 Score is the harmonic mean of Precision and Recall. If the cost of false positives and false negatives were roughly equal, this would be a pretty good measure. However, that’s not the case. If we had used the F1 score to determine the optimal threshold, we’d have incurred almost 2x too much cost.
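
To make the point about accuracy concrete, here is a small Python sketch that computes the measures from the four confusion-matrix counts and compares two outcomes with identical accuracy. The function names and the counts are invented for illustration; only the $10 and $100 costs come from the example in this article.

```python
def measures(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 = 2 * P * R / (P + R); defined as 0 when both P and R are 0.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def error_cost_per_prediction(fp, fn, total, cost_fp=10.0, cost_fn=100.0):
    """Average error cost, using this article's $10 / $100 example costs."""
    return (fp * cost_fp + fn * cost_fn) / total


# Two made-up outcomes for 100 predictions, both with 90% accuracy.
low_threshold = dict(tp=20, fp=10, tn=70, fn=0)    # errs toward false positives
high_threshold = dict(tp=10, fp=0, tn=80, fn=10)   # errs toward false negatives

for name, c in [("low", low_threshold), ("high", high_threshold)]:
    m = measures(**c)
    cost = error_cost_per_prediction(c["fp"], c["fn"], 100)
    print(name, m["accuracy"], round(m["f1"], 2), cost)
```

Both outcomes score the same on accuracy, yet the one that lets defective devices through costs ten times as much per prediction in this made-up case.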

Here is how the costs vary by threshold, displayed in a graph:


As you can see, the optimal threshold is where the grey line reaches its lowest point.

Sample spreadsheet

You can download the spreadsheet which was used to generate the table and chart here:

1207.UAC and ROC Aalyses.xlsx

To use this spreadsheet and analyze data from a different data set, you need to replace the data in the Data worksheet:

  • Column A (Actual) has the actual (labeled) value for each observation. In the spreadsheet the cells in this column contain either “Defective” or “Not defective”, but this could be anything as long as there are only two different values.
  • Column B (Scored Probabilities) has the scores of the scored data set. Every score is a real number between 0 and 1.
  • Columns C (Pos) and D (Neg) have a simple formula which sets the value to 0 or 1, depending on the value in column A. You’ll probably need to adjust these formulas to reflect the data in column A.
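
If you prefer to prepare the data outside of Excel, the Pos and Neg columns could be derived roughly as follows. This is a sketch under an assumption: I am taking Pos to be 1 for the positive label and Neg to be 1 for the other label, which may not match the spreadsheet's formulas exactly. The function name `indicator_columns` is made up.

```python
def indicator_columns(actual_labels, positive_label="Defective"):
    """Return (pos, neg) lists of 0/1 flags for a column of actual labels.

    Assumption: Pos is 1 when the label equals positive_label, Neg is 1
    for the other label. Raises ValueError if the column does not contain
    exactly two distinct values, since the analysis is for binary models.
    """
    distinct = set(actual_labels)
    if len(distinct) != 2:
        raise ValueError(f"expected two distinct labels, found {sorted(distinct)}")
    if positive_label not in distinct:
        raise ValueError(f"positive label {positive_label!r} not found")
    pos = [1 if label == positive_label else 0 for label in actual_labels]
    neg = [1 - p for p in pos]
    return pos, neg


labels = ["Defective", "Not defective", "Not defective", "Defective"]
pos, neg = indicator_columns(labels)
print(pos)  # [1, 0, 0, 1]
print(neg)  # [0, 1, 1, 0]
```

For a different data set, pass whichever of its two label values should count as positive.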

When you want to load data from another data set, it’s important that this data is already scored. Here’s how you would download the scored data for the sample experiment “Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset”.

  • Log into your Azure ML workspace.
  • In the Experiments screen, click on the Samples link in the top of the screen and then click on the sample called “Sample 5: Train, Test, Evaluate for Binary Classification: Adult Dataset” .
  • Click the Save As button in the bottom of the screen and save the experiment under a different name. This will enable us to edit this sample.
  • Add a “Convert to CSV” item to the workspace and connect its input to the output of the Score Model item which you wish to analyze (see screenshot below).
  • Run the model.
  • After the model has completed running, right-click the output of the Convert to CSV item and select Download.
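
Once you have the CSV file, you can also read it with a short Python script instead of pasting it into the spreadsheet. The sketch below makes a few assumptions you should check against your own file: that the score column is named “Scored Probabilities”, that the label column is named “income”, and that “>50K” is the positive value. The function name `load_scored_rows` and the sample rows are made up.

```python
import csv
import io


def load_scored_rows(file_obj, label_column, positive_label,
                     score_column="Scored Probabilities"):
    """Read scored data and return a list of (is_positive, score) pairs.

    Column names are assumptions; adjust them to match your CSV header.
    Labels are stripped because exported values may carry leading spaces.
    """
    rows = []
    for record in csv.DictReader(file_obj):
        actual = record[label_column].strip() == positive_label
        score = float(record[score_column])
        rows.append((actual, score))
    return rows


# In-memory stand-in for a downloaded file, with made-up rows.
sample_csv = io.StringIO(
    "income,Scored Labels,Scored Probabilities\n"
    " >50K, >50K,0.91\n"
    " <=50K, <=50K,0.12\n"
    " <=50K, >50K,0.58\n"
)
rows = load_scored_rows(sample_csv, label_column="income", positive_label=">50K")
print(rows)
```

To read a real download, replace the in-memory sample with something like `open("scored.csv", newline="")`, where the file name is whatever you chose when saving the download.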


Putting it together

This was the last of a series of three articles on using performance measures and plots in Azure ML. We discovered that for model selection, the ROC plot and AUC measure are the best and easiest to use. Once the best model has been selected, you can optimize its performance by adjusting the threshold, as we discussed in this last article.

I hope you found this useful. Please leave a comment or send me an email if you have any further suggestions.