Article
01/22/2016

April 2013

Volume 28 Number 04

CLR - Classification and Prediction Using Adaptive Boosting

By James McCaffrey | April 2013

Classification is a machine-learning technique that uses training data to generate a model (usually a single complex rule or mathematical equation) that assigns data items to one of several distinct categories. The model can then be used to make predictions about new data items whose category is unknown. Examples include predicting whether a patient has cancer (yes, no) based on various medical test data, and predicting the risk category (low, medium, high) of a loan application based on an applicant’s financial history.

There are many different classification algorithms and techniques, including Naive Bayes, neural network and logistic regression. In this article I explain a fascinating technique called adaptive boosting classification in which, instead of attempting to determine a single complex prediction rule, training data is used to generate a large collection of very simple crude rules of thumb. A weight for each rule of thumb is then computed. A prediction about new input is made by combining the rules of thumb, taking into account each simple rule’s weight and arriving at a consensus outcome. The term “boosting” comes from the fact that the predictive quality of the simple rules is boosted (improved) by combining them.

Adaptive boosting is a meta-heuristic. By that I mean adaptive boosting is a set of guidelines that can be used to create a specific classification algorithm. There are many variations of adaptive boosting algorithms and there are many existing standalone tools that implement some form of adaptive boosting, so why bother to code adaptive boosting classification from scratch? Existing adaptive boosting classification tools can be difficult or impossible to customize, they might be difficult to integrate into a software system, and they may have copyright or intellectual property issues.

A concrete example is the best way to understand what adaptive boosting is. Take a look at Figure 1. The ultimate problem of the demo program is to predict whether some sports team from Seattle will win or lose an upcoming contest against a team from Detroit, when Seattle will be playing on its home field and the betting consensus (the spread) is that Seattle has a slight advantage (small). The top part of Figure 1 shows 10 hypothetical training data items with known results. The first training tuple (Detroit, Home, Large, Win) means that in a previous game against Detroit, when Seattle played at home and the point spread was large, Seattle won the game.

Figure 1 Adaptive Boosting Classification and Prediction

The next part of the demo program shows that the 10 training data items were converted from string data into zero-based integer form for more efficient processing and then stored into machine memory as a matrix. For example, (Detroit, Home, Large, Win) is stored as (3, 0, 2, 1). Notice that in this example, the outcome to be predicted has just two values, win or lose. This is called a binary classification problem. Adaptive boosting can also be used in situations where the dependent variable has three or more values (multinomial classification). Most binary classification techniques encode the dependent variable to be predicted using a (0, 1) scheme, but adaptive boosting almost always uses a (-1, +1) scheme because that encoding slightly simplifies some of the algorithm implementation. Notice that all of the independent variable predictor values are categorical (“Home,” “Medium” and so on) rather than numerical. Adaptive boosting can also be used when training data has numerical values, as I’ll explain later.

The third part of Figure 1 shows that eight rules of thumb were generated from the training data. Rules of thumb are often called weak learners or weak classifiers in adaptive boosting terminology. The first weak learner, expressed in human-friendly form, is “IF Opponent IS Buffalo THEN Result IS Win” with a 0.00 raw error rate. The second weak learner, which is more illustrative, is “IF Opponent IS Chicago THEN Result IS Lose” with a 0.33 error rate. Where did this weak learner come from? If you examine the training data, you’ll see that there are three instances where the opponent is Chicago. In two of those training instances the result was Lose, so the weak learner is correct two out of three times (0.67) and wrong one time out of three (0.33).

Notice that not all training item predictor values generated a weak learner—there’s no rule for when the opponent is Atlanta. Because there are two training tuples where opponent is Atlanta, and one outcome is Win and the other outcome is Lose, the error rate for the learner would be 0.50, which doesn’t provide any useful information. The version of adaptive boosting classification presented in this article assumes that all weak learners have a raw error rate that is less than 0.50.

The next section of Figure 1 indicates that behind the scenes the adaptive boosting algorithm processes the weak learners to find a weight for each learner. These weights are measures of the importance of each weak classifier, and are called alpha values in adaptive boosting terminology. Determining the alpha values is the key part of adaptive boosting. The set of weak learners and their alpha weights make up the classification model.

The last part of the demo program shows the classification model being used to make a prediction for team Seattle when the opponent team is Detroit, the field is Home and the point spread is Small. If you refer to the set of generated weak learners, you’ll see that three learners are applicable: [2] IF Opponent IS Detroit THEN Result IS Win (+1), [3] IF Field IS Home THEN Result IS Win (+1), and [5] IF Spread IS Small THEN Result IS Lose (-1). The computed alpha values for weak learners 2, 3, and 5 are 0.63, 3.15, and 4.49, respectively. Therefore, the consensus prediction = (0.63)(+1) + (3.15)(+1) + (4.49)(-1) = -0.72 (rounded), which, because the value is negative, means Lose. Notice that even though two of three weak learners (2 and 3) predict Win, the large alpha of weak learner 5 outweighs those Win predictions to yield an overall prediction of Lose.

In the sections that follow I’ll carefully explain how the demo program works so that you can add prediction features to a .NET system or application. This article assumes you have advanced programming skills with a C-family language but does not assume you know anything about classification with adaptive boosting. I coded the demo using C# but the explanation should give you enough information to refactor the code to other languages such as Visual Basic .NET or Python. The code for the demo program is too long to fully list in this article, but the complete source code is available at msdn.com/magazine/msdnmag0413.

The Adaptive Boosting Algorithm

The heart of adaptive boosting classification is a routine that examines each weak learner and assigns an alpha weight to each. The algorithm is quite tricky and best explained by a concrete example. Suppose there are 10 training tuples and eight weak learners, as shown in Figure 1. Each training tuple is assigned a weight, usually called D in adaptive boosting literature. The sum of the D weights is 1.0, making the D values a distribution. Initially all training data items are assigned equal D weights, in this case 0.10 because there are 10 items:

[0] (D = 0.10) Detroit   Home   Large    Win
[1] (D = 0.10) Detroit   Away   Medium   Win
[2] (D = 0.10) Buffalo   Home   Small    Win
[3] (D = 0.10) Buffalo   Home   Medium   Win
[4] (D = 0.10) Atlanta   Away   Large    Win
[5] (D = 0.10) Chicago   Home   Medium   Win
[6] (D = 0.10) Chicago   Away   Small    Lose
[7] (D = 0.10) Chicago   Home   Small    Lose
[8] (D = 0.10) Atlanta   Away   Medium   Lose
[9] (D = 0.10) Detroit   Away   Large    Lose

Each learner has an epsilon value and an alpha value. Epsilon values are weighted error rates used to compute alpha values. Initially all learners have unknown alpha values (say a = 0.00) and unknown epsilon values (say e = -1.0):

[0] (a = 0.00) (e = -1.0) IF Opponent IS Buffalo THEN Result IS Win
[1] (a = 0.00) (e = -1.0) IF Opponent IS Chicago THEN Result IS Lose
[2] (a = 0.00) (e = -1.0) IF Opponent IS Detroit THEN Result IS Win
[3] (a = 0.00) (e = -1.0) IF Field IS Home THEN Result IS Win
[4] (a = 0.00) (e = -1.0) IF Field IS Away THEN Result IS Lose
[5] (a = 0.00) (e = -1.0) IF Spread IS Small THEN Result IS Lose
[6] (a = 0.00) (e = -1.0) IF Spread IS Medium THEN Result IS Win
[7] (a = 0.00) (e = -1.0) IF Spread IS Large THEN Result IS Win

In pseudo-code, the algorithm to find the alpha weights for each learner is:

set t=0
while not done loop
  update all learners' epsilons (weighted errors)
  find best (smallest epsilon) unused learner
  compute and save the alpha of best learner using its epsilon
  update the D weights for each training item using the best learner
  normalize the D weights so they sum to 1.0
  ++t
end loop

The main processing loop terminates when all weak learners have been processed and assigned an alpha weight; or when the loop counter variable t exceeds some maximum value; or when the weighted error rate, epsilon, for the best unused weak learner is some value, such as 0.45 or 0.49, indicating there aren’t any relatively good unused learners left to process.

The first step inside the loop is to update all epsilons. An epsilon value is the sum of the D weights of incorrectly classified training tuples. For learner [0] (IF Opponent IS Buffalo THEN Result IS Win) there are two applicable training tuples, [2] and [3], and the rule is correct in both instances, so epsilon is 0.00. For learner [1] (IF Opponent IS Chicago THEN Result IS Lose) there are three applicable training tuples, [5], [6] and [7]. Of these, tuple [5] is incorrect, so epsilon is just the D weight for tuple [5] = 0.10.

Although it’s not immediately obvious, if you carefully review how epsilons are computed, you’ll notice that epsilon values will always be between 0.0 and 0.5. After all updates, the learner epsilons are:

[0] (a = 0.00) (e = 0.00) IF Opponent IS Buffalo THEN Result IS Win
[1] (a = 0.00) (e = 0.10) IF Opponent IS Chicago THEN Result IS Lose
[2] (a = 0.00) (e = 0.10) IF Opponent IS Detroit THEN Result IS Win
[3] (a = 0.00) (e = 0.10) IF Field IS Home THEN Result IS Win
[4] (a = 0.00) (e = 0.20) IF Field IS Away THEN Result IS Lose
[5] (a = 0.00) (e = 0.10) IF Spread IS Small THEN Result IS Lose
[6] (a = 0.00) (e = 0.10) IF Spread IS Medium THEN Result IS Win
[7] (a = 0.00) (e = 0.10) IF Spread IS Large THEN Result IS Win

At this point the best learner is selected. It is learner [0] because its epsilon is smallest at 0.00. The associated alpha is computed as:

alpha = 0.5 * log((1.0 - epsilon) / epsilon)

This is essentially a magic equation from adaptive boosting theory. Here, log is the natural (base e) logarithm. Recall that the alpha value is a weight that assigns an importance to a learner. The previous equation is designed so that smaller values of epsilon (learner error) yield larger values of alpha (learner importance).

In this particular situation there’s a problem because epsilon is 0, so there’d be a division by 0 error. To avoid this, the demo program arbitrarily converts any epsilon with a 0 value to 0.000001. So the alpha for learner [0] is computed as 0.5 * log(0.999999 / 0.000001) = 6.91 and that value is assigned to learner [0], and learner [0] is flagged as completed.

The next step inside the algorithm loop is to update the D training tuple weights based on the just-computed best learner. The idea is to increase the D weights for those training tuples that are incorrectly classified by the best learner and decrease the D weights for training tuples that are correctly classified by the best learner. The D update equation is a bit tricky at first glance:

D(new) = D(old) * exp(-alpha * actualY * predictedY)

The best learner was learner [0] (IF Opponent IS Buffalo THEN Result IS Win) with an alpha of 6.91. Of the 10 training tuples, learner [0] applies to tuples [2] and [3], so those D values are updated. For training tuple [2]:

D(new) = 0.10 * exp(-6.91 * (+1) * (+1))
       = 0.10 * exp(-6.91)
       = 0.0000997758

The new D value for training tuple [3] has the same computation and same value as tuple [2].

In this case, we get a new D weight that’s very small because the training tuple was correctly classified by the best learner. Notice that when the actual Y value and predicted Y value are the same (both are -1 or both are +1), when multiplied together the result is +1 and the argument to exp will be a negative number (because -alpha will always be negative), which will yield a result less than 1.0. However, if the actual Y and predicted Y are different, their product will be -1, and the argument to exp will be positive, which will yield a (possibly large) number greater than 1.0. This update technique for D is why adaptive boosting classification typically uses -1 and +1 for the dependent variable values rather than 0 and +1.

At this point the preliminary approximate D values (rounded) are:

[0] (D = 0.1000) Detroit   Home   Large    Win
[1] (D = 0.1000) Detroit   Away   Medium   Win
[2] (D = 0.0001) Buffalo   Home   Small    Win
[3] (D = 0.0001) Buffalo   Home   Medium   Win
[4] (D = 0.1000) Atlanta   Away   Large    Win
[5] (D = 0.1000) Chicago   Home   Medium   Win
[6] (D = 0.1000) Chicago   Away   Small    Lose
[7] (D = 0.1000) Chicago   Home   Small    Lose
[8] (D = 0.1000) Atlanta   Away   Medium   Lose
[9] (D = 0.1000) Detroit   Away   Large    Lose

The next step in the main algorithm loop is to normalize the D values so they sum to 1.0 by dividing each preliminary D value by the sum of the D values. The sum of the 10 D values is about 0.8002, so the normalized D for training tuple [0] is approximately 0.1000 / 0.8002 = 0.1249. The final updated D weights are:

[0] (D = 0.1249) Detroit   Home   Large    Win
[1] (D = 0.1249) Detroit   Away   Medium   Win
[2] (D = 0.0001) Buffalo   Home   Small    Win
[3] (D = 0.0001) Buffalo   Home   Medium   Win
[4] (D = 0.1249) Atlanta   Away   Large    Win
[5] (D = 0.1249) Chicago   Home   Medium   Win
[6] (D = 0.1249) Chicago   Away   Small    Lose
[7] (D = 0.1249) Chicago   Home   Small    Lose
[8] (D = 0.1249) Atlanta   Away   Medium   Lose
[9] (D = 0.1249) Detroit   Away   Large    Lose

The idea here is that we want the algorithm to now focus on training tuples other than [2] and [3] because those tuples have been accounted for by learner [0]. At this point the algorithm jumps back to the top of the loop and updates all learners’ epsilon values based on the newly computed D weights, determines the best unused learner (learner [5]), computes the alpha for the best learner (4.49), updates the D values for applicable training tuples ([2], [6] and [7]) and computes normalized D values for all training tuples.

In this example, the process continues until alpha values for all eight weak learners have been computed, but in general, not all weak learners will necessarily make the cut as good learners and be assigned an alpha value.

Overall Program Structure

The demo program shown in Figure 1 is a single C# console application. I used Visual Studio 2010, but any version that supports the Microsoft .NET Framework 2.0 or higher will work. I created a new project named AdaptiveBoosting and then in the Solution Explorer window renamed Program.cs to the more descriptive AdaptiveBoostingProgram.cs, which also automatically renamed class Program. I deleted all the template-generated using statements at the top of the source code except for the references to the System and Collections.Generic namespaces. The Main method—with some WriteLine statements removed, a few other minor edits and a key program-defined class to define weak learner objects—is listed in Figure 2.

Figure 2 Overall Program Structure

using System;
using System.Collections.Generic;
namespace AdaptiveBoosting
{
  class Program
  {
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin adaptive boosting classification demo\n");
        string[] features = new string[] { "Opponent", "Field", "Spread",
          "Result" };
        string[][] values = new string[4][];
        values[0] = new string[] { "Atlanta", "Buffalo", "Chicago",
          "Detroit" }; // opponent
        values[1] = new string[] { "Home", "Away" };  // Field
        values[2] = new string[] { "Small ", "Medium", "Large " }; 
        // Note: Spaces added
        values[3] = new string[] { "Lose", "Win" }; 
        // The dependent/predicted variable
        string[][] rawTrain = new string[10][];
        rawTrain[0] = new string[] { "Detroit", "Home", "Large ", "Win" };
        rawTrain[1] = new string[] { "Detroit", "Away", "Medium", "Win" };
        rawTrain[2] = new string[] { "Buffalo", "Home", "Small ", "Win" };
        rawTrain[3] = new string[] { "Buffalo", "Home", "Medium", "Win" };
        rawTrain[4] = new string[] { "Atlanta", "Away", "Large ", "Win" };
        rawTrain[5] = new string[] { "Chicago", "Home", "Medium", "Win" };
        rawTrain[6] = new string[] { "Chicago", "Away", "Small ", "Lose" };
        rawTrain[7] = new string[] { "Chicago", "Home", "Small ", "Lose" };
        rawTrain[8] = new string[] { "Atlanta", "Away", "Medium", "Lose" };
        rawTrain[9] = new string[] { "Detroit", "Away", "Large ", "Lose" };
        Console.WriteLine("Raw (string) training data for team seattle:\n");
        Console.WriteLine("Opponent Field  Spread   Result");
        Console.WriteLine("===============================");
        ShowMatrix(rawTrain);
        Console.WriteLine("\nConverting and storing training data");
        int[][] train = RawTrainToInt(rawTrain, values);
        Console.WriteLine("Training data in int form:\n");
        ShowMatrix(train, true);
        Console.WriteLine(
          "\nCreating weak categorical stump learners from training data");
        List<Learner> learners = MakeLearners(values, train);
        Console.WriteLine("Completed. Weak learners are:\n");
        for (int i = 0; i < learners.Count; ++i)
          Console.WriteLine("[" + i + "] " + Description(learners[i],
            features, values));
        Console.WriteLine("\nInitializing list of best learner indexes");
        List<int> bestLearners = new List<int>();  // Indexes of good weak learners
        Console.WriteLine(
          "\nUsing adaptive boosting to find best  learners and alphas");
        MakeModel(train, values, learners, bestLearners);
        Console.WriteLine("\nModel completed");
        int numGood = bestLearners.Count;
        Console.Write("Algorithm found " + numGood + " good learners ");
        Console.WriteLine("and associated alpha values");
        Console.WriteLine("\nThe good learners and their alpha value are:");
        for (int i = 0; i < bestLearners.Count; ++i)
        {
          int lrn = bestLearners[i];
          Console.Write("[" + lrn + "] " +
            learners[lrn].alpha.ToString("F2") + "  ");
        }
        Console.Write("\nPredicting outcome when Opponent = Detroit, ");
        Console.WriteLine("Field = Home, Spread = Small\n");
        int[] unknownTuple = new int[] { 3, 0, 0 }; // Detroit, Home, Small
        int Y = Classify(unknownTuple, learners, bestLearners);
        Console.Write("Predicted Y = " + Y + " => ");
        Console.WriteLine("seattle will " + YValueToString(Y, values));
        Console.WriteLine("\nEnd\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main
    // (Many) static methods here
  } // Class Program
  public class Learner  // Weak learner
  {
    // Definition code here
  }
} // ns

The Main method begins by setting up hardcoded strings for “Opponent,” “Field,” “Spread” and “Result” for the features. Then the code in Main sets up hard-coded values for each feature: “Atlanta,” “Buffalo,” “Chicago,” “Detroit,” “Home,” “Away,” “Small,” “Medium,” “Large,” “Lose” and “Win.” To keep my output tidy, I used a hack and inserted a blank space at the end of “Small” and “Large.”

For simplicity, the training data is also hardcoded into the demo program. In many situations your training data will be stored in a text file or a SQL table. In those situations you might want to consider programmatically scanning the training data to determine the feature names (presumably from a text file header line or SQL table column names) and the feature values.

Method RawTrainToInt converts the training data in string form to zero-based integers and stores those integers into an int[][] matrix named train. RawTrainToInt calls a helper named ValueToInt. The train matrix has the dependent variable values (Result) stored in the last column. You may want to store dependent values in a separate column array. In situations with very large training data sets, you might not be able to store the entire training data set into machine memory. In those situations you’ll have to stream through the external data store instead of an internal matrix.

The demo program determines the weak learners using method MakeLearners and a program-defined class Learner. I’ll describe that method and class in detail in the next section of this article. After the weak learners have been created, the demo program calls method MakeModel. MakeModel is the heart of the adaptive boosting algorithm as described in the previous section. The net result is a sorted List, named bestLearners, of the indexes of the learners that were assigned alpha values.

The Main method finishes by predicting the outcome for Seattle for a set of inputs with Opponent of “Detroit,” Field of “Home” and Spread of “Small,” using method Classify. In this case, the return value from Classify is -0.72, which is interpreted as “Lose.”

Making the Weak Learners

The implementation of a specific adaptive boosting algorithm depends to some extent on the specific definition of a weak learner. Program-defined class Learner has six fields:

public int feature;
public int value;
public int predicted;
public double error;
public double epsilon;
public double alpha;

I declared all five fields with public scope for simplicity. The feature field holds an integer that indicates which independent variable is the key to the learner. For example, if feature is 0, the weak learner is based on the value of an opponent. The value field holds an integer that indicates the value of the feature. For example, if value is 3, the weak learner is based on the condition opponent is Detroit. The predicted field is -1 or +1, depending on whether the actual category for the feature value is Lose or Win.

The error field is type double and is the raw error rate associated with the weak learner on the training data. For example, if a weak learner has feature = 0, and value = 3, and predicted = +1 (meaning if Opponent is Detroit then result is Win), then the raw error rate for the training data in Figure 1 is 0.33 because one out of three training data items would be incorrectly predicted. Notice that raw error treats each training item equally. It turns out that the adaptive boosting algorithm presented in this article doesn’t really need the raw error field, so that field could’ve been omitted, but I believe the information is useful.

The epsilon field is a weighted error term. The epsilon for a weak learner is an error term that takes into account the internal D weights assigned to each training item. The epsilon values are used by the adaptive boosting algorithm to compute the alpha weights. To summarize, there are two sets of weights used in adaptive boosting classification. The alpha weights assign an importance to each weak learner and are used to determine an overall prediction. An epsilon error is an internal error associated with a weak learner that’s used to compute the alpha weights. Each training tuple has an internal weight (given the name D in adaptive boosting literature) that’s used to compute the epsilon errors.

In pseudo-code, method MakeLearners works like this:

initialize an empty result list of learners
for each feature loop
  for each value of curr feature loop
    scan training data to determine most likely -1, +1 result
    if no most likely result, skip curr value
    create a new learner object with feature, value, predicted
    add learner object to result list
  end each value
end each feature
for each learner in result list
  for each training data item
    if learner isn't applicable to data, skip curr data item
    compute raw error rate
    store raw error into learner
  end each training data item
end each learner

The idea is that each feature value such as “Atlanta” (opponent) or “Medium” (point spread) will generate a weak learner rule based on the training data unless the value doesn’t appear in the training data (for example, an opponent of “New York”) or the value doesn’t produce a most-likely result because there are the same number of wins and losses associated with the value (for example, when opponent is “Atlanta” in the demo data, with one win and one loss).

Wrapping Up

An important variation on the algorithm presented in this article is dealing with data that has numeric values. For example, suppose that the values for the point spread feature, instead of being categorical “Small,” “Medium” and “Large,” were numeric, such as 1.5, 3.0 and 9.5. One of the major advantages of adaptive boosting classification compared with some other classification techniques is that adaptive boosting can easily handle both categorical and numeric data directly. You could create a dedicated learner class that has a friendly description similar to “if Spread is less than or equal to 5.5 then Result is Lose,” or a more-complex learner along the lines of “if Spread is between 4.0 and 6.0 then Result is Win.”

In my opinion, adaptive boosting classification is best used when the dependent variable to be predicted has just two possible values. However, advanced forms of adaptive boosting can deal with multinomial classification. If you wish to investigate this or the theory behind adaptive boosting in general, I recommend searching for articles by researchers R. Schapire and Y. Freund.

Machine-learning research suggests that there’s no single best data classification/prediction technique. But adaptive boosting is a very powerful approach that can form the basis of adding predictive features to a .NET software system.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Darren Gehring (Microsoft)

CLR - Classification and Prediction Using Adaptive Boosting

The Adaptive Boosting Algorithm

Overall Program Structure

Making the Weak Learners

Wrapping Up

Additional resources