April 2015

Volume 30 Number 4

Test Run - Multi-Class Logistic Regression Classification

By James McCaffrey

James McCaffreyI consider logistic regression (LR) classification to be the “Hello, world!” of machine learning (ML). In standard LR classification, the goal is to predict the value of some variable that can take on just one of two categorical values. For example, you might want to predict a person’s sex (male or female) based on their height and annual income.

Multi-class LR classification extends standard LR by allowing the variable to predict to have three or more values. For example, you might want to predict a person’s political inclination (conservative, moderate or liberal) based on predictor variables such as age, annual income and so on. In this article, I’ll explain how multi-class LR works and show you how to implement it using C#.

The best way to understand where this article is headed is to take a look at the demo program in Figure 1. The demo begins by generating 1,000 lines of synthetic data with four predictor variables (also called features), where the variable to predict can take on one of three values. For example, a line of generated data might resemble:

5.33  -4.89  0.15  -6.67  0.00  1.00  0.00

Multi-Class Logistic Regression in Action
Figure 1 Multi-Class Logistic Regression in Action

The first four values are predictor values that represent real-life data that has been normalized so a value of 0.0 is exactly average for the feature, values greater than 0.0 are larger than the feature average, and values less than 0.0 are smaller than the feature average. The last three values are a 1-of-N encoded version of the variable to predict. For example, if you’re trying to predict political inclination, then (1 0 0) represents conservative, (0 1 0) represents moderate and (0 0 1) represents liberal.

After the synthetic data was generated, it was randomly split into a training set (80 percent of the data, or 800 lines) and a test set (the remaining 200 lines). The training data is used to create the prediction model, and the test data is used to estimate the predictive accuracy of the model on new data where the value to predict isn’t known.

A multi-class LR model with f features and c classes will have (f * c) weights and c biases. These are numeric constants that must be determined. For the demo, there are 4 * 3 = 12 weights, and 3 biases. Training is the process of estimating the values of the weights and biases. Training is an iterative process and the demo sets the maximum number of training iterations (often called epochs in ML literature) to 100. The technique used to train the multi-class LR classifier is called batch gradient descent. This technique requires values for two parameters called the learning rate and the weight decay rate. These two values are typically found by trial and error and the demo assigns values of 0.01 and 0.10, respectively.

During training, the demo displays a progress message every 10 epochs. If you look at the messages in Figure 1, you can see that training converged very quickly and there was no improvement after the first 20 epochs.

After training completed, the demo displayed the best values found for the 12 weights and 3 biases. These 15 values define the multi-class LR model. Using those values, the demo computed the predictive accuracy of the model on the training data (92.63 percent correct, or 741 out of 800) and on the test data (90.00 percent correct, or 180 out of 200).

This article assumes you have intermediate or advanced programming skills, but doesn’t assume you know anything about multi-class logistic regression. The demo program is coded using C#, but you should be able to refactor the demo to other programming languages without too much trouble.

Understanding Multi-Class Logistic Regression

Suppose you want to predict the political inclination (conservative, moderate, liberal) of a person based on age (x0), annual income (x1), height (x3) and education level (x4). You encode political inclination with three variables as (y0, y1, y2), where conservative is (1, 0, 0), moderate is (0, 1, 0) and liberal is (0, 0, 1). A multi-class LR model for this problem would take the form:

z0 = (w00)(x0) + (w01)(x1) + (w02)(x2) + b0
y0 = 1.0 / (1.0 + e^-z0)
z1 = (w10)(x0) + (w11)(x1) + (w12)(x2) + b1
y1 = 1.0 / (1.0 + e^-z1)
z2 = (w20)(x0) + (w21)(x1) + (w22)(x2) + b2
y2 = 1.0 / (1.0 + e^-z2)

Here, wij is the weight value associated with feature variable i and class variable j, and bj is the bias value associated with class variable j.

An example of multi-class LR is illustrated in Figure 2. One training data item has four predictor values (5.10, -5.20, 5.30, -5.40) followed by three output values (0, 0, 1). The predictor values are arbitrary, but you can imagine they represent a person whose age is greater than average, income is less than average, height is greater than average, and education level is less than average, and the person has a political inclination of liberal.

Multi-Class Logistic Regression Data Structures
Figure 2 Multi-Class Logistic Regression Data Structures

Each of the three columns of the weights matrix corresponds to one of the three class values. The four values in each column correspond to the four predictor x-values. The biases array holds an additional, special weight—one for each class—which isn’t associated with a predictor.

Notice the biases array could’ve been stored as an additional row in the weights matrix. This is often done in research papers because it simplifies the math equations. But for demo implementation purposes, maintaining a separate weights matrix and a biases array is slightly easier to understand, in my opinion.

 In multi-class LR, output values are calculated for each class. In Figure 2, the computed output values are (0.32, 0.33, 0.35). The output values sum to 1.0 and can be interpreted as probabilities. Because the last output value is (barely) the largest of the three, you conclude the outputs correspond to (0, 0, 1). In this example, the computed outputs match the three outputs in the training data item, so the model has made a correct prediction.

The output values are computed by first summing the products of each input value times its corresponding weight value, then adding the corresponding bias value. These sums of products are often called z-values. The z-values are fed to what is called the logistic sigmoid function: 1.0 / (1.0 + e^-z) where e is the math constant and ‘^’ means exponentiation. Although it’s not apparent from Figure 2, the result of the logistic sigmoid function will always be between 0.0 and 1.0.

Each of the logistic sigmoid values is used to compute the final output values. The logistic sigmoid values are summed and used as a divisor. This process is called the softmax function. If you’re new to all these LR concepts, they can be very confusing at first. But if you go over the example in Figure 2 a few times, you’ll eventually see that LR is not as complicated as it first appears.

From where do the weights and biases values come? The process of determining these values is called training the model. The idea is to use a set of training data that has known input and output values, and then try different values for the weights and biases until you find a set of values that minimizes the error between computed outputs and the correct output values (often called the target values) in the training data.

It’s not feasible to calculate the exact values for the weights and biases, so weights and biases must be estimated. There are several so-called numerical optimization techniques that can be used to do this. Common techniques include the L-BFGS algorithm, the iteratively reweighted least squares method, and particle swarm optimization. The demo program uses a technique that’s somewhat confusingly called both gradient descent (minimizing error between computed and known output values) and gradient ascent (maximizing the probability that weights and biases are optimal).

Demo Program Structure

The structure of the demo program, with some minor edits to save space, is presented in Figure 3. To create the demo program, I launched Visual Studio and selected the C# Console Application template. I named the project LogisticMultiClassGradient. The demo has no significant .NET dependencies so any version of Visual Studio will work. The demo is too long to present in its entirety, but all the source code is available in the download that accompanies this article. I removed all normal error checking to keep the main ideas as clear as possible.

Figure 3 Demo Program Structure

using System;
namespace LogisticMultiClassGradient
  class LogisticMultiProgram
    static void Main(string[] args)
      Console.WriteLine("Begin classification demo");
      Console.WriteLine("End demo");
    public static void ShowData(double[][] data,
      int numRows, int decimals, bool indices) { . . }
    public static void ShowVector(double[] vector,
      int decimals, bool newLine) { . . }
    static double[][] MakeDummyData(int numFeatures,
      int numClasses, int numRows, int seed) { . . }
    static void SplitTrainTest(double[][] allData,
      double trainPct, int seed, out double[][] trainData,
      out double[][] testData) { . . }
  public class LogisticMulti
    private int numFeatures;
    private int numClasses;
    private double[][] weights; // [feature][class]
    private double[] biases;    // [class]
    public LogisticMulti(int numFeatures,
      int numClasses) { . . }
    private double[][] MakeMatrix(int rows,
      int cols) { . . }
    public void SetWeights(double[][] wts,
      double[] b) { . . }
    public double[][] GetWeights() { . . }
    public double[] GetBiases() { . . }
    private double[] ComputeOutputs(double[] dataItem) { . . }
    public void Train(double[][] trainData, int maxEpochs,
      double learnRate, double decay) { . . }
    public double Error(double[][] trainData) { . . }
    public double Accuracy(double[][] trainData) { . . }
    private static int MaxIndex(double[] vector) { . . }
    private static int MaxIndex(int[] vector) { . . }
    private int[] ComputeDependents(double[] dataItem) { . . }

After the template code loaded, in the Solution Explorer window I right-clicked file Program.cs and renamed it to the more descriptive LogisticMultiProgram.cs and Visual Studio automatically renamed class Program for me. In the editor window, at the top of the source code, I deleted all unneeded using statements, leaving just the one referencing the top-level System namespace.

The LogisticMultiProgram class contains helper methods MakeDummyData, SplitTrainTest, ShowData and ShowVector. These methods create and display the synthetic data. All the classification logic is contained in a program-defined class named LogisticMulti.

The Main method creates the synthetic data with these statements:

int numFeatures = 4;
int numClasses = 3;
int numRows = 1000;
double[][] data = MakeDummyData(numFeatures, 
  numClasses, numRows, 0);

Method MakeDummyData generates a set of random weights and biases, then for each row of data, generates random input values, combines the weights and biases and input values, and calculates some corresponding 1-of-N encoded output values. The synthetic data is split into 80 percent training and 20 percent test sets, like so:

double[][] trainData;
double[][] testData;
SplitTrainTest(data, 0.80, 7, out trainData, out testData);
ShowData(trainData, 3, 2, true);
ShowData(testData, 3, 2, true);

The argument with value 7 is a random seed, used only because it provided a nice-looking demo. The multi-class LR classifier is created and trained with these statements:

LogisticMulti lc = new LogisticMulti(numFeatures, numClasses);
int maxEpochs = 100;
double learnRate = 0.01;
double decay = 0.10;
lc.Train(trainData, maxEpochs, learnRate, decay);

The values for training parameters maxEpochs (100), the learning rate (0.01), and the weight decay rate (0.10) were determined by trial and error. Tuning most ML training methods typically requires some experimentation to get good prediction accuracy.

After training, the best weights and biases values are stored in the LogisticMulti object. They’re retrieved and displayed like this:

double[][] bestWts = lc.GetWeights();
double[] bestBiases = lc.GetBiases();
ShowData(bestWts, bestWts.Length, 3, true);
ShowVector(bestBiases, 3, true);

An alternative design to using a void Train method combined with Get methods is to refactor method Train so that it returns the best-weights matrix and best-biases array as out-parameters, or in a combined array. The quality of the trained model is evaluated like so:

double trainAcc = lc.Accuracy(trainData, weights);
double testAcc = lc.Accuracy(testData, weights);

The accuracy of the model on the test data is the more relevant of the two accuracy values. It provides you with a rough estimate of how accurate the model would be when presented with new data with unknown output values.

Implementing Multi-Class LR Training

The LogisticMulti class constructor is defined as:

public LogisticMulti(int numFeatures, int numClasses)
  this.numFeatures = numFeatures;
  this.numClasses = numClasses;
  this.weights = MakeMatrix(numFeatures, numClasses);
  this.biases = new double[numClasses];

Method MakeMatrix is a helper method that allocates memory for an array-of-arrays-style matrix. The weights matrix and biases array are implicitly initialized to all 0.0 values. An alternative that some researchers prefer is to explicitly initialize weights and biases to small (typically between 0.001 and 0.01) random values.

The definition of method ComputeOutputs is presented in Figure 4. The method returns an array of values, one for each class, where each value is between 0.0 and 1.0 and the values sum to 1.0.

Figure 4 Method ComputeOutputs

private double[] ComputeOutputs(double[] dataItem)
  double[] result = new double[numClasses];
  for (int j = 0; j < numClasses; ++j) // compute z
    for (int i = 0; i < numFeatures; ++i)
      result[j] += dataItem[i] * weights[i][j];
    result[j] += biases[j];
  for (int j = 0; j < numClasses; ++j) // 1 / 1 + e^-z
    result[j] = 1.0 / (1.0 + Math.Exp(-result[j]));
  double sum = 0.0; // softmax scaling
  for (int j = 0; j < numClasses; ++j)
    sum += result[j];
  for (int j = 0; j < numClasses; ++j)
    result[j] = result[j] / sum;
  return result;

The class definition contains a similar method, ComputeDependents:

private int[] ComputeDependents(double[] dataItem)
  double[] outputs = ComputeOutputs(dataItem); // 0.0 to 1.0
  int maxIndex = MaxIndex(outputs);
  int[] result = new int[numClasses];
  result[maxIndex] = 1;
  return result;

Method ComputeDependents returns an integer array where one value is 1 and the other values are 0. These computed output values can be compared to the known target output values in the training data to determine whether the model has made a correct prediction, which in turn can be used to calculate prediction accuracy.

Expressed in very high-level pseudo-code, method Train is:

loop maxEpochs times
  compute all weight gradients
  compute all bias gradients
  use weight gradients to update all weights
  use bias gradients to update all biases

Each weight and bias value has an associated gradient value. Loosely speaking, a gradient is a value that indicates how far away, and in what direction (positive or negative) a computed output value is compared to the target output value. For example, suppose for one weight, if the values of all other weights and biases are held constant, a computed output value is 0.7 and the target output value is 1.0. The computed value is too small, so the gradient is a value of about 0.3, which will be added to the weight. If the value of the weight increases, the computed output value will increase. I’ve left out some details, but the basic idea is fairly simple.

The mathematics behind gradient training uses calculus and is very complex, but fortunately, you don’t need to fully understand the math to implement the code. The definition of method Train begins with:

public void Train(double[][] trainData, int maxEpochs,
  double learnRate, double decay)
  double[] targets = new double[numClasses];
  int msgInterval = maxEpochs / 10;
  int epoch = 0;
  while (epoch < maxEpochs)

The targets array will hold the correct output values stored in a training data item. Variable msgInterval controls the number of times to display progress messages. Then, progress messages are displayed:

if (epoch % msgInterval == 0 && epoch != maxEpochs)
  double mse = Error(trainData);
  Console.Write("epoch = " + epoch);
  Console.Write(" error = " + mse.ToString("F4"));
  double acc = Accuracy(trainData);
  Console.WriteLine(" accuracy = " + acc.ToString("F4"));

Because ML training usually involves some trial and error, displaying progress messages is very useful. Next, storage for the weights and biases gradients is allocated:

double[][] weightGrads = MakeMatrix(numFeatures, numClasses);
double[] biasGrads = new double[numClasses];

Notice these allocations occur inside the main while-loop. Because C# initializes arrays to 0.0, all gradients are initialized. An alternative is to allocate space outside the while-loop and then call helper methods with names like ZeroMatrix and ZeroArray. Next, all weights gradients are computed:

for (int j = 0; j < numClasses; ++j) {
  for (int i = 0; i < numFeatures; ++i) {
    for (int r = 0; r < trainData.Length; ++r) {
      double[] outputs = ComputeOutputs(trainData[r]);
        for (int k = 0; k < numClasses; ++k)
          targets[k] = trainData[r][numFeatures + k];
        double input = trainData[r][i];
        weightGrads[i][j] += (targets[j] - outputs[j]) * input;

This code is the heart of multi-class LR. Each weight gradient is essentially the difference between a target output value and a computed output value. That difference is multiplied by the associated input value to take into account the fact that the input can be a negative value, which means the weight should be adjusted in the opposite direction.

An interesting alternative that I often use is to ignore the magnitude of the input value and use just its sign:

double input = trainData[r][i];
int sign = (input > 0.0) ? 1 : -1;
weightGrads[i][j] += (targets[j] - outputs[j]) * sign;

In my experience, this technique often leads to a better model. Next, all biases gradients are computed:

for (int j = 0; j < numClasses; ++j) {
  for (int i = 0; i < numFeatures; ++i) {
    for (int r = 0; r < trainData.Length; ++r) {
      double[] outputs = ComputeOutputs(trainData[r]);
      for (int k = 0; k < numClasses; ++k)
        targets[k] = trainData[r][numFeatures + k];
      double input = 1; // 1 is a dummy input
      biasGrads[j] += (targets[j] - outputs[j]) * input;

If you examine the code, you can see that computing the biases gradients could be performed within the for-loops that compute the weights gradients. I separated the two gradient computations for clarity, at the expense of performance. Also, the multiplication by the implied input value of 1 can be dropped. It, too, was added for clarity. Next, the weights are updated:

for (int i = 0; i < numFeatures; ++i) {
  for (int j = 0; j < numClasses; ++j) {
    weights[i][j] += learnRate * weightGrads[i][j];
    weights[i][j] *= (1 - decay);  // wt decay

After increasing or decreasing a weight based on a learning rate fraction of its gradient, the weight value is decreased using the weight decay rate. For example, the demo uses a typical weight decay value of 0.10, so multiplying by (1 - 0.10) is multiplying by 0.90, which is a 10 percent decrease. Weight decay is also called regularization. The technique prevents weight values from spinning out of control. Method Train concludes by updating the biases values:

    for (int j = 0; j < numClasses; ++j) {
      biases[j] += learnRate * biasGrads[j];
      biases[j] *= (1 - decay);
  } // While
} // Train

The training technique updates the class member weights matrix and biases array in place. These values define the multi-class LR model and can be retrieved using Get methods.

Wrapping Up

There are two main variations of gradient training, called batch and stochastic. This article presented the batch version where gradients are calculated by summing the differences between computed and target outputs over all training items. In stochastic gradient training, gradients are estimated by using just individual training data items. Based on some experiments, when applied to multi-class LR, batch training seems to give a more accurate model, but takes longer than stochastic training. This is rather surprising because stochastic training is usually preferable to batch training for neural networks.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash., and has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Todd Bello and Alisson Sol