February 2015

Volume 30 Number 2

# Test Run - L1 and L2 Regularization for Machine Learning

L1 regularization and L2 regularization are two closely related techniques that can be used by machine learning (ML) training algorithms to reduce model overfitting. Reducing overfitting generally leads to a model that makes better predictions. In this article I’ll explain what regularization is from a software developer’s point of view. The ideas behind regularization are a bit tricky to explain, not because they’re difficult, but rather because there are several interrelated ideas.

In this article I illustrate regularization with logistic regression (LR) classification, but regularization can be used with many types of machine learning, notably neural network classification. The goal of LR classification is to create a model that predicts a variable that can take one of two possible values. For example, you might want to predict the result for a football team (lose = 0, win = 1) in an upcoming game based on the team’s current winning percentage (x1), field location (x2), and number of players absent due to injury (x3).

If Y is the predicted value, an LR model for this problem would take the form:

```
z = b0 + b1(x1) + b2(x2) + b3(x3)
Y = 1.0 / (1.0 + e^-z)
```

Here b0, b1, b2 and b3 are weights, which are just numeric values that must be determined. In words, you compute a value z that is the sum of input values times b-weights, add a b0 constant, then pass the z value to the equation that uses math constant e. It turns out that Y will always be between 0 and 1. If Y is less than 0.5, you conclude the predicted output is 0 and if Y is greater than 0.5 you conclude the output is 1. Notice that if there are n features, there will be n+1 b-weights.

For example, suppose a team currently has a winning percentage of 0.75, and will be playing at their opponent’s field (-1), and has 3 players out due to injury. And suppose b0 = 5.0, b1 = 8.0, b2 = 3.0, and b3 = -2.0. Then z = 5.0 + (8.0)(0.75) + (3.0)(-1) + (-2.0)(3) = 2.0 and so Y = 1.0 / (1.0 + e^-2.0) = 0.88. Because Y is greater than 0.5, you’d predict the team will win their upcoming game.
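The arithmetic above can be checked with a few lines of C#. This is a minimal sketch rather than the demo's code: the class name LogisticSketch and its method signatures are assumptions made for illustration, with the b0 constant stored at index 0 of the weights array.

```csharp
using System;

public static class LogisticSketch
{
    // Computes Y = 1 / (1 + e^-z), where z = b[0] + b[1]*x[0] + ... + b[n]*x[n-1].
    // Assumes b.Length == x.Length + 1 (hypothetical layout: b[0] is the constant).
    public static double ComputeOutput(double[] x, double[] b)
    {
        double z = b[0];
        for (int i = 0; i < x.Length; ++i)
            z += b[i + 1] * x[i];
        return 1.0 / (1.0 + Math.Exp(-z));
    }

    // Maps the output to a class label: 1 if Y > 0.5, otherwise 0.
    // Treating Y == 0.5 as class 0 is an arbitrary choice made for this sketch.
    public static int Predict(double[] x, double[] b)
    {
        return ComputeOutput(x, b) > 0.5 ? 1 : 0;
    }
}
```

Calling ComputeOutput with inputs (0.75, -1.0, 3.0) and weights (5.0, 8.0, 3.0, -2.0) gives a value of about 0.88, so Predict returns 1 (win).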

I think the best way to explain regularization is by examining a concrete example. Take a look at the screenshot of a demo program in **Figure 1**. Rather than use real data, the demo program begins by generating 1,000 synthetic data items. Each item has 12 predictor variables (often called “features” in ML terminology). The dependent variable value is in the last column. After creating the 1,000 data items, the data set was randomly split into an 800-item training set to be used to find the model b-weights, and a 200-item test set to be used to evaluate the quality of the resulting model.

**Figure 1 Regularization with Logistic Regression Classification**

Next, the demo program trained the LR classifier, without using regularization. The resulting model had 85.00 percent accuracy on the training data, and 80.50 percent accuracy on the test data. The 80.50 percent accuracy is the more relevant of the two values, and is a rough estimate of how accurate you could expect the model to be when presented with new data. As I’ll explain shortly, the model was over-fitted, leading to mediocre prediction accuracy.

Next, the demo did some processing to find a good L1 regularization weight and a good L2 regularization weight. Regularization weights are single numeric values that are used by the regularization process. In the demo, a good L1 weight was determined to be 0.005 and a good L2 weight was 0.001.

The demo first performed training using L1 regularization and then again with L2 regularization. With L1 regularization, the resulting LR model had 95.00 percent accuracy on the test data, and with L2 regularization, the LR model had 94.50 percent accuracy on the test data. Both forms of regularization significantly improved prediction accuracy.

This article assumes you have at least intermediate programming skills, but doesn’t assume you know anything about L1 or L2 regularization. The demo program is coded using C#, but you shouldn’t have too much difficulty refactoring the code to another language such as JavaScript or Python.

The demo code is too long to present here, but complete source code is available in the code download that accompanies this article. The demo code has all normal error checking removed to keep the main ideas as clear as possible and the size of the code small.

## Overall Program Structure

The overall program structure, with some minor edits to save space, is presented in **Figure 2**. To create the demo, I launched Visual Studio and created a new C# console application named Regularization. The demo has no significant Microsoft .NET Framework dependencies, so any recent version of Visual Studio will work.

**Figure 2 Overall Program Structure**

```
using System;
namespace Regularization
{
  class RegularizationProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin L1 and L2 Regularization demo");
      int numFeatures = 12;
      int numRows = 1000;
      int seed = 42;
      Console.WriteLine("Generating " + numRows +
        " artificial data items with " + numFeatures + " features");
      double[][] allData = MakeAllData(numFeatures, numRows, seed);
      Console.WriteLine("Creating train and test matrices");
      double[][] trainData;
      double[][] testData;
      MakeTrainTest(allData, 0, out trainData, out testData);
      Console.WriteLine("Training data: ");
      ShowData(trainData, 4, 2, true);
      Console.WriteLine("Test data: ");
      ShowData(testData, 3, 2, true);
      Console.WriteLine("Creating LR binary classifier");
      LogisticClassifier lc = new LogisticClassifier(numFeatures);
      int maxEpochs = 1000;
      Console.WriteLine("Starting training using no regularization");
      double[] weights = lc.Train(trainData, maxEpochs,
        seed, 0.0, 0.0);
      Console.WriteLine("Best weights found:");
      ShowVector(weights, 3, weights.Length, true);
      double trainAccuracy = lc.Accuracy(trainData, weights);
      Console.WriteLine("Prediction accuracy on training data = " +
        trainAccuracy.ToString("F4"));
      double testAccuracy = lc.Accuracy(testData, weights);
      Console.WriteLine("Prediction accuracy on test data = " +
        testAccuracy.ToString("F4"));
      Console.WriteLine("Seeking good L1 weight");
      double alpha1 = lc.FindGoodL1Weight(trainData, seed);
      Console.WriteLine("L1 weight = " + alpha1.ToString("F3"));
      Console.WriteLine("Seeking good L2 weight");
      double alpha2 = lc.FindGoodL2Weight(trainData, seed);
      Console.WriteLine("L2 weight = " + alpha2.ToString("F3"));
      Console.WriteLine("Training with L1 regularization, " +
        "alpha1 = " + alpha1.ToString("F3"));
      weights = lc.Train(trainData, maxEpochs, seed, alpha1, 0.0);
      Console.WriteLine("Best weights found:");
      ShowVector(weights, 3, weights.Length, true);
      trainAccuracy = lc.Accuracy(trainData, weights);
      Console.WriteLine("Prediction accuracy on training data = " +
        trainAccuracy.ToString("F4"));
      testAccuracy = lc.Accuracy(testData, weights);
      Console.WriteLine("Prediction accuracy on test data = " +
        testAccuracy.ToString("F4"));
      Console.WriteLine("Training with L2 regularization, " +
        "alpha2 = " + alpha2.ToString("F3"));
      weights = lc.Train(trainData, maxEpochs, seed, 0.0, alpha2);
      Console.WriteLine("Best weights found:");
      ShowVector(weights, 3, weights.Length, true);
      trainAccuracy = lc.Accuracy(trainData, weights);
      Console.WriteLine("Prediction accuracy on training data = " +
        trainAccuracy.ToString("F4"));
      testAccuracy = lc.Accuracy(testData, weights);
      Console.WriteLine("Prediction accuracy on test data = " +
        testAccuracy.ToString("F4"));
      Console.WriteLine("End Regularization demo");
      Console.ReadLine();
    }
    static double[][] MakeAllData(int numFeatures,
      int numRows, int seed) { . . }
    static void MakeTrainTest(double[][] allData, int seed,
      out double[][] trainData, out double[][] testData) { . . }
    public static void ShowData(double[][] data, int numRows,
      int decimals, bool indices) { . . }
    public static void ShowVector(double[] vector, int decimals,
      int lineLen, bool newLine) { . . }
  }
  public class LogisticClassifier
  {
    private int numFeatures;
    private double[] weights;
    private Random rnd;
    public LogisticClassifier(int numFeatures) { . . }
    public double FindGoodL1Weight(double[][] trainData,
      int seed) { . . }
    public double FindGoodL2Weight(double[][] trainData,
      int seed) { . . }
    public double[] Train(double[][] trainData, int maxEpochs,
      int seed, double alpha1, double alpha2) { . . }
    private void Shuffle(int[] sequence) { . . }
    public double Error(double[][] trainData, double[] weights,
      double alpha1, double alpha2) { . . }
    public double ComputeOutput(double[] dataItem,
      double[] weights) { . . }
    public int ComputeDependent(double[] dataItem,
      double[] weights) { . . }
    public double Accuracy(double[][] trainData,
      double[] weights) { . . }
    public class Particle { . . }
  }
} // ns
```

After the template code loaded into the Visual Studio editor, in the Solution Explorer window I renamed file Program.cs to the more descriptive RegularizationProgram.cs and Visual Studio automatically renamed class Program for me. At the top of the source code, I deleted all using statements that pointed to unneeded namespaces, leaving just the reference to the top-level System namespace.

All of the logistic regression logic is contained in a single LogisticClassifier class. The LogisticClassifier class contains a nested helper Particle class to encapsulate particle swarm optimization (PSO), the optimization algorithm used for training. Note that the LogisticClassifier class contains a method Error, which accepts parameters named alpha1 and alpha2. These parameters are the regularization weights for L1 and L2 regularization.

In the Main method, the synthetic data is created with these statements:

```
int numFeatures = 12;
int numRows = 1000;
int seed = 42;
double[][] allData = MakeAllData(numFeatures, numRows, seed);
```

The seed value of 42 was used only because that value gave nice, representative demo output. Method MakeAllData generates 13 random weights between -10.0 and +10.0 (one weight for each feature, plus the b0 weight). Then the method iterates 1,000 times. On each iteration, a random set of 12 input values is generated, then an intermediate logistic regression output value is calculated using the random weights. An additional random value is added to the output to make the data noisy and more prone to overfitting.
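The data-generation steps just described can be sketched as follows. The body of MakeAllData isn't shown in this article, so this is an illustration under stated assumptions, not the demo's implementation: the class name SyntheticDataSketch, the range of the input values, the size of the noise term and the 0.5 threshold used to produce the 0/1 label are all hypothetical choices.

```csharp
using System;

public static class SyntheticDataSketch
{
    public static double[][] MakeAllData(int numFeatures, int numRows, int seed)
    {
        Random rnd = new Random(seed);

        // One weight per feature plus the b0 constant, each in [-10.0, +10.0).
        double[] weights = new double[numFeatures + 1];
        for (int i = 0; i < weights.Length; ++i)
            weights[i] = 20.0 * rnd.NextDouble() - 10.0;

        double[][] result = new double[numRows][];
        for (int r = 0; r < numRows; ++r)
        {
            result[r] = new double[numFeatures + 1]; // last column holds the 0/1 label
            double z = weights[0];
            for (int j = 0; j < numFeatures; ++j)
            {
                double x = 20.0 * rnd.NextDouble() - 10.0; // ASSUMPTION: input range
                result[r][j] = x;
                z += weights[j + 1] * x;
            }
            double y = 1.0 / (1.0 + Math.Exp(-z));   // intermediate LR output
            y += 0.2 * rnd.NextDouble() - 0.1;       // ASSUMPTION: noise in [-0.1, +0.1)
            result[r][numFeatures] = (y > 0.5) ? 1.0 : 0.0; // ASSUMPTION: threshold
        }
        return result;
    }
}
```

Because the Random object is seeded, calling the method twice with the same arguments produces the same data, which is what makes a demo run reproducible.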

The data is split into an 800-item set for training and a 200-item set for model evaluation with these statements:

```
double[][] trainData;
double[][] testData;
MakeTrainTest(allData, 0, out trainData, out testData);
```
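The body of MakeTrainTest isn't listed in this article. One common way to write such a method is to shuffle the rows and then cut the result at the 80 percent mark, as in this hedged sketch. The class name SplitSketch, the use of a Fisher-Yates shuffle and the hard-coded 80/20 ratio are assumptions for illustration.

```csharp
using System;

public static class SplitSketch
{
    // Shuffles row references (Fisher-Yates), then puts the first 80 percent
    // in trainData and the rest in testData. Rows are shared, not copied.
    public static void MakeTrainTest(double[][] allData, int seed,
        out double[][] trainData, out double[][] testData)
    {
        Random rnd = new Random(seed);
        double[][] shuffled = (double[][])allData.Clone(); // shallow copy of the row array
        for (int i = shuffled.Length - 1; i > 0; --i)
        {
            int j = rnd.Next(i + 1); // 0 <= j <= i
            double[] tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }

        int numTrain = shuffled.Length * 4 / 5; // ASSUMPTION: hard-coded 80/20 split
        trainData = new double[numTrain][];
        testData = new double[shuffled.Length - numTrain][];
        Array.Copy(shuffled, 0, trainData, 0, numTrain);
        Array.Copy(shuffled, numTrain, testData, 0, testData.Length);
    }
}
```

With 1,000 input rows this yields 800 training rows and 200 test rows.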

A logistic regression prediction model is created with these statements:

```
LogisticClassifier lc = new LogisticClassifier(numFeatures);
int maxEpochs = 1000;
double[] weights = lc.Train(trainData, maxEpochs, seed, 0.0, 0.0);
ShowVector(weights, 4, weights.Length, true);
```

Variable maxEpochs is a loop counter limiting value for the PSO training algorithm. The two 0.0 arguments passed to method Train are the L1 and L2 regularization weights. By setting those weights to 0.0, no regularization is used. The model’s quality is evaluated with these two statements:

```
double trainAccuracy = lc.Accuracy(trainData, weights);
double testAccuracy = lc.Accuracy(testData, weights);
```
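Method Accuracy isn't listed in this article, but its job follows from the LR equations: compute the output for each data item, threshold it at 0.5 and count how often the result matches the 0/1 value in the last column. The following is a minimal sketch under that reading. The class name AccuracySketch is hypothetical, and the sketch is written as a static helper rather than as a member of LogisticClassifier.

```csharp
using System;

public static class AccuracySketch
{
    // Logistic output; weights[0] is the b0 constant and the last column
    // of dataItem (the label) is not used as an input.
    static double ComputeOutput(double[] dataItem, double[] weights)
    {
        double z = weights[0];
        for (int i = 0; i < weights.Length - 1; ++i)
            z += weights[i + 1] * dataItem[i];
        return 1.0 / (1.0 + Math.Exp(-z));
    }

    // Fraction of rows whose thresholded output matches the 0/1 label.
    public static double Accuracy(double[][] data, double[] weights)
    {
        int yIndex = data[0].Length - 1;
        int numCorrect = 0;
        for (int i = 0; i < data.Length; ++i)
        {
            int predicted = ComputeOutput(data[i], weights) > 0.5 ? 1 : 0;
            int actual = (int)data[i][yIndex];
            if (predicted == actual) ++numCorrect;
        }
        return (double)numCorrect / data.Length;
    }
}
```

The return value is a fraction between 0.0 and 1.0, so a result of 0.8050 corresponds to 80.50 percent accuracy.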

One of the downsides to using regularization is that the regularization weights must be determined. One approach for finding good regularization weights is to use manual trial and error, but a programmatic technique is usually better. A good L1 regularization weight is found and then used with these statements:

```
double alpha1 = lc.FindGoodL1Weight(trainData, seed);
weights = lc.Train(trainData, maxEpochs, seed, alpha1, 0.0);
trainAccuracy = lc.Accuracy(trainData, weights);
testAccuracy = lc.Accuracy(testData, weights);
```

The statements for training the LR classifier using L2 regularization are just like those for using L1 regularization:

```
double alpha2 = lc.FindGoodL2Weight(trainData, seed);
weights = lc.Train(trainData, maxEpochs, seed, 0.0, alpha2);
trainAccuracy = lc.Accuracy(trainData, weights);
testAccuracy = lc.Accuracy(testData, weights);
```

In the demo, the alpha1 and alpha2 values were determined using the LR object public-scope methods FindGoodL1Weight and FindGoodL2Weight and then passed to method Train. An alternative design is suggested by this calling code:

```
bool useL1 = true;
bool useL2 = false;
lc.Train(trainData, maxEpochs, useL1, useL2);
```

This design approach allows the training method to determine the regularization weights itself and leads to a somewhat cleaner interface.

## Understanding Regularization

Because L1 and L2 regularization are techniques to reduce model overfitting, in order to understand regularization, you must understand overfitting. Loosely speaking, if you train a model too much, you will eventually get weights that fit the training data extremely well, but when you apply the resulting model to new data, the prediction accuracy is very poor.

Overfitting is illustrated by the two graphs in **Figure 3**. The first graph shows a hypothetical situation where the goal is to classify two types of items, indicated by red and green dots. The smooth blue curve represents the true separation of the two classes, with red dots belonging above the curve and green dots belonging below the curve. Notice that because of random errors in the data, two of the red dots are below the curve and two green dots are above the curve. Good training, where overfitting doesn’t occur, would result in weights that correspond to the smooth blue curve. Suppose a new data point came in at (3, 7). The data item would be above the curve and be correctly predicted to be class red.

**Figure 3 Model Overfitting**

The second graph in **Figure 3** has the same dots but a different blue curve that is a result of overfitting. This time all the red dots are above the curve and all the green dots are below the curve. But the curve is too complex. A new data item at (3, 7) would be below the curve and be incorrectly predicted as class green.

Overfitting generates non-smooth prediction curves; in other words, curves that are not “regular.” Such poor, complex prediction curves are usually characterized by weights that have very large magnitudes, meaning large positive or large negative values. Therefore, one way to reduce overfitting is to prevent model weights from becoming large in magnitude. This is the motivation for regularization.

When an ML model is being trained, you must use some measure of error to determine good weights. There are several different ways to measure error. One of the most common techniques is the mean squared error, where you find the sum of squared differences between the computed output values for a set of weight values and the known, correct output values in the training data, and then divide that sum by the number of training items. For example, suppose for a candidate set of logistic regression weights, with just three training items, the computed outputs and correct output values (sometimes called the desired or target values) are:

```
computed desired
0.60 1.0
0.30 0.0
0.80 1.0
```

Here, the mean squared error would be:

```
((0.6 - 1.0)^2 + (0.3 - 0.0)^2 + (0.8 - 1.0)^2) / 3 =
(0.16 + 0.09 + 0.04) / 3 =
0.097
```

Expressed symbolically, mean squared error can be written:

```
E = Sum(o - t)^2 / n
```

where Sum represents the accumulated sum over all training items, o represents computed output, t is target output and n is the number of training data items. The error is what training minimizes using one of about a dozen numerical techniques with names like gradient descent, iterative Newton-Raphson, L-BFGS, back-propagation and swarm optimization.

In order to prevent the magnitude of model weight values from becoming large, the idea of regularization is to penalize weight values by adding those weight values to the calculation of the error term. If weight values are included in the total error term that’s being minimized, then smaller weight values will generate smaller error values. L1 weight regularization penalizes weight values by adding the sum of their absolute values to the error term. Symbolically:

```
E = Sum(o - t)^2 / n + Sum(Abs(w))
```

L2 weight regularization penalizes weight values by adding the sum of their squared values to the error term. Symbolically:

```
E = Sum(o - t)^2 / n + Sum(w^2)
```

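Suppose for this example there are four weights to be determined and their current values are (2.0, -3.0, 1.0, -4.0). The L1 weight penalty added to the 0.097 mean squared error would be (2.0 + 3.0 + 1.0 + 4.0) = 10.0. The L2 weight penalty would be 2.0^2 + (-3.0)^2 + 1.0^2 + (-4.0)^2 = 4.0 + 9.0 + 1.0 + 16.0 = 30.0. In practice, each penalty is multiplied by a small regularization weight (alpha1 or alpha2 in the demo) before being added, so that the penalty doesn’t swamp the error term.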
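The two penalties can be checked with a short C# sketch. The class and method names here are hypothetical and chosen only for illustration. The RegularizedError method scales each penalty by a regularization weight, mirroring what the demo's Error method does.

```csharp
using System;

public static class PenaltySketch
{
    // L1 penalty: sum of absolute weight values.
    public static double L1Penalty(double[] weights)
    {
        double sum = 0.0;
        foreach (double w in weights) sum += Math.Abs(w);
        return sum;
    }

    // L2 penalty: sum of squared weight values.
    public static double L2Penalty(double[] weights)
    {
        double sum = 0.0;
        foreach (double w in weights) sum += w * w;
        return sum;
    }

    // Error with both penalties, each scaled by its regularization weight.
    public static double RegularizedError(double mse, double[] weights,
        double alpha1, double alpha2)
    {
        return mse + alpha1 * L1Penalty(weights) + alpha2 * L2Penalty(weights);
    }
}
```

For the weights (2.0, -3.0, 1.0, -4.0), L1Penalty returns 10.0 and L2Penalty returns 30.0. Using the demo's L1 weight of 0.005 alone, the regularized error is 0.097 + (0.005)(10.0) = 0.147. Using the demo's L2 weight of 0.001 alone, it is 0.097 + (0.001)(30.0) = 0.127.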

To summarize, large model weights can lead to overfitting, which leads to poor prediction accuracy. Regularization limits the magnitude of model weights by adding a penalty for weights to the model error function. L1 regularization uses the sum of the absolute values of the weights. L2 regularization uses the sum of the squared values of the weights.

## Why Two Different Kinds of Regularization?

L1 and L2 regularization are similar. Which is better? The bottom line is that even though there are some theoretical guidelines about which form of regularization is better in certain problem scenarios, in my opinion, in practice you must experiment to find which type of regularization is better, or whether using regularization at all is better.

As it turns out, using L1 regularization can sometimes have a beneficial side effect of driving one or more weight values to 0.0, which effectively means the associated feature isn’t needed. This is one form of what’s called feature selection. For example, in the demo run in **Figure 1**, with L1 regularization the last model weight is 0.0. This means the last predictor value doesn’t contribute to the LR model. L2 regularization limits model weight values, but usually doesn’t prune any weights entirely by setting them to 0.0.

So, it would seem that L1 regularization is better than L2 regularization. However, a downside of using L1 regularization is that the technique can’t be easily used with some ML training algorithms, in particular those algorithms that use calculus to compute what’s called a gradient. L2 regularization can be used with any type of training algorithm.
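The difficulty can be seen by differentiating the two penalty terms with respect to a single weight w. The squared term used by L2 regularization has a derivative everywhere. The absolute value used by L1 regularization has no derivative at w = 0, which is exactly where L1 regularization tends to push weights:

```latex
\frac{\partial}{\partial w}\, w^2 = 2w,
\qquad
\frac{\partial}{\partial w}\, |w| =
\begin{cases}
+1 & \text{if } w > 0 \\
-1 & \text{if } w < 0 \\
\text{undefined} & \text{if } w = 0
\end{cases}
```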

To summarize, L1 regularization sometimes has a nice side effect of pruning out unneeded features by setting their associated weights to 0.0 but L1 regularization doesn’t easily work with all forms of training. L2 regularization works with all forms of training, but doesn’t give you implicit feature selection. In practice, you must use trial and error to determine which form of regularization (or neither) is better for a particular problem.

## Implementing Regularization

Implementing L1 and L2 regularization is relatively easy. The demo program uses PSO training with an explicit error function, so all that’s necessary is to add the L1 and L2 weight penalties. The definition of method Error begins with:

```
public double Error(double[][] trainData, double[] weights,
  double alpha1, double alpha2)
{
  int yIndex = trainData[0].Length - 1;
  double sumSquaredError = 0.0;
  for (int i = 0; i < trainData.Length; ++i)
  {
    double computed = ComputeOutput(trainData[i], weights);
    double desired = trainData[i][yIndex];
    sumSquaredError += (computed - desired) * (computed - desired);
  }
...
```

The first step is to compute the mean squared error by summing the squared differences between computed outputs and target outputs. (Another common form of error is called cross-entropy error.) Next, the L1 penalty is calculated:

```
double sumAbsVals = 0.0; // L1 penalty
for (int i = 0; i < weights.Length; ++i)
  sumAbsVals += Math.Abs(weights[i]);
```

Then the L2 penalty is calculated:

```
double sumSquaredVals = 0.0; // L2 penalty
for (int i = 0; i < weights.Length; ++i)
  sumSquaredVals += (weights[i] * weights[i]);
```

Method Error returns the MSE plus the penalties:

```
...
  return (sumSquaredError / trainData.Length) +
    (alpha1 * sumAbsVals) +
    (alpha2 * sumSquaredVals);
}
```

The demo uses an explicit error function. Some training algorithms, such as gradient descent and back-propagation, use the error function implicitly by computing the calculus partial derivative (called the gradient) of the error function. For those training algorithms, to use L2 regularization (because the derivative of w^2 is 2w), you just add a 2w term to the gradient (although the details can be a bit tricky).
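To make the idea concrete, here is a hedged sketch of a single batch gradient descent step for logistic regression with mean squared error and an L2 penalty. It isn't part of the demo, which trains with PSO. The class name GradientSketch, the Step method and the learnRate parameter are assumptions for illustration. Like the demo's Error method, the sketch penalizes every weight, including the b0 constant.

```csharp
using System;

public static class GradientSketch
{
    static double ComputeOutput(double[] dataItem, double[] weights)
    {
        double z = weights[0];
        for (int j = 0; j < weights.Length - 1; ++j)
            z += weights[j + 1] * dataItem[j];
        return 1.0 / (1.0 + Math.Exp(-z));
    }

    // One batch gradient descent step on E = MSE + alpha2 * Sum(w^2).
    // Returns a new weights array; the input array is not modified.
    public static double[] Step(double[][] trainData, double[] weights,
        double learnRate, double alpha2)
    {
        int yIndex = trainData[0].Length - 1;
        double[] grad = new double[weights.Length];

        // Gradient of the mean squared error term. For one item, the
        // derivative of (o - t)^2 with respect to z is 2(o - t) * o(1 - o).
        for (int i = 0; i < trainData.Length; ++i)
        {
            double o = ComputeOutput(trainData[i], weights);
            double t = trainData[i][yIndex];
            double delta = 2.0 * (o - t) * o * (1.0 - o) / trainData.Length;
            grad[0] += delta; // b0 has an implicit input of 1.0
            for (int j = 0; j < yIndex; ++j)
                grad[j + 1] += delta * trainData[i][j];
        }

        // L2 penalty: the derivative of alpha2 * w^2 is 2 * alpha2 * w.
        for (int j = 0; j < weights.Length; ++j)
            grad[j] += 2.0 * alpha2 * weights[j];

        double[] result = new double[weights.Length];
        for (int j = 0; j < weights.Length; ++j)
            result[j] = weights[j] - learnRate * grad[j];
        return result;
    }
}
```

Compared with an unregularized step, each weight is reduced by an extra learnRate * 2 * alpha2 * w, so every update shrinks the weights toward zero in proportion to their size.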

## Finding Good Regularization Weights

There are several ways to find a good (but not necessarily optimal) regularization weight. The demo program sets up a set of candidate values, computes the error associated with each candidate, and returns the best candidate found. The method to find a good L1 weight begins:

```
public double FindGoodL1Weight(double[][] trainData, int seed)
{
  double result = 0.0;
  double bestErr = double.MaxValue;
  double currErr = double.MaxValue;
  double[] candidates = new double[] { 0.000, 0.001, 0.005,
    0.010, 0.020, 0.050, 0.100, 0.150 };
  int maxEpochs = 1000;
  LogisticClassifier c =
    new LogisticClassifier(this.numFeatures);
```

Adding additional candidates would give you a better chance of finding an optimal regularization weight at the expense of time. Next, each candidate is evaluated, and the best candidate found is returned:

```
  for (int trial = 0; trial < candidates.Length; ++trial) {
    double alpha1 = candidates[trial];
    double[] wts = c.Train(trainData, maxEpochs, seed, alpha1, 0.0);
    currErr = Error(trainData, wts, 0.0, 0.0);
    if (currErr < bestErr) {
      bestErr = currErr;
      result = candidates[trial];
    }
  }
  return result;
}
```

Notice the candidate regularization weight is used to train the evaluation classifier, but the error is computed without the regularization weight.
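The search loop itself doesn't depend on which kind of regularization is being tuned, so one possible refactoring is a helper that takes the candidates and a delegate that maps a candidate weight to an error value. This is a hypothetical design, not the demo's; the names WeightSearchSketch, FindGoodWeight and evaluate are assumptions.

```csharp
using System;

public static class WeightSearchSketch
{
    // Returns the candidate with the smallest error according to evaluate.
    // On ties the earliest candidate wins, because of the strict '<' test.
    public static double FindGoodWeight(double[] candidates,
        Func<double, double> evaluate)
    {
        double result = 0.0;
        double bestErr = double.MaxValue;
        foreach (double alpha in candidates)
        {
            double currErr = evaluate(alpha);
            if (currErr < bestErr)
            {
                bestErr = currErr;
                result = alpha;
            }
        }
        return result;
    }
}
```

Inside LogisticClassifier, an L2 search could then supply a delegate that trains with the candidate as the alpha2 argument and returns Error(trainData, wts, 0.0, 0.0), with the L1 search differing only in which argument receives the candidate.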

## Wrapping Up

Regularization can be used with any ML classification technique that’s based on a mathematical equation. Examples include logistic regression, probit classification and neural networks. Because it reduces the magnitudes of the weight values in a model, regularization is sometimes called weight decay. The major advantage of using regularization is that it often leads to a more accurate model. The major disadvantage is that it introduces an additional parameter value that must be determined, the regularization weight. In the case of logistic regression this isn’t too serious because there’s usually just the learning rate parameter, but when using more complex classification techniques, neural networks in particular, adding another so-called hyperparameter can create a lot of additional work to tune the combined values of the parameters.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.*

Thanks to the following technical expert at Microsoft Research for reviewing this article: Richard Hughes