October 2012

Volume 27 Number 10

Test Run - Neural Network Back-Propagation for Programmers

By James McCaffrey | October 2012

James McCaffreyAn artificial neural network can be thought of as a meta-function that accepts a fixed number of numeric inputs and produces a fixed number of numeric outputs. In most situations, a neural network has a layer of hidden neurons where each hidden neuron is fully connected to the input neurons and the output neurons. Associated with each individual hidden neuron and each individual output neuron are a set of weight values and a single so-called bias value. The weights and biases determine the output values for a given set of input values.

When neural networks are used to model a set of existing data so that predictions can be made on new data, the main challenge is to find the set of weight and bias values that generate the outputs that best match the existing data. The most common technique for estimating optimal neural network weights and biases is called back-propagation. Although there are many excellent references that describe the complicated mathematics that underlie back-propagation, there are very few guides available for programmers that clearly explain how to program the back-propagation algorithm. This article explains how to implement back-propagation. I use the C# language, but you should have no trouble refactoring the code presented here to other languages.

The best way to see where I’m headed is to take a look at the screenshot of a demo program in Figure 1. The demo program creates a neural network that has three input neurons, a hidden layer with four neurons, and two output neurons. Neural networks with a single hidden layer need two activation functions. In many situations, though, the two activation functions are the same, typically the sigmoid function. But in this demo, in order to illustrate the relationship between activation functions and back-propagation, I use different activation functions: the sigmoid function for the input-to-hidden computations, and the tanh (hyperbolic tangent) function for the hidden-to-output computations.

Back-Propagation Algorithm in Action
Figure 1 Back-Propagation Algorithm in Action

A fully connected 3-4-2 neural network requires 3*4 + 4*2 = 20 weight values and 4+2 = 6 bias values for a total of 26 weights and biases. These weights and biases are initialized to more or less arbitrary values. The three dummy input values are set to 1.0, 2.0 and 3.0. With the initial weight, bias and input values, the initial output values are computed by the neural network to be {0.7225, -0.8779}. The demo program arbitrarily assumes that the two correct output values are {-0.8500, 0.7500}. The goal of the back-propagation algorithm is to find a new set of weights and biases that generate outputs that are very close to the correct values for inputs {1.0, 2.0, 3.0}.

Back-propagation requires two free parameters. The learning rate, usually given the Greek letter eta in back-propagation literature, controls how fast the algorithm converges to a final estimate. The momentum, usually given the Greek letter alpha, helps the back-propagation algorithm avoid situations in which the algorithm oscillates and never converges to a final estimate. The demo program sets the learning rate to 0.90 and the momentum to 0.04. Typically these values are found by trial and error.

Finding the best set of weights and biases for a neural network is sometimes called training the network. Training with back-propagation is an iterative process. At each iteration, back-propagation computes a new set of neural network weight and bias values that in theory generate output values that are closer to the target values. After the first training iteration of the demo program, the back-propagation algorithm found new weight and bias values that generated new outputs of {-0.8932, -0.8006}. The new first output value of -0.8932 was much closer to the first target output value of -0.8500. The second new output value of -0.8006 was still far away from its target value of 0.7500.

The training process can be terminated in a variety of ways. The demo program iterates training until the sum of the absolute differences between output values and target values is <= 0.01 or the training reaches 1,000 iterations. In the demo, after six iterations of training, back-propagation found a set of neural network weight and bias values that generated outputs of {-0.8423, 0.7481}, which were very close to the {-0.8500, 0.7500} desired target values.

This article assumes you have expert-level programming skills and that you have a very basic understanding of neural networks. (For basic information on neural networks, see my May 2012 article, “Dive into Neural Networks,” at msdn.microsoft.com/magazine/hh975375.) The code for the demo program shown in Figure 1 is a bit too long to present in this article, so I’ll concentrate on explaining the key parts of the algorithm. The complete source code for the demo program is available at msdn.microsoft.com/magazine/msdnmag1012.

Defining a Neural Network Class

Coding a neural network that uses back-propagation lends itself nicely to an object-oriented approach. The class definition used for the demo program is listed in Figure 2.

Figure 2 Neural Network Class

class NeuralNetwork
  private int numInput;
  private int numHidden;
  private int numOutput;
  // 15 input, output, weight, bias, and other arrays here
  public NeuralNetwork(int numInput, 
    int numHidden, int numOutput) {...}
  public void UpdateWeights(double[] tValues, 
    double eta, double alpha) {...}
  public void SetWeights(double[] weights) {...}
  public double[] GetWeights() {...}
  public double[] ComputeOutputs(double[] xValues) {...}
  private static double SigmoidFunction(double x)
    if (x < -45.0) return 0.0;
    else if (x > 45.0) return 1.0;
    else return 1.0 / (1.0 + Math.Exp(-x));
  private static double HyperTanFunction(double x)
    if (x < -10.0) return -1.0;
    else if (x > 10.0) return 1.0;
    else return Math.Tanh(x);

Member fields, numInput, numHidden and numOutput are defining characteristics of the neural network architecture. In addition to a simple constructor, the class has four publicly accessible methods and two helper methods. Method UpdateWeights contains all the logic of the back-propagation algorithm. Method SetWeights accepts an array of weights and biases and copies those values sequen­tially into member arrays. Method GetWeights performs the reverse oper­ation by copying the weights and biases into a single array and returning that array. Method ComputeOutputs determines the neural network output values using the current input, weight and bias values.

Method SigmoidFunction is used as the input-to-hidden activation function. It accepts a real value (type double in C#) and returns a value between 0.0 and 1.0. Method HyperTanFunction also accepts a real value but returns a value between -1.0 and +1.0. The C# language has a built-in hyperbolic tangent function, Math.Tanh, but if you’re using a language that doesn’t have a native tanh function, you’ll have to code one from scratch.

Setting up the Arrays

One of the keys to successfully programming a neural network back-propagation algorithm is to fully understand the arrays that are being used to store weight and bias values, store different kinds of input and output values, store values from a previous iteration of the algorithm, and store scratch calculations. The large diagram in Figure 3 contains all the information you need to know to understand how to program back-propagation. Your initial reaction to Figure 3 is likely to be something along the lines of, “Forget it—this is too complicated.” Hang in there. Back-propagation is not trivial, but once you understand the diagram you’ll be able to implement back-propagation using any programming language.

The Back-Propagation Algorithm
Figure 3 The Back-Propagation Algorithm

Figure 3 has primary inputs and outputs at the edges of the figure, but also several local input and output values that occur in the interior of the diagram. You should not underestimate the difficulty of coding a neural network and the need to keep the names and meanings of all these inputs and outputs clear. Based on my experience, a diagram like the one in Figure 3 is absolutely essential.

 The first five of the 15 arrays used in the neural network definition outlined in Figure 2 deal with the input-to-hidden layers and are:

public class NeuralNetwork
  // Declare numInput, numHidden, numOutput
  private double[] inputs;
  private double[][] ihWeights;
  private double[] ihSums;
  private double[] ihBiases;
  private double[] ihOutputs;

The first array, named inputs, holds the numeric input values. These values typically come directly from some normalized data source such as a text file. The NeuralNetwork constructor instantiates inputs as:

this.inputs = new double[numInput];

Array ihWeights (input-to-hidden weights) is a virtual two-dimensional array implemented as an array of arrays. The first index indicates the input neuron and the second index indicates the hidden neuron. The array is instantiated by the constructor as:

this.ihWeights = Helpers.MakeMatrix(numInput, numHidden);

Here, Helpers is a utility class of static methods that help simplify the neural network class:

public static double[][] MakeMatrix(int rows, int cols)
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];
  return result;

Array ihSums is a scratch array that’s used to hold an intermediate calculation in the ComputeOutputs method. The array holds values that will become the local inputs for the hidden neurons and is instantiated as:

this.ihSums = new double[numHidden];

Array ihBiases holds the bias values for the hidden neurons. Neural network weight values are constants that are applied by multiplying them with a local input value. Bias values are added to an intermediate sum to produce a local output value, which becomes the local input to the next layer. Array ihBiases is instantiated as:

this.ihBiases = new double[numHidden];

Array ihOutputs holds the values that are emitted from the hidden-layer neurons (which become the inputs to the output layer).

The next four arrays in the NeuralNetwork class hold values related to the hidden-to-output layer:

private double[][] hoWeights;
private double[] hoSums;
private double[] hoBiases;
private double[] outputs;

These four arrays are instantiated in the constructor as:

this.hoWeights = Helpers.MakeMatrix(numHidden, numOutput);
this.hoSums = new double[numOutput];
this.hoBiases = new double[numOutput];
this.outputs = new double[numOutput];

The neural network class has six arrays that are directly related to the back-propagation algorithm. The first two arrays hold values called the gradients for the output- and hidden-layer neurons. A gradient is a value that indirectly describes how far off, and in what direction (positive or negative), local outputs are relative to the target outputs. Gradient values are used to compute delta values, which are added to current weight and bias values to produce new, better weights and biases. There’s one gradient value for each hidden-layer neuron and each output-layer neuron. The arrays are declared as:

private double[] oGrads; // Output gradients
private double[] hGrads; // Hidden gradients

The arrays are instantiated in the constructor as:

this.oGrads = new double[numOutput];
this.hGrads = new double[numHidden];

The final four arrays in class NeuralNetwork hold the deltas (not gradients) from the previous iteration of the training loop. These previous deltas are required if you use the momentum mechanism to prevent back-propagation non-convergence. I consider momentum essential, but if you decide not to implement momentum you can omit these arrays. They are declared as:

private double[][] ihPrevWeightsDelta;  // For momentum
private double[] ihPrevBiasesDelta;
private double[][] hoPrevWeightsDelta;
private double[] hoPrevBiasesDelta;

These arrays are instantiated as:

ihPrevWeightsDelta = Helpers.MakeMatrix(numInput, numHidden);
ihPrevBiasesDelta = new double[numHidden];
hoPrevWeightsDelta = Helpers.MakeMatrix(numHidden, numOutput);
hoPrevBiasesDelta = new double[numOutput];

Computing Outputs

Each iteration in the training loop shown in Figure 1 has two parts. In the first part, outputs are computed using the current primary inputs, weights and biases. In the second part, back-propagation is used to modify the weights and biases. The diagram in Figure 3 illustrates both parts of the training process.

Working from left to right, inputs x0, x1 and x2 are assigned values of 1.0, 2.0 and 3.0. These primary input values go into the input-layer neurons and are emitted without modification. Although input-layer neurons can modify their input, such as normalizing the values to be within a certain range, in most cases such processing is done externally. Because of this, neural network diagrams often use rectangles or square boxes for the input neurons to indicate they aren’t processing neurons in the same sense that the hidden-layer and output-layer neurons are. Additionally, this affects the terminology used. In some cases, the neural network shown in Figure3 would be called a three-layer network, but because the input layer doesn’t perform processing, the neural network shown is sometimes called a two-layer network.

Next, each of the hidden-layer neurons computes a local input and a local output. For example, the bottommost hidden neuron, with index [3], computes its scratch sum as (1.0)(0.4)+(2.0)(0.8)+(3.0)(1.2) = 5.6. The scratch sum is the product of the sum of the three inputs times the associated input-to-hidden weight. The values above each arrow are the weights. Next, the bias value, -7.0, is added to the scratch sum to yield a local input value of 5.6 + (-7.0) = -1.40. Then the input-to-hidden activation function is applied to this intermediate input value to yield the local output value of the neuron. In this case, the activation function is the sigmoid function, so the local output is 1 / (1 + exp(-(-1.40))) = 0.20.

The output-layer neurons compute their input and output similarly. For example, in Figure 3, the bottommost output-layer neuron with index[1] computes its scratch sum as (0.86)(1.4)+(0.17)(1.6)+(0.98)(1.8)+(0.20)(2.0) = 3.73. The associated bias is added to give the local input: 3.73 + (-5.0) = -1.37. And the activation function is applied to give the primary output: tanh(-1.37) = -0.88. If you examine the code for ComputeOutputs, you’ll see that the method computes outputs exactly as I’ve just described.


Although the math behind the theory of back-propagation is fairly complicated, once you know what those math results are, implementing back-propagation is not too difficult. Back-propagation starts by working from right to left in the diagram shown in Figure 3. The first step is to compute the gradient values for each output-layer neuron. Recall the gradient is a value that has information regarding the magnitude and direction of an error. The gradients for the output-layer neurons are computed differently from the gradients for the hidden-layer neurons.

The gradient of an output-layer neuron is equal to the target (desired) value minus the computed output value, times the calculus derivative of the output-layer activation function evaluated at the computed output value. For example, the gradient value of the bottommost output-layer neuron in Figure 3, with index [1], is computed as:

(0.75 – (-0.88)) * (1 – (-0.88)) * (1 + (-0.88)) = 0.37   

The 0.75 is the desired value. The -0.88 is the computed output value from the forward-pass computation. Recall that in this example the output-layer activation function is the tanh function. The calculus derivative of tanh(x) is (1 - tanh(x)) * (1 + tanh(x)). The math analysis is a bit tricky but, ultimately, computing the gradient of an output-layer neuron is given by the formula described here.

The gradient of a hidden-layer neuron is equal to the calculus derivative of the activation function of the hidden layer evaluated at the local output of the neuron times the sum of the product of the primary outputs times their associated hidden-to-output weights. For example, in Figure 3, the gradient of the bottommost hidden-layer neuron with index [3] is:

(0.20)(1 – 0.20) * [ (-0.76)(1.9) + (0.37)(2.0) ] = -0.03

If we call the sigmoid function g(x), it turns out that the calculus derivative of the sigmoid function is g(x) * (1 - g(x)). Recall that this example uses the sigmoid function for the input-to-hidden activation function. Here the 0.20 is the local output from the neuron. The -0.76 and 0.37 are the gradients of the output-layer neurons, and the 1.9 and 2.0 are the hidden-to-output weights associated with the two output-layer gradients.

Computing the Weight and Bias Deltas

After all the output-layer gradients and hidden-layer gradients have been computed, the next step in the back-propagation algorithm is to use the gradient values to compute delta values for each weight and bias value. Unlike the gradients, which must be computed right to left, the delta values can be computed in any order. The delta value for any weight or bias is equal to eta times the gradient associated with the weight or bias, times the input value associated with the weight or bias. For example, the delta value for the input-to-hidden weight from input neuron [2] to hidden neuron [3] is:

delta i-h weight[2][3] = eta * hidden gradient[3] * input[2]
= 0.90 * (-0.11) * 3.0
= -0.297  

The 0.90 is eta, which controls how fast the back-propagation learns. Larger values of eta produce larger changes in delta, with the risk of overshooting a good answer. The -0.11 value is the gradient for hidden neuron [3]. The 3.0 value is the input value for input neuron [2]. In terms of the diagram in Figure 3, if a weight is represented as an arrow from one neuron to another, to compute the delta for a particular weight, you use the gradient value of the neuron pointed to on the right and the input value of the neuron pointed from on the left.

When computing the deltas for bias values, notice that because bias values are simply added to an intermediate sum, they have no associated input value. So, to compute the delta for a bias value you can either omit the input value term altogether, or use a dummy 1.0 value as a form of documentation. For example, in Figure 3, the bottommost hidden-layer bias has value -7.0. The delta for that bias value is:

0.90 * gradient of neuron pointed to * 1.0
= 0.90 * (-0.11) * 1.0
= 0.099

Adding a Momentum Term

After all weight and bias delta values have been computed, it’s possible to update each weight and bias by simply adding the associated delta value. However, experience with neural networks has shown that with certain data sets, the back-propagation algorithm can oscillate, repeatedly overshooting and then undershooting the target value and never converging to a final set of weight and bias estimates. One technique for reducing this tendency is to add to each new weight and bias an additional term called momentum. The momentum for a weight (or bias) is just some small value (like 0.4 in the demo program) times the value of the previous delta for the weight. Using momentum adds a small amount of complexity to the back-propagation algorithm because the values of previous deltas must be stored. The math behind why this technique prevents oscillation is subtle, but the result is simple.

To summarize, to update a weight (or bias) using back-propagation, the first step is to compute gradients for all output-layer neurons. The second step is to compute gradients for all hidden-layer neurons. The third step is to compute deltas for all weights using eta, the learning rate. The fourth step is to add the deltas to each weight. The fifth step is to add a momentum term to each weight.  

Coding with Visual Studio 2012

The explanation of back-propagation presented in this article, together with the sample code, should give you enough information to understand and code the back-propagation algorithm. Back-propagation is just one of several techniques that can be used to estimate the best weight and bias values for a data set. Compared to alternatives such as particle swarm optimization and evolutionary optimization algorithms, back-propagation tends to be faster. But back-propagation does have disadvantages. It can’t be used with neural networks that use non-differentiable activation functions. Determining good values for the learning rate and momentum parameters is more art than science and can be time-consuming.

There are several important topics that this article does not address, in particular how to deal with multiple target data items. I’ll explain this concept and other neural network techniques in future articles.

When I coded the demo program for this article, I used the beta version of Visual Studio 2012. Even though many of the new features in Visual Studio 2012 are related to Windows 8 apps, I wanted to see how Visual Studio 2012 handled good old console applications. I was pleasantly surprised that I wasn’t unpleasantly surprised by any of the new features in Visual Studio 2012. My transition to Visual Studio 2012 was effortless. Although I didn’t make use of the new Async feature in Visual Studio 2012, it could have been useful when computing all the delta values for each weight and bias. I tried out the new Call Hierarchy feature and found it useful and intuitive. My initial impressions of Visual Studio 2012 were favorable, and I plan to transition to it as soon as I’m able.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Dan Liebling