September 2017

Volume 32 Number 9

### Test Run

# Deep Neural Network Training

A regular feed-forward neural network (FNN) has a set of input nodes, a set of hidden processing nodes and a set of output nodes. For example, if you wanted to predict the political leaning of a person (conservative, moderate, liberal) based on their age and income, you could create an FNN with two input nodes, eight hidden nodes and three output nodes. The number of input and output nodes is determined by the structure of your data, but the number of hidden nodes is a free parameter that you have to determine using trial and error.

A deep neural network (DNN) has two or more hidden layers. In this article I’ll explain how to train a DNN using the back-propagation algorithm and describe the associated “vanishing gradient” problem. After reading this article, you’ll have code to experiment with, and a better understanding of what goes on behind the scenes when you use a neural network library such as the Microsoft Cognitive Toolkit (CNTK) or Google TensorFlow.

Take a look at the DNN shown in **Figure 1**. The DNN has three hidden layers that have four, two and two nodes. The input node values are 3.80 and 5.00, which you can imagine as the normalized age (38 years old) and income ($50,000 per year) of a person. Each of the eight hidden nodes has a value between -1.0 and +1.0 because the DNN uses tanh activation.

**Figure 1 An Example 2-(4,2,2)-3 Deep Neural Network**

The preliminary output node values are (-0.109, 3.015, 1.202). These values are converted to the final output values of (0.036, 0.829, 0.135) using the softmax function. The purpose of softmax is to coerce the output node values to sum to 1.0 so that they can be interpreted as probabilities. Assuming the output nodes represent the probabilities of “conservative”, “moderate” and “liberal,” respectively, the person in this example is predicted to be a political moderate.
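The softmax step can be sketched in a few lines. This is a minimal illustration, not the demo's own code: the class and method names are hypothetical, and subtracting the largest value before exponentiating is a common overflow guard that I'm assuming here. Fed the preliminary values from Figure 1, it produces three values that sum to 1.0 and closely match the final output values, within rounding.

```csharp
using System;

public static class SoftmaxSketch {
  // Converts raw output node values into values that sum to 1.0.
  public static double[] Softmax(double[] values) {
    // Subtract the largest value first; this avoids overflow in
    // Math.Exp and does not change the result.
    double max = values[0];
    for (int i = 1; i < values.Length; ++i)
      if (values[i] > max) max = values[i];

    double sum = 0.0;
    double[] result = new double[values.Length];
    for (int i = 0; i < values.Length; ++i) {
      result[i] = Math.Exp(values[i] - max);
      sum += result[i];
    }
    for (int i = 0; i < values.Length; ++i)
      result[i] /= sum;
    return result;
  }

  public static void Main() {
    double[] preliminary = new double[] { -0.109, 3.015, 1.202 };
    double[] probs = Softmax(preliminary);
    for (int i = 0; i < probs.Length; ++i)
      Console.Write(probs[i].ToString("F3") + " ");
    Console.WriteLine();
  }
}
```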

In **Figure 1**, each of the arrows connecting nodes represents a numeric constant called a weight. Weight values are typically between -10.0 and +10.0, but can be any value in principle. Each of the small arrows pointing into the hidden nodes and the output nodes is a numeric constant called a bias. The values of the weights and the biases determine the output node values.
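To make the role of weights and biases concrete, here is a rough sketch of the input-output computation for a 2-(4,2,2)-3 network. The weight values are hypothetical constants and the biases are all zero, so the results will not match Figure 1; the names Layer and Constant are my own and are not part of the demo program.

```csharp
using System;

public static class FeedForwardSketch {
  // Computes one layer: out[j] = act(bias[j] + sum of in[i] * w[i][j]).
  public static double[] Layer(double[] input, double[][] weights,
    double[] biases, bool useTanh) {
    double[] output = new double[biases.Length];
    for (int j = 0; j < output.Length; ++j) {
      double sum = biases[j];
      for (int i = 0; i < input.Length; ++i)
        sum += input[i] * weights[i][j];
      output[j] = useTanh ? Math.Tanh(sum) : sum;
    }
    return output;
  }

  // Hypothetical helper: a rows x cols matrix filled with one value.
  public static double[][] Constant(int rows, int cols, double value) {
    double[][] m = new double[rows][];
    for (int i = 0; i < rows; ++i) {
      m[i] = new double[cols];
      for (int j = 0; j < cols; ++j)
        m[i][j] = value;
    }
    return m;
  }

  public static void Main() {
    int[] sizes = new int[] { 2, 4, 2, 2, 3 };  // 2-(4,2,2)-3
    double[] values = new double[] { 3.80, 5.00 };
    for (int layer = 1; layer < sizes.Length; ++layer) {
      bool isHidden = layer < sizes.Length - 1;  // tanh on hidden layers only
      double[][] w = Constant(sizes[layer - 1], sizes[layer], 0.01 * layer);
      double[] b = new double[sizes[layer]];  // biases all 0.0
      values = Layer(values, w, b, isHidden);
    }
    // Preliminary output node values, before softmax is applied.
    Console.WriteLine(string.Join(" ", values));
  }
}
```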

The process of finding the values of the weights and the biases is called training the network. The idea is to use a large set of training data, which has known input values and known correct output values, and then use an optimization algorithm to find values for the weights and biases so that the difference between computed output values and known correct output values is minimized. There are several algorithms that can be used to train a DNN, but by far the most common is the back-propagation algorithm.

This article assumes you have a basic understanding of the neural network input-output mechanism and at least intermediate level programming skills. The demo program is coded using C# but you shouldn’t have too much trouble refactoring the demo to another language such as Python or Java if you wish. The demo program is too long to present in its entirety in this article, but the complete program is available in the accompanying code download.

## The Demo Program

The demo begins by generating 2,000 synthetic training data items. Each item has four input values between -4.0 and +4.0, followed by three output values, which can be (1, 0, 0) or (0, 1, 0) or (0, 0, 1), representing three possible categorical values. The first and last training items are:

```
[ 0] 2.24 1.91 2.52 2.41 0.00 1.00 0.00
...
[1999] 1.30 -2.41 -3.18 0.11 1.00 0.00 0.00
```

Behind the scenes, the dummy data is generated by creating a 4-(10,10,10)-3 DNN with random weight and bias values, and then feeding random input values to the network. After the training data is generated, the demo creates a new 4-(10,10,10)-3 DNN and trains it using the back-propagation algorithm. During training, the current mean squared error and classification accuracy are displayed every 200 iterations.

The error slowly decreases and the accuracy slowly increases, as you’d expect. After training completes, the final accuracy of the DNN model is 93.45 percent, which means that 0.9345 * 2000 = 1869 items were correctly classified and therefore 131 items were incorrectly classified. The demo code that generates the output begins with:

```
using System;
namespace DeepNetTrain
{
  class DeepNetTrainProgram {
    static void Main(string[] args) {
      Console.WriteLine("Begin deep net demo");
      int numInput = 4;
      int[] numHidden = new int[] { 10, 10, 10 };
      int numOutput = 3;
...
```

The demo program uses only plain C# with no namespaces except for System. First, the DNN to generate the simulated training data is prepared. The number of hidden layers, 3, is passed implicitly as the number of items in the numHidden array. An alternative design is to pass the number of hidden layers explicitly. Next, the training data is generated using helper method MakeData:

```
int numDataItems = 2000;
Console.WriteLine("Generating " + numDataItems +
  " artificial training data items ");
double[][] trainData = MakeData(numDataItems,
  numInput, numHidden, numOutput, 5);
Console.WriteLine("Done. Training data is: ");
ShowMatrix(trainData, 3, 2, true);
```

The 5 passed to MakeData is a seed value for a random object so that demo runs will be reproducible. The value of 5 was used only because it gave a nice demo. The call to helper ShowMatrix displays the first 3 rows and the last row of the generated data, with 2 decimal places, showing indices (true). Next, the DNN is created and training is prepared:
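The effect of a seed can be seen in isolation with a small sketch. The MakeInputs helper below is hypothetical and is not the demo's MakeData method; it only shows that two Random objects created with the same seed produce the same sequence, which is what makes a run repeatable.

```csharp
using System;

public static class SeedSketch {
  // Returns n pseudo-random values in [lo, hi) from a seeded generator.
  public static double[] MakeInputs(int n, double lo, double hi, int seed) {
    Random rnd = new Random(seed);
    double[] result = new double[n];
    for (int i = 0; i < n; ++i)
      result[i] = (hi - lo) * rnd.NextDouble() + lo;
    return result;
  }

  public static void Main() {
    double[] first = MakeInputs(4, -4.0, 4.0, 5);
    double[] second = MakeInputs(4, -4.0, 4.0, 5);
    for (int i = 0; i < first.Length; ++i)
      Console.WriteLine(first[i].ToString("F2") + "  " +
        second[i].ToString("F2"));  // the two columns are identical
  }
}
```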

```
Console.WriteLine("Creating a 4-(10,10,10)-3 DNN");
DeepNet dn = new DeepNet(numInput, numHidden, numOutput);
int maxEpochs = 2000;
double learnRate = 0.001;
double momentum = 0.01;
```

The demo uses a program-defined DeepNet class. The back-propagation algorithm is iterative so a maximum number of iterations, 2,000 in this case, must be specified. The learning rate parameter controls how much the weights and bias values are adjusted each time a training item is processed. A small learning rate could result in training being too slow (hours, days or more) but a large learning rate could lead to wildly oscillating results that never stabilize. Picking a good learning rate is a matter of trial and error and is a major challenge when working with DNNs. The momentum factor is somewhat like an auxiliary learning rate, and typically speeds up training when a small learning rate is used.
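The trade-off between small and large learning rates shows up even in a toy problem. The sketch below is an assumption-laden simplification, not part of the demo: a single weight is nudged toward a hypothetical target of 1.0, and the "gradient" is simply (target - weight). With a tiny rate the weight barely moves in 20 steps, with a moderate rate it converges, and with a rate that is too large it overshoots by more on every step.

```csharp
using System;

public static class LearnRateSketch {
  // Repeatedly nudges a single weight toward a target of 1.0.
  public static double Train(double learnRate, int steps) {
    double target = 1.0;  // hypothetical value the weight should reach
    double w = 0.0;
    for (int i = 0; i < steps; ++i) {
      double g = target - w;  // (target - output) convention
      w = w + (learnRate * g);
    }
    return w;
  }

  public static void Main() {
    double[] rates = new double[] { 0.01, 0.5, 2.1 };
    for (int i = 0; i < rates.Length; ++i)
      Console.WriteLine("learnRate = " + rates[i].ToString("F2") +
        "  weight after 20 steps = " + Train(rates[i], 20).ToString("F4"));
  }
}
```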

The demo program calling code concludes with:

```
...
double[] wts = dn.Train(trainData, maxEpochs,
  learnRate, momentum, 10);
Console.WriteLine("Training complete");
double trainError = dn.Error(trainData, false);
double trainAcc = dn.Accuracy(trainData, false);
Console.WriteLine("Final model MS error = " +
  trainError.ToString("F4"));
Console.WriteLine("Final model accuracy = " +
  trainAcc.ToString("F4"));
Console.WriteLine("End demo ");
} // Main
```

The Train method uses the back-propagation algorithm to find values for the weights and biases so that the difference between computed output values and correct output values is minimized. The values of both the weights and biases are returned by Train. The argument of 10 passed to Train means to display progress messages every 2,000 / 10 = 200 iterations. It’s important to monitor progress because bad things can, and often do, happen when training a neural network.

After training completes, the final error and accuracy of the model are calculated and displayed using the final weights and bias values, which are still inside the DNN. The weights and biases could have been explicitly reloaded by executing the statement dn.SetWeights(wts), but it’s not necessary in this case. The “false” arguments passed to methods Error and Accuracy mean to not display diagnostic messages.
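The bodies of Error and Accuracy aren't shown in this article, but the two metrics can be sketched independently. The code below is an illustration under two assumptions of mine: mean squared error is the sum of squared differences for each item, averaged over the items, and an item counts as correct when the largest computed output is in the same position as the 1 in the target. All names are hypothetical.

```csharp
using System;

public static class MetricsSketch {
  // Index of the largest value in a vector.
  public static int ArgMax(double[] v) {
    int best = 0;
    for (int i = 1; i < v.Length; ++i)
      if (v[i] > v[best]) best = i;
    return best;
  }

  // Sum of squared differences per item, averaged over all items.
  public static double MeanSquaredError(double[][] computed,
    double[][] targets) {
    double sumSquared = 0.0;
    for (int i = 0; i < computed.Length; ++i) {
      for (int k = 0; k < computed[i].Length; ++k) {
        double diff = targets[i][k] - computed[i][k];
        sumSquared += diff * diff;
      }
    }
    return sumSquared / computed.Length;
  }

  // Fraction of items whose largest output matches the target position.
  public static double Accuracy(double[][] computed, double[][] targets) {
    int numCorrect = 0;
    for (int i = 0; i < computed.Length; ++i)
      if (ArgMax(computed[i]) == ArgMax(targets[i])) ++numCorrect;
    return (numCorrect * 1.0) / computed.Length;
  }

  public static void Main() {
    double[][] computed = new double[][] {
      new double[] { 0.2, 0.7, 0.1 },
      new double[] { 0.6, 0.3, 0.1 } };
    double[][] targets = new double[][] {
      new double[] { 0, 1, 0 },
      new double[] { 0, 0, 1 } };
    Console.WriteLine("MS error = " +
      MeanSquaredError(computed, targets).ToString("F4"));
    Console.WriteLine("Accuracy = " +
      Accuracy(computed, targets).ToString("F4"));
  }
}
```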

## Deep Neural Network Gradients and Weights

Each weight and bias in a DNN has an associated gradient value. A gradient is a calculus derivative of the error function and is just a value, such as -1.53, where the sign of the gradient tells you if the associated weight or bias should be increased or decreased to reduce error, and the magnitude of the gradient is proportional to how much the weight or bias should change. For example, suppose one of the weights, w, in a DNN has a value of +4.36, and after a training item is processed, the gradient for the weight, g, is calculated to be +2.50. If the learning rate, lr, is set to 0.10 then the new weight value is:

```
w = w + (lr * g)
  = 4.36 + (0.10 * 2.50)
  = 4.36 + 0.25
  = 4.61
```

So, training a DNN really boils down to finding the gradient for each weight and bias value. As it turns out, calculating the gradients for the weights connecting the last hidden layer nodes to the output layer nodes, and the gradients for the output node biases, is relatively easy even though the underlying math is extraordinarily profound. Expressed in code, the first step is to compute what’s called the output node signal for each output node:

```
for (int k = 0; k < nOutput; ++k) {
  errorSignal = tValues[k] - oNodes[k];  // target minus computed
  derivative = (1 - oNodes[k]) * oNodes[k];  // for softmax
  oSignals[k] = errorSignal * derivative;
}
```

Local variable errorSignal is the difference between the target value (the correct node value from the training data) and the computed output node value. The details can be very tricky. For example, the demo code uses (target - output) but some references use (output - target), which affects whether the associated weight update statement should add or subtract when modifying weights.
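The two sign conventions lead to the same result as long as the update statement matches. The sketch below is a simplified illustration with hypothetical names, and it leaves out the derivative term for brevity: adding with (target - output) and subtracting with (output - target) give the same new weight.

```csharp
using System;

public static class SignConventionSketch {
  // (target - output) convention: the update adds.
  public static double UpdateAdd(double w, double target, double output,
    double input, double learnRate) {
    return w + (learnRate * (target - output) * input);
  }

  // (output - target) convention: the update subtracts.
  public static double UpdateSubtract(double w, double target, double output,
    double input, double learnRate) {
    return w - (learnRate * (output - target) * input);
  }

  public static void Main() {
    double a = UpdateAdd(4.36, 1.0, 0.829, 0.5, 0.10);
    double b = UpdateSubtract(4.36, 1.0, 0.829, 0.5, 0.10);
    Console.WriteLine(a.ToString("F6") + "  " + b.ToString("F6"));
  }
}
```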

Local variable derivative is a calculus derivative (not the same as the gradient, which is also a derivative) of the output activation function, which in this case is the softmax function. In other words, if you use something other than softmax, you’ll have to modify the calculation of the derivative local variable.
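If you want to convince yourself that (1 - y) * y is the right expression, you can compare it with a numeric estimate. The sketch below uses hypothetical names and a central-difference approximation of my choosing; it checks the derivative of one softmax output with respect to its own pre-softmax value, which is the term the demo code uses.

```csharp
using System;

public static class SoftmaxDerivativeSketch {
  public static double[] Softmax(double[] z) {
    double sum = 0.0;
    double[] result = new double[z.Length];
    for (int i = 0; i < z.Length; ++i) {
      result[i] = Math.Exp(z[i]);
      sum += result[i];
    }
    for (int i = 0; i < z.Length; ++i)
      result[i] /= sum;
    return result;
  }

  // Central-difference estimate of d softmax(z)[k] / d z[k].
  public static double NumericDerivative(double[] z, int k) {
    double h = 1.0e-6;
    double[] up = (double[])z.Clone();
    double[] down = (double[])z.Clone();
    up[k] += h;
    down[k] -= h;
    return (Softmax(up)[k] - Softmax(down)[k]) / (2 * h);
  }

  public static void Main() {
    double[] z = new double[] { -0.109, 3.015, 1.202 };
    double[] y = Softmax(z);
    for (int k = 0; k < z.Length; ++k) {
      double formula = (1 - y[k]) * y[k];
      Console.WriteLine("k = " + k + "  formula = " +
        formula.ToString("F6") + "  numeric = " +
        NumericDerivative(z, k).ToString("F6"));
    }
  }
}
```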

After the output node signals have been computed, they can be used to compute the gradients for the hidden-to-output weights:

```
for (int j = 0; j < nHidden[lastLayer]; ++j) {
  for (int k = 0; k < nOutput; ++k) {
    hoGrads[j][k] = hNodes[lastLayer][j] * oSignals[k];
  }
}
```

In words, the gradient for a weight connecting a hidden node to an output node is the value of the hidden node times the output signal of the output node. After the gradient associated with a hidden-to-output weight has been computed, the weight can be updated:

```
for (int j = 0; j < nHidden[lastLayer]; ++j) {
  for (int k = 0; k < nOutput; ++k) {
    double delta = hoGrads[j][k] * learnRate;
    hoWeights[j][k] += delta;
    hoWeights[j][k] += hoPrevWeightsDelta[j][k] * momentum;
    hoPrevWeightsDelta[j][k] = delta;
  }
}
```

First the weight is incremented by delta, which is the value of the gradient times the learning rate. Then the weight is incremented by an additional amount—the product of the previous delta times the momentum factor. Note that using momentum is optional, but almost always done to increase training speed.
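The same update can be followed for a single weight over two steps. This is a sketch with a hypothetical UpdateWeight helper; it reuses the earlier example values (weight 4.36, gradient 2.50, learning rate 0.10) together with the demo's momentum of 0.01, and assumes for illustration that the gradient is 2.50 on both steps. On the first step there is no previous delta, so momentum contributes nothing; on the second step it adds 0.25 * 0.01.

```csharp
using System;

public static class MomentumSketch {
  // Applies one update and remembers the delta for the next call.
  public static double UpdateWeight(double weight, double gradient,
    double learnRate, double momentum, ref double prevDelta) {
    double delta = gradient * learnRate;
    weight += delta;
    weight += prevDelta * momentum;
    prevDelta = delta;
    return weight;
  }

  public static void Main() {
    double prevDelta = 0.0;
    double w = 4.36;
    w = UpdateWeight(w, 2.50, 0.10, 0.01, ref prevDelta);
    Console.WriteLine("After step 1: " + w.ToString("F4"));
    w = UpdateWeight(w, 2.50, 0.10, 0.01, ref prevDelta);
    Console.WriteLine("After step 2: " + w.ToString("F4"));
  }
}
```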

To recap, to update a hidden-to-output weight, you calculate an output node signal, which depends on the difference between target value and computed value, and the derivative of the output node activation function (usually softmax). Next, you use the output node signal and the hidden node value to compute the gradient. Then you use the gradient and the learning rate to compute a delta for the weight, and update the weight using the delta.

Unfortunately, calculating the gradients for the input-to-hidden weights and the hidden-to-hidden weights is much more complicated. A thorough explanation would take pages and pages, but you can get a good idea of the process by examining one part of the code:

```
int lastLayer = nLayers - 1;
for (int j = 0; j < nHidden[lastLayer]; ++j) {
  derivative = (1 + hNodes[lastLayer][j]) *
    (1 - hNodes[lastLayer][j]); // For tanh
  double sum = 0.0;
  for (int k = 0; k < nOutput; ++k) {
    sum += oSignals[k] * hoWeights[j][k];
  }
  hSignals[lastLayer][j] = derivative * sum;
}
```

This code calculates the signals for the last hidden layer nodes—those just before the output nodes. The local variable derivative is the calculus derivative of the hidden layer activation function, tanh in this case. But the hidden signals depend on a sum of products that involves the output node signals. This leads to the “vanishing gradient” problem.

## The Vanishing Gradient Problem

When you use the back-propagation algorithm to train a DNN, during training the gradient values associated with hidden-to-hidden weights quickly become very small or even zero. If a gradient value is zero, then the gradient times the learning rate will be zero, and the weight delta will be zero, and the weight will not change. Even if a gradient doesn’t go to zero, but gets very small, the delta will be tiny and training will slow to a crawl.

The reason gradients quickly head toward zero should be clear if you carefully examine the demo code. Because output node values are coerced to probabilities, they’re all between 0 and 1. This leads to output node signals whose magnitudes are between 0 and 1 (the sign can be negative because the error term is target minus output). The multiplication part of computing the hidden node signals therefore involves repeatedly multiplying values whose magnitudes are between 0 and 1, which will result in smaller and smaller gradients. For example, 0.5 * 0.5 * 0.5 * 0.5 = 0.0625. Additionally, the tanh hidden layer activation function introduces another fraction-times-fraction term.
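The shrinking effect can be reproduced with a stripped-down model. The sketch below is not the demo network: it assumes a chain of hidden layers with one tanh node each, the same hypothetical weight of 0.5 between every pair of layers, and a signal of 1.0 arriving at the last hidden layer. Each step back toward the input multiplies by a tanh derivative and a weight, so the signal gets smaller at every layer.

```csharp
using System;

public static class VanishingSketch {
  // Returns the signal at each hidden layer; index 0 is nearest the input.
  public static double[] BackSignals(double x, double w, int depth) {
    // Forward pass: h[0] is the input, h[i] = tanh(w * h[i-1]).
    double[] h = new double[depth + 1];
    h[0] = x;
    for (int i = 1; i <= depth; ++i)
      h[i] = Math.Tanh(w * h[i - 1]);

    // Backward pass: start with a signal of 1.0 at the last layer.
    double[] signals = new double[depth];
    double downstream = 1.0;
    for (int i = depth; i >= 1; --i) {
      double derivative = (1 + h[i]) * (1 - h[i]);  // for tanh
      signals[i - 1] = derivative * downstream;
      downstream = signals[i - 1] * w;  // passed back through the weight
    }
    return signals;
  }

  public static void Main() {
    double[] signals = BackSignals(1.0, 0.5, 10);
    for (int i = signals.Length - 1; i >= 0; --i)
      Console.WriteLine("hidden layer " + (i + 1) + "  signal = " +
        signals[i].ToString("F6"));
  }
}
```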

The demo program illustrates the vanishing gradient problem by spying on the gradient associated with the weight from node 0 in the input layer to node 0 in the first hidden layer. The gradient for that weight decreases quickly:

```
epoch = 200 gradient = -0.002536
epoch = 400 gradient = -0.000551
epoch = 600 gradient = -0.000141
epoch = 800 gradient = -0.159148
epoch = 1000 gradient = -0.000009
...
```

The gradient temporarily jumps up at epoch 800 because the demo updates weights and biases after every training item is processed (this is called “stochastic” or “online” training, as opposed to “batch” or “mini-batch” training), and by pure chance the training item processed at epoch 800 led to a larger than normal gradient.

In the early days of DNNs, perhaps 25 to 30 years ago, the vanishing gradient problem was a show-stopper. As computing power increased, the vanishing gradient became less of a problem because training could afford to slow down a bit. But with the rise of very deep networks, with hundreds or even thousands of hidden layers, the problem resurfaced.

Many techniques have been developed to tackle the vanishing gradient problem. One approach is to use the rectified linear unit function (ReLU) instead of the tanh function for hidden layer activation. Another approach is to use different learning rates for different layers—larger rates for layers closer to the input layer. And the use of GPUs for deep learning is now the norm. A radical approach is to avoid back-propagation altogether, and instead use an optimization algorithm that doesn’t require gradients, such as particle swarm optimization.
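To see why ReLU helps, compare the two derivatives directly. This sketch uses hypothetical helper names and takes the pre-activation sum as its argument (the demo code computes the tanh derivative from the node value instead, which is equivalent). For large positive sums the tanh derivative is close to zero, while the ReLU derivative stays at 1.

```csharp
using System;

public static class ActivationSketch {
  // Derivative of tanh(z), written in the same (1 + t)(1 - t) form.
  public static double TanhDerivative(double z) {
    double t = Math.Tanh(z);
    return (1 + t) * (1 - t);
  }

  // Derivative of ReLU: 1 for positive sums, 0 otherwise.
  public static double ReluDerivative(double z) {
    return z > 0.0 ? 1.0 : 0.0;
  }

  public static void Main() {
    double[] sums = new double[] { 0.5, 2.0, 4.0 };
    for (int i = 0; i < sums.Length; ++i)
      Console.WriteLine("sum = " + sums[i].ToString("F1") +
        "  tanh' = " + TanhDerivative(sums[i]).ToString("F6") +
        "  relu' = " + ReluDerivative(sums[i]).ToString("F1"));
  }
}
```

Note that ReLU has its own failure mode: the derivative is 0 for negative sums, so a node that always receives negative sums stops learning.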

## Wrapping Up

The term deep neural network most often refers to the type of network described in this article—a fully connected network with multiple hidden layers. But there are many other types of deep neural networks. Convolutional neural networks are very good at image classification. Long short-term memory networks are extremely good at natural language processing. Most of the variations of deep neural networks use some form of back-propagation and are subject to the vanishing gradient problem.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee and Adith Swaminathan