August 2017

Volume 32 Number 8

# Deep Neural Network IO Using C#

Many of the recent advances in machine learning (making predictions using data) have been realized using deep neural networks. Examples include speech recognition in Microsoft Cortana and Apple Siri, and the image recognition that helps enable self-driving automobiles.

The term deep neural network (DNN) is general and there are several specific variations, including recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The most basic form of a DNN, which I explain in this article, doesn’t have a special name, so I’ll refer to it just as a DNN.

This article introduces DNNs and gives you a concrete demo program to experiment with, which will help you understand the literature on DNNs. I won’t present code that can be used directly in a production system, but the code can be extended to create such a system, as I’ll explain. Even if you never intend to implement a DNN, you might find the explanation of how they work interesting for its own sake.

A DNN is best explained visually. Take a look at Figure 1. The deep network has two input nodes, on the left, with values (1.0, 2.0). There are three output nodes on the right, with values (0.3269, 0.3333, 0.3398). You can think of a DNN as a complex math function that typically accepts two or more numeric input values and returns one or more numeric output values.

Figure 1 A Basic Deep Neural Network

The DNN shown might correspond to a problem where the goal is to predict the political party affiliation (Democrat, Republican, Other) of a person based on age and income, where the input values are scaled in some way. If Democrat is encoded as (1,0,0) and Republican is encoded as (0,1,0) and Other is encoded as (0,0,1), then the DNN in Figure 1 predicts Other for someone with age = 1.0 and income = 2.0 because the last output value (0.3398) is the largest.
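The decoding step described above (pick the largest output value and map its position back to a category) can be sketched in a few lines of C#. The class name PredictionDecoder and its methods are hypothetical names used only for illustration; they aren't part of the demo program.

```csharp
using System;

// Hypothetical helper, not part of the demo program.
public static class PredictionDecoder
{
  // Labels in the same order as the encoding described in the text:
  // (1,0,0) = Democrat, (0,1,0) = Republican, (0,0,1) = Other.
  private static readonly string[] Labels =
    { "Democrat", "Republican", "Other" };

  // Returns the index of the largest value (the first one wins on a tie).
  public static int ArgMax(double[] values)
  {
    if (values == null || values.Length == 0)
      throw new ArgumentException("values must be non-empty");
    int best = 0;
    for (int i = 1; i < values.Length; ++i)
      if (values[i] > values[best]) best = i;
    return best;
  }

  // Maps the output node values to the predicted label.
  public static string Decode(double[] outputs)
  {
    if (outputs == null || outputs.Length != Labels.Length)
      throw new ArgumentException("Expected one output per label");
    return Labels[ArgMax(outputs)];
  }
}
```

Calling PredictionDecoder.Decode(new double[] { 0.3269, 0.3333, 0.3398 }) returns "Other", which matches the prediction for the network in Figure 1.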

A regular neural network has a single hidden layer of processing nodes. A DNN has two or more hidden layers and can handle very difficult prediction problems. Specialized types of DNNs, such as RNNs and CNNs, also have multiple layers of processing nodes, but more complicated connection architectures, as well.

The DNN in Figure 1 has three hidden layers of processing nodes. The first hidden layer has four nodes, and the second and third hidden layers each have two nodes. Each long arrow pointing from left to right represents a numeric constant called a weight. If nodes are zero-base indexed with [0] at the top of the figure, then the weight connecting input[0] to hidden[0][0] (layer 0, node 0) has value 0.01 and the weight connecting input[1] to hidden[0][3] (layer 0, node 3) has value 0.08 and so on. There are 26 node-node weight values.

Each of the eight hidden and three output nodes has a small arrow that represents a numeric constant called a bias. For example, hidden[2][0] has a bias value of 0.33 and output[1] has a bias value of 0.36. Not all of the weights and bias values are labeled in the diagram, but because the values are sequential between 0.01 and 0.37, you can easily determine the value of a non-labeled weight or bias.

In the sections that follow, I explain how the DNN input-output mechanism works and show how to implement it. The demo program is coded using C#, but you shouldn’t have too much trouble refactoring the code to another language, such as Python or JavaScript, if you wish to do so. The demo program is too long to present in its entirety in this article, but the complete program is available in the accompanying code download.

## The Demo Program

A good way to see where this article is headed is to examine the screenshot of the demo program in Figure 2. The demo corresponds to the DNN shown in Figure 1 and illustrates the input-output mechanism by displaying the values of the 13 nodes in the network. The demo code that generated the output begins with the code shown in Figure 3.

Figure 2 Basic Deep Neural Network Demo Run

Figure 3 Beginning of Output-Generating Code

    using System;
    namespace DeepNetInputOutput
    {
      class DeepInputOutputProgram
      {
        static void Main(string[] args)
        {
          Console.WriteLine("Begin deep net IO demo");
          Console.WriteLine("Creating a 2-(4-2-2)-3 deep network");
          int numInput = 2;
          int[] numHidden = new int[] { 4, 2, 2 };
          int numOutput = 3;
          DeepNet dn = new DeepNet(numInput, numHidden, numOutput);

Notice that the demo program uses only plain C# with no namespaces except for System. The DNN is created by passing the number of nodes in each layer to a DeepNet program-defined class constructor. The number of hidden layers, 3, is passed implicitly as the number of items in the numHidden array. An alternative design is to pass the number of hidden layers explicitly.

The values of the 26 weights and the 11 biases are set like so:

    int nw = DeepNet.NumWeights(numInput, numHidden, numOutput);
    Console.WriteLine("Setting weights and biases to 0.01 to " +
      (nw/100.0).ToString("F2") );
    double[] wts = new double[nw];
    for (int i = 0; i < wts.Length; ++i)
      wts[i] = (i + 1) * 0.01;
    dn.SetWeights(wts);

The total number of weights and biases is calculated using a static class method NumWeights. If you refer back to Figure 1, you can see that because each node is connected to all nodes in the layer to the right, the number of weights is (2*4) + (4*2) + (2*2) + (2*3) = 8 + 8 + 4 + 6 = 26. Because there’s one bias for each hidden and output node, the total number of biases is 4 + 2 + 2 + 3 = 11.

An array named wts is instantiated with 37 cells and then the values are set to 0.01 through 0.37. These values are inserted into the DeepNet object using the SetWeights method. In a realistic, non-demo DNN, the values of the weights and biases would be determined using a set of data that has known input values and known, correct output values. This is called training the network. The most common training algorithm is called back-propagation.

The Main method of the demo program concludes with:

          ...
          Console.WriteLine("Computing output for [1.0, 2.0] ");
          double[] xValues = new double[] { 1.0, 2.0 };
          dn.ComputeOutputs(xValues);
          dn.Dump(false);
          Console.WriteLine("End demo");
        } // Main
      } // Class Program

Method ComputeOutputs accepts an array of input values and then uses the input-output mechanism, which I’ll explain shortly, to calculate and store the values of the output nodes. The Dump helper method displays the values of the 13 nodes, and the "false" argument means to not display the values of the 37 weights and biases.

## The Input-Output Mechanism

The input-output mechanism for a DNN is best explained with a concrete example. The first step is to use the values in the input nodes to calculate the values of the nodes in the first hidden layer. The value of the top-most hidden node in the first hidden layer is:
tanh( (1.0)(0.01) + (2.0)(0.05) + 0.27 ) =
tanh(0.38) = 0.3627

In words, "compute the sum of the products of each input node and its associated weight, add the bias value, then take the hyperbolic tangent of the sum." The hyperbolic tangent, abbreviated tanh, is called the activation function. The tanh function accepts any value from negative infinity to positive infinity, and returns a value between -1.0 and +1.0. Important alternative activation functions include the logistic sigmoid and rectified linear (ReLU) functions, which are outside the scope of this article.
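The calculation above can be checked with a short, stand-alone sketch that computes all four nodes of the first hidden layer. The class and method names here (FirstHiddenLayerCheck, ComputeLayer, DemoFirstLayer) are hypothetical and aren't part of the demo program; the weights and biases are the Figure 1 values, with weights[i][j] connecting input i to hidden node j.

```csharp
using System;

// Hypothetical stand-alone check, not part of the demo program.
public static class FirstHiddenLayerCheck
{
  // Computes tanh(sum of input * weight, plus bias) for every node in a layer.
  // weights[i][j] connects input i to node j.
  public static double[] ComputeLayer(double[] inputs,
    double[][] weights, double[] biases)
  {
    double[] result = new double[biases.Length];
    for (int j = 0; j < biases.Length; ++j)
    {
      double sum = biases[j];                  // start with the bias
      for (int i = 0; i < inputs.Length; ++i)
        sum += inputs[i] * weights[i][j];      // sum of products
      result[j] = Math.Tanh(sum);              // activation
    }
    return result;
  }

  // Uses the Figure 1 inputs, weights 0.01 to 0.08, and biases 0.27 to 0.30.
  public static double[] DemoFirstLayer()
  {
    double[] inputs = { 1.0, 2.0 };
    double[][] weights = {
      new double[] { 0.01, 0.02, 0.03, 0.04 },  // from input[0]
      new double[] { 0.05, 0.06, 0.07, 0.08 }   // from input[1]
    };
    double[] biases = { 0.27, 0.28, 0.29, 0.30 };
    return ComputeLayer(inputs, weights, biases);
  }
}
```

Rounded to four decimals, DemoFirstLayer returns (0.3627, 0.3969, 0.4301, 0.4621). The first value is the tanh(0.38) result shown above, and all four are the values fed into the second hidden layer.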

The values of the nodes in the remaining hidden layers are calculated in exactly the same way. For example, hidden[1][0] is:
tanh( (0.3627)(0.09) + (0.3969)(0.11) + (0.4301)(0.13) + (0.4621)(0.15) + 0.31 ) =
tanh(0.5115) = 0.4711
And hidden[2][0] is:
tanh( (0.4711)(0.17) + (0.4915)(0.19) + 0.33 ) =
tanh(0.5035) = 0.4649

The values of the output nodes are calculated using a different activation function, called softmax. The preliminary, pre-activation sum-of-products plus bias step is the same:
pre-activation output[0] =
(0.4649)(0.21) + (0.4801)(0.24) + 0.35 =
0.5628
pre-activation output[1] =
(0.4649)(0.22) + (0.4801)(0.25) + 0.36 =
0.5823
pre-activation output[2] =
(0.4649)(0.23) + (0.4801)(0.26) + 0.37 =
0.6017

The softmax of three arbitrary values, x, y, z is:
softmax(x) = e^x / (e^x + e^y + e^z)
softmax(y) = e^y / (e^x + e^y + e^z)
softmax(z) = e^z / (e^x + e^y + e^z)

where e is Euler’s number, approximately 2.718282. So, for the DNN in Figure 1, the final output values are:

output[0] = e^0.5628 / (e^0.5628 + e^0.5823 + e^0.6017) = 0.3269
output[1] = e^0.5823 / (e^0.5628 + e^0.5823 + e^0.6017) = 0.3333
output[2] = e^0.6017 / (e^0.5628 + e^0.5823 + e^0.6017) = 0.3398

The purpose of the softmax activation function is to coerce the output values to sum to 1.0 so that they can be interpreted as probabilities and map to a categorical value. In this example, because the third output value is the largest, whatever categorical value that was encoded as (0,0,1) would be the predicted category for inputs = (1.0, 2.0).

## Implementing a DeepNet Class

To create the demo program, I launched Visual Studio and selected the C# Console Application template and named it DeepNetInputOutput. I used Visual Studio 2015, but the demo has no significant .NET dependencies, so any version of Visual Studio will work.

After the template code loaded, in the Solution Explorer window, I right-clicked on file Program.cs and renamed it to the more descriptive DeepNetInputOutputProgram.cs and allowed Visual Studio to automatically rename class Program for me. At the top of the editor window, I deleted all unnecessary using statements, leaving just the one that references the System namespace.

I implemented the demo DNN as a class named DeepNet. The class definition begins with:

    public class DeepNet
    {
      public static Random rnd;
      public int nInput;
      public int[] nHidden;
      public int nOutput;
      public int nLayers;
      ...

All class members are declared with public scope for simplicity. The static Random object member named rnd is used by the DeepNet class to initialize weights and biases to small random values (which are then overwritten with values 0.01 to 0.37). Members nInput and nOutput are the number of input and output nodes. Array member nHidden holds the number of nodes in each hidden layer, so the number of hidden layers is given by the Length property of the array, which is stored into member nLayers for convenience. The class definition continues:

      public double[] iNodes;
      public double[][] hNodes;
      public double[] oNodes;

A deep neural network implementation has many design choices. Array members iNodes and oNodes hold the input and output values, as you’d expect. Array-of-arrays member hNodes holds the hidden node values. An alternative design is to store all nodes in a single array-of-arrays structure nnNodes, where in the demo nnNodes[0] is an array of input node values and nnNodes[4] is an array of output node values.
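As a rough illustration of the alternative single-structure design, the following sketch allocates one array per layer. The class name SingleNodeStore and method MakeNodes are hypothetical; the demo program itself uses the separate iNodes, hNodes and oNodes members.

```csharp
// Hypothetical sketch of the alternative design, not part of the demo program.
public static class SingleNodeStore
{
  // Allocates one array per layer: input, each hidden layer, then output.
  public static double[][] MakeNodes(int numInput, int[] numHidden,
    int numOutput)
  {
    double[][] nnNodes = new double[numHidden.Length + 2][];
    nnNodes[0] = new double[numInput];                    // input layer
    for (int h = 0; h < numHidden.Length; ++h)
      nnNodes[h + 1] = new double[numHidden[h]];          // hidden layers
    nnNodes[nnNodes.Length - 1] = new double[numOutput];  // output layer
    return nnNodes;
  }
}
```

For the 2-(4-2-2)-3 demo network, MakeNodes(2, new int[] { 4, 2, 2 }, 3) returns five arrays with lengths 2, 4, 2, 2 and 3, so nnNodes[4] holds the output nodes. The appeal of this design is that every layer can be processed by the same loop; the cost is that the code no longer distinguishes input, hidden and output nodes by name.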

The node-to-node weights are stored using these data structures:

      public double[][] ihWeights;
      public double[][][] hhWeights;
      public double[][] hoWeights;

Member ihWeights is an array-of-arrays-style matrix that holds the input-to-first-hidden-layer weights. Member hoWeights is an array-of-arrays-style matrix that holds the weights connecting the last hidden layer nodes to the output nodes. Member hhWeights is an array where each cell points to an array-of-arrays matrix that holds the hidden-to-hidden weights. For example, hhWeights[0][3][1] holds the weight connecting hidden node [3] in hidden layer [0] to hidden node [1] in hidden layer [0+1]. These data structures are the heart of the DNN input-output mechanism and are a bit tricky. A conceptual diagram of them is shown in Figure 4.

Figure 4 Weights and Biases Data Structures

The last two class members hold the hidden node biases and the output node biases:

      public double[][] hBiases;
      public double[] oBiases;

As much as any software system I work with, DNNs have many alternative data structure designs, and having a sketch of these data structures is essential when writing input-output code.
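To make the shapes of these structures concrete, here is a minimal sketch of how they might be allocated. The class DeepNetStorage and its methods are hypothetical; the constructor in the code download may allocate storage differently. The sketch assumes there is at least one hidden layer.

```csharp
// Hypothetical allocation helpers, not taken from the demo program.
public static class DeepNetStorage
{
  // Allocates a rows x cols array-of-arrays matrix of zeros.
  // Suitable for ihWeights (nInput x nHidden[0]) and
  // hoWeights (nHidden[last] x nOutput).
  public static double[][] MakeMatrix(int rows, int cols)
  {
    double[][] result = new double[rows][];
    for (int i = 0; i < rows; ++i)
      result[i] = new double[cols];
    return result;
  }

  // hhWeights[h] is the matrix connecting hidden layer h to hidden layer h+1,
  // so there is one fewer matrix than there are hidden layers.
  public static double[][][] MakeHiddenToHidden(int[] numHidden)
  {
    double[][][] hh = new double[numHidden.Length - 1][][];
    for (int h = 0; h < hh.Length; ++h)
      hh[h] = MakeMatrix(numHidden[h], numHidden[h + 1]);
    return hh;
  }

  // hBiases[h] holds one bias per node in hidden layer h.
  public static double[][] MakeHiddenBiases(int[] numHidden)
  {
    double[][] hb = new double[numHidden.Length][];
    for (int h = 0; h < numHidden.Length; ++h)
      hb[h] = new double[numHidden[h]];
    return hb;
  }
}
```

For the demo network, MakeHiddenToHidden(new int[] { 4, 2, 2 }) produces two matrices, a 4x2 and a 2x2, which is why hhWeights[0][3][1] is a valid index but hhWeights[2] is not.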

## Computing the Number of Weights and Biases

To set the weight and bias values, it’s necessary to know how many weights and biases there are. The demo program implements the static method NumWeights to calculate and return this number. Recall that the 2-(4-2-2)-3 demo network has (2*4) + (4*2) + (2*2) + (2*3) = 26 weights and 4 + 2 + 2 + 3 = 11 biases. The key code in method NumWeights, which calculates the number of input-to-hidden, hidden-to-hidden and hidden-to-output weights is:

    int ihWts = numInput * numHidden[0];
    int hhWts = 0;
    for (int j = 0; j < numHidden.Length - 1; ++j) {
      int rows = numHidden[j];
      int cols = numHidden[j + 1];
      hhWts += rows * cols;
    }
    int hoWts = numHidden[numHidden.Length - 1] * numOutput;

Instead of returning the total number of weights and biases as method NumWeights does, you might want to consider returning the number of weights and biases separately, in a two-cell integer array.
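A minimal sketch of that two-cell alternative might look like the following. The class name WeightCounter and method name CountWeightsAndBiases are hypothetical; only the counting logic follows the description above, and the sketch assumes at least one hidden layer.

```csharp
// Hypothetical variant of NumWeights, not part of the demo program.
public static class WeightCounter
{
  // Returns a two-cell array: [0] = number of weights, [1] = number of biases.
  public static int[] CountWeightsAndBiases(int numInput, int[] numHidden,
    int numOutput)
  {
    // Input-to-first-hidden weights
    int weights = numInput * numHidden[0];
    // Hidden-to-hidden weights, one matrix per adjacent pair of layers
    for (int j = 0; j < numHidden.Length - 1; ++j)
      weights += numHidden[j] * numHidden[j + 1];
    // Last-hidden-to-output weights
    weights += numHidden[numHidden.Length - 1] * numOutput;

    // One bias per hidden node and one per output node
    int biases = numOutput;
    for (int j = 0; j < numHidden.Length; ++j)
      biases += numHidden[j];

    return new int[] { weights, biases };
  }
}
```

For the 2-(4-2-2)-3 demo network, CountWeightsAndBiases(2, new int[] { 4, 2, 2 }, 3) returns { 26, 11 }, and the two cells sum to the 37 values that SetWeights expects.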

## Setting Weights and Biases

A non-demo DNN typically initializes all weights and biases to small random values. The demo program sets the 26 weights to 0.01 through 0.26, and the biases to 0.27 through 0.37 using class method SetWeights. The definition begins with:

    public void SetWeights(double[] wts)
    {
      int nw = NumWeights(this.nInput, this.nHidden, this.nOutput);
      if (wts.Length != nw)
        throw new Exception("Bad wts[] length in SetWeights()");
      int ptr = 0;
      ...

Input parameter wts holds the values for the weights and biases; the method throws an exception if wts doesn’t have the expected Length. Variable ptr points into the wts array. Beyond this check, the demo program has very little error checking in order to keep the main ideas as clear as possible. The input-to-first-hidden-layer weights are set like so:

    for (int i = 0; i < nInput; ++i)
      for (int j = 0; j < hNodes[0].Length; ++j)
        ihWeights[i][j] = wts[ptr++];

Next, the hidden-to-hidden weights are set:

    for (int h = 0; h < nLayers - 1; ++h)
      for (int j = 0; j < nHidden[h]; ++j)  // From
        for (int jj = 0; jj < nHidden[h+1]; ++jj)  // To
          hhWeights[h][j][jj] = wts[ptr++];

If you’re not accustomed to working with multi-dimensional arrays, the indexing can be quite tricky. A diagram of the weights and biases data structures is essential (well, for me, anyway). The last-hidden-layer-to-output weights are set like this:

    int hi = this.nLayers - 1;
    for (int j = 0; j < this.nHidden[hi]; ++j)
      for (int k = 0; k < this.nOutput; ++k)
        hoWeights[j][k] = wts[ptr++];

This code uses the fact that if there are nLayers hidden layers (3 in the demo), then the index of the last hidden layer is nLayers-1. Method SetWeights concludes by setting the hidden node biases and the output node biases:

      ...
      for (int h = 0; h < nLayers; ++h)
        for (int j = 0; j < this.nHidden[h]; ++j)
          hBiases[h][j] = wts[ptr++];

      for (int k = 0; k < nOutput; ++k)
        oBiases[k] = wts[ptr++];
    }
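For a non-demo network, the array passed to SetWeights would typically hold small random values rather than 0.01 through 0.37. The following is a minimal sketch of one way to generate such an array. The class WeightInitializer, its method, and the suggested range of -0.01 to +0.01 are assumptions for illustration; the demo program's own initialization code may differ.

```csharp
using System;

// Hypothetical helper, not part of the demo program.
public static class WeightInitializer
{
  // Returns 'count' pseudo-random values drawn uniformly between lo and hi.
  // The seed makes the results reproducible from run to run.
  public static double[] MakeRandomWeights(int count, double lo, double hi,
    int seed)
  {
    Random rnd = new Random(seed);
    double[] wts = new double[count];
    for (int i = 0; i < count; ++i)
      wts[i] = (hi - lo) * rnd.NextDouble() + lo;  // scale [0,1) to the range
    return wts;
  }
}
```

With the demo network, a call such as dn.SetWeights(WeightInitializer.MakeRandomWeights(37, -0.01, 0.01, 0)) would fill all 26 weights and 11 biases in the same order that SetWeights reads them.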

## Computing the Output Values

The definition of class method ComputeOutputs begins with:

    public double[] ComputeOutputs(double[] xValues)
    {
      for (int i = 0; i < nInput; ++i)
        iNodes[i] = xValues[i];
      ...

The input values are in array parameter xValues. Class member nInput holds the number of input nodes and is set in the class constructor. The first nInput values in xValues are copied into the input nodes, so xValues is assumed to have at least nInput values. Next, the current values in the hidden and output nodes are zeroed out:

    for (int h = 0; h < nLayers; ++h)
      for (int j = 0; j < nHidden[h]; ++j)
        hNodes[h][j] = 0.0;

    for (int k = 0; k < nOutput; ++k)
      oNodes[k] = 0.0;

The idea here is that the sum of products term will be accumulated directly into the hidden and output nodes, so these nodes must be explicitly reset to 0.0 for each method call. An alternative is to declare and use local arrays with names like hSums[][] and oSums[]. Next, the values of the nodes in the first hidden layer are calculated:

    for (int j = 0; j < nHidden[0]; ++j) {
      for (int i = 0; i < nInput; ++i)
        hNodes[0][j] += ihWeights[i][j] * iNodes[i];
      hNodes[0][j] += hBiases[0][j];  // Add the bias
      hNodes[0][j] = Math.Tanh(hNodes[0][j]);  // Activation
    }

The code is pretty much a one-to-one mapping of the mechanism described earlier. The built-in Math.Tanh is used for hidden node activation. As I mentioned, important alternatives are the logistic sigmoid function and the rectified linear unit (ReLU) function, which I’ll explain in a future article. Next, the remaining hidden layer nodes are calculated:

    for (int h = 1; h < nLayers; ++h) {
      for (int j = 0; j < nHidden[h]; ++j) {
        for (int jj = 0; jj < nHidden[h-1]; ++jj)
          hNodes[h][j] += hhWeights[h-1][jj][j] * hNodes[h-1][jj];
        hNodes[h][j] += hBiases[h][j];
        hNodes[h][j] = Math.Tanh(hNodes[h][j]);
      }
    }

This is the trickiest part of the demo program, mostly due to the multiple array indexes required. Next, the pre-activation sum-of-products are calculated for the output nodes:

    for (int k = 0; k < nOutput; ++k) {
      for (int j = 0; j < nHidden[nLayers - 1]; ++j)
        oNodes[k] += hoWeights[j][k] * hNodes[nLayers - 1][j];
      oNodes[k] += oBiases[k];  // Add bias
    }

Method ComputeOutputs concludes by applying the softmax activation function, returning the computed output values in a separate array:

      ...
      double[] retResult = Softmax(oNodes);
      for (int k = 0; k < nOutput; ++k)
        oNodes[k] = retResult[k];
      return retResult;
    }

The Softmax method is a static helper. See the accompanying code download for details. Notice that because softmax activation requires all the values that will be activated (in the denominator term), it’s more efficient to compute all softmax values at once instead of separately. The final output values are stored into the output nodes and are also returned separately for calling convenience.
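Because the Softmax helper isn't listed in this article, here is a minimal sketch of what such a method might look like. This isn't necessarily the implementation in the code download, and the class name SoftmaxHelper is hypothetical. The sketch subtracts the largest value before exponentiating, a common technique that leaves the result mathematically unchanged but avoids arithmetic overflow for large inputs.

```csharp
using System;

// Hypothetical sketch; the code download's Softmax may differ.
public static class SoftmaxHelper
{
  // Returns e^(v[i]) / sum of all e^(v[j]), computed for every i at once.
  public static double[] Softmax(double[] values)
  {
    if (values == null || values.Length == 0)
      throw new ArgumentException("values must be non-empty");

    // Find the largest value; subtracting it keeps Math.Exp from overflowing.
    double max = values[0];
    for (int i = 1; i < values.Length; ++i)
      if (values[i] > max) max = values[i];

    // Compute the shared denominator once.
    double sum = 0.0;
    double[] result = new double[values.Length];
    for (int i = 0; i < values.Length; ++i)
    {
      result[i] = Math.Exp(values[i] - max);
      sum += result[i];
    }

    // Normalize so the results sum to 1.0.
    for (int i = 0; i < values.Length; ++i)
      result[i] /= sum;
    return result;
  }
}
```

Applied to the demo's pre-activation sums, SoftmaxHelper.Softmax(new double[] { 0.5628, 0.5823, 0.6017 }) gives approximately (0.3269, 0.3333, 0.3398), the output values shown in Figure 1.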

## Wrapping Up

There has been enormous research activity and many breakthroughs related to deep neural networks over the past few years. Specialized DNNs such as convolutional neural networks, recurrent neural networks, LSTM neural networks and residual neural networks are very powerful but very complex. In my opinion, understanding how basic DNNs operate is essential for understanding the more complex variations.

In a future article, I’ll explain in detail how to use the back-propagation algorithm (arguably the most famous and important algorithm in machine learning) to train a basic DNN. Back-propagation, or at least some form of it, is used to train most DNN variations, too. This explanation will introduce the concept of the vanishing gradient, which in turn will explain the design and motivation of many of the DNNs now being used for very sophisticated prediction systems.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Li Deng, Pingjun Hu, Po-Sen Huang, Kirk Li, Alan Liu, Ricky Loynd, Baochen Sun, Henrik Turbell.