April 2019

Volume 34 Number 4

[Artificially Intelligent]

How Do Neural Networks Learn?

By Frank La Vigne | April 2019

In my previous column (“A Closer Look at Neural Networks,” msdn.com/magazine/mt833269), I explored the basic structure of neural networks and created one from scratch with Python. After reviewing the basic structures common to all neural networks, I created a sample framework for computing the weighted sums and output values. Neurons themselves are simple and perform basic mathematical functions to normalize their outputs between 0 and 1 or -1 and 1. They become powerful, however, when they’re connected to each other. Neurons are arranged in layers in a neural network and each neuron passes on values to the next layer. Input values cascade forward through the network and affect the output in a process called forward propagation.

However, exactly how do neural networks learn? What is the process and what happens inside a neural network when it learns? In the previous column, the focus was on the forward propagation of values. For supervised learning scenarios, neural networks can leverage a process called backpropagation.

Backpropagation, Loss and Epochs

Recall that each neuron in a neural network takes in input values multiplied by a weight to represent the strength of that connection. Backpropagation discovers the correct weights that should be applied to nodes in a neural network by comparing the network’s current outputs with the desired, or correct, outputs. The difference between the desired output and the current output is computed by the Loss, or Cost, function. In other words, the Loss function tells us how accurate our neural network is at making predictions for a given input.

You might be familiar with the loss (error) function used for linear regression in classical statistics, as shown in Figure 1. That loss function provides the average of the squared differences between the correct output values (the y_i) and the computed values, which depend on the slope (m) and the y-intercept (b) of the regression line. The loss function for a neural network classifier uses the same general principle: the difference between correct output values and computed output values. Three loss functions are common for neural networks: mean squared error, cross entropy error and binary cross entropy error. The demo program in this article uses cross entropy error, which is a complex topic in its own right.

Figure 1 The Cost, or Loss, Function
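
As a rough illustration of the loss described for Figure 1, the following sketch computes the mean squared error for a candidate regression line. The function name and the sample points are made up for this example.

```python
def mean_squared_error(xs, ys, m, b):
    """Average of the squared differences between each correct value y
    and the value computed by the line m * x + b."""
    squared_errors = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical sample points that lie exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_loss = mean_squared_error(xs, ys, m=2.0, b=1.0)  # the line fits exactly
worse_loss = mean_squared_error(xs, ys, m=1.0, b=0.0)    # the line fits poorly
print(perfect_loss, worse_loss)
```

The better the slope and intercept match the data, the closer the loss gets to zero.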

The algorithm then adjusts each weight to minimize the difference between the computed value and the correct value. The term “backpropagation” comes from the fact that the algorithm goes back and adjusts the weights and biases after computing an answer. The smaller the Loss for a network, the more accurate it becomes. The learning process, then, can be quantified as minimizing the loss function’s output. Each full pass through the training data, made up of cycles of forward propagation and backpropagation correction to lower the Loss, is called an epoch. Simply put, backpropagation is about finding the best input weights and biases to get a more accurate output or “minimize the Loss.” If you’re thinking this sounds computationally expensive, it is. In fact, compute power was insufficient until relatively recently to make this process practical for wide use.
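
To make the cycle concrete, here’s a toy sketch that trains a single sigmoid neuron over many epochs. It’s a simplification under stated assumptions, not the full algorithm for a multi-layer network: the training set (the logical OR function), the squared-error loss, the learning rate and the epoch count are all choices made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy training set (an assumption for this sketch): the logical OR function
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0, 1, 1, 1]

weights = [0.0, 0.0]
bias = 0.0
learning_rate = 0.5
losses = []  # average loss, recorded once per epoch

for epoch in range(2000):
    total_loss = 0.0
    for (x1, x2), target in zip(inputs, targets):
        # Forward propagation: weighted sum, then activation
        output = sigmoid(weights[0] * x1 + weights[1] * x2 + bias)
        error = output - target
        total_loss += error ** 2
        # Backward step for one neuron: the chain rule gives the slope of
        # the squared error with respect to the weighted sum
        delta = 2.0 * error * output * (1.0 - output)
        weights[0] -= learning_rate * delta * x1
        weights[1] -= learning_rate * delta * x2
        bias -= learning_rate * delta
    losses.append(total_loss / len(inputs))

predictions = [round(sigmoid(weights[0] * x1 + weights[1] * x2 + bias))
               for x1, x2 in inputs]
print(losses[0], losses[-1], predictions)
```

The loss recorded for the last epoch should be far smaller than the loss for the first, which is the whole point of the exercise.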

Gradient Descent, Learning Rate and Stochastic Gradient Descent

How are the weights adjusted in each epoch? Are they randomly adjusted or is there a process? This is where a lot of beginners start to get confused, as there are a lot of unfamiliar terms thrown around, like gradient descent and learning rate. However, it’s really not that complicated when explained properly. The Loss function reduces all the complexity of a neural network down to a single number that indicates how far off the neural network’s answer is from the desired answer. Thinking of the neural network’s output as a single number allows us to think about its performance in simple terms. The goal is to find the set of weights that results in the lowest loss value, or the minimum.

Plotting this on a graph, as in Figure 2, shows that the Loss function has its own curve and gradients that can be used as a guide to adjust the weights. The slope of the Loss function’s curve serves as a guide and points to the minimum value. The goal is to locate the minimum across the entire curve, which represents the weights at which the neural network is most accurate.

Figure 2 Graph of the Loss Function with a Simple Curve

In Figure 2, the Loss falls to a low point as the weights increase and then starts to climb again. The slope of the line reveals the direction to that lowest point on the curve, which represents the lowest loss. When the slope is negative, add to the weights. When the slope is positive, subtract from the weights. The size of each adjustment is controlled by the Learning Rate, which scales the slope to produce the amount added to or subtracted from the weights. Determining an ideal learning rate is as much an art as it is a science. Too large and the algorithm could overshoot the minimum. Too small and the training will take too long. This process is called Gradient Descent. Readers who are more familiar with the intricacies of calculus will see this process for what it is: using the derivative of the Loss function to decide which way to step.
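
Here’s a minimal sketch of gradient descent on a simple curve like the one in Figure 2. The loss function (w - 3)^2, the starting weight and both learning rates are assumptions picked for illustration, with a single weight standing in for all the weights of a network.

```python
def loss(w):
    # A simple bowl-shaped curve whose minimum sits at w = 3
    return (w - 3.0) ** 2

def slope(w):
    # Derivative of the loss with respect to the weight
    return 2.0 * (w - 3.0)

def gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        # Negative slope: w increases. Positive slope: w decreases.
        w = w - learning_rate * slope(w)
    return w

good = gradient_descent(start=0.0, learning_rate=0.1, steps=50)
bad = gradient_descent(start=0.0, learning_rate=1.1, steps=50)
print(good, loss(good))
print(bad, loss(bad))
```

With the smaller learning rate the weight settles very close to 3. With the larger one every step overshoots the minimum by more than the last, and the weight runs away from it.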

Rarely, however, is the graph of a Loss function as simple as the one in Figure 2. In practice, there are many peaks and valleys. The challenge then becomes how to find the lowest of the low points (the global minimum) and not get fooled by low points nearby (local minima). A common approach in this situation is to start from randomly chosen weights and, at each step, estimate the slope from a randomly selected sample (or small batch) of the training data instead of the full dataset, and then proceed with the gradient descent process previously described. The randomness is where the term “Stochastic Gradient Descent” comes from, and the resulting noise can help the algorithm escape shallow local minima. For a great explanation of the mathematical concepts on this process, watch the YouTube video, “Gradient Descent, How Neural Networks Learn | Deep Learning, Chapter 2,” at youtu.be/IHZwWFHWa-w.
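
In practice, stochastic gradient descent is usually implemented by shuffling the training data and estimating the slope from one small batch at a time. The sketch below does that for a simple line fit rather than a neural network. The function name, data, batch size, learning rate and epoch count are all assumptions made for this example.

```python
import random

def sgd_fit_line(xs, ys, learning_rate=0.1, epochs=200, batch_size=4, seed=0):
    """Fit y = m * x + b by mini-batch stochastic gradient descent
    on the mean squared error."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Slope of the batch's mean squared error with respect to m and b
            grad_m = sum(-2.0 * xs[i] * (ys[i] - (m * xs[i] + b)) for i in batch) / len(batch)
            grad_b = sum(-2.0 * (ys[i] - (m * xs[i] + b)) for i in batch) / len(batch)
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b
    return m, b

# Made-up data that lies exactly on the line y = 2x + 1
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

m, b = sgd_fit_line(xs, ys)
print(m, b)
```

Each update sees only four points, yet the estimates should still settle close to a slope of 2 and an intercept of 1.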

For the most part, this level of neural network architecture has been largely abstracted away by libraries such as Keras and TensorFlow. As in any software engineering endeavor, knowing the fundamentals always helps when faced with challenges in the field.

Putting Theory to Practice

In the previous column, I created a neural network from scratch to process the MNIST digits. The resulting code base to bootstrap the problem was great at illustrating the inner workings of neural network architectures, but was impractical to bring forward. Many frameworks and libraries now perform the same task with less code.

To get started, open a new Jupyter notebook and enter the following into a blank cell and execute it to import all the required libraries:

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt

Note that the output from this cell states that Keras is using a TensorFlow back end. Because the MNIST neural network example is so common, Keras includes it as part of its API, and even splits the data into a training set and a test set. Write the following code into a new cell and execute it to download the data and read it into the appropriate variables:

# import the data
from keras.datasets import mnist
# read the data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

Once the output indicates that the files are downloaded, use the following code to briefly examine the training and test datasets:

# print the dimensions of each dataset
print(X_train.shape)
print(X_test.shape)

The output should read that the X_train dataset has 60,000 items and the X_test dataset has 10,000 items. Each item is a 28x28 matrix of pixels. To see a particular image from the MNIST data, use MatPlotLib to render an image with the following code:

# index 10 is an assumption; any index into X_train works
plt.imshow(X_train[10])

The output should look like a handwritten “3.” To see what’s inside the testing dataset, enter the following code:

# index 10 is an assumption; any index into X_test works
plt.imshow(X_test[10])

The output shows a zero. Feel free to experiment by changing the index number and the dataset to explore the image datasets.

Shaping the Data

As with any AI or data science project, the input data must be reshaped to fit the needs of the algorithms. The image data needs to be flattened into a one-dimensional vector. As each image is 28x28 pixels, the one-dimensional vector will be 1 by (28x28), or 1 by 784. Enter the following code into a new cell and execute (note that this will not produce output text):

num_pixels = X_train.shape[1] * X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], num_pixels).astype('float32')
X_test = X_test.reshape(X_test.shape[0], num_pixels).astype('float32')

Pixel values range from zero to 255. In order to use them, you’ll need to normalize them to values between zero and one. Use the following code to do that:

X_train = X_train / 255
X_test = X_test / 255
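
To see what the reshaping and normalizing steps do to a single image, here’s a small sketch in plain Python. The helper names and the tiny 2x3 “image” are made up for illustration. The Keras data holds NumPy arrays, where reshape and division do the same job far more efficiently.

```python
def flatten(image):
    """Turn a 2-D grid of pixels (a list of rows) into one flat list."""
    return [pixel for row in image for pixel in row]

def normalize(pixels, max_value=255):
    """Scale pixel values from the range 0..max_value to the range 0..1."""
    return [pixel / max_value for pixel in pixels]

# A made-up 2x3 "image" standing in for a 28x28 MNIST digit
tiny_image = [[0, 51, 255],
              [102, 204, 153]]

flat = flatten(tiny_image)  # 2 rows of 3 pixels become 6 values
scaled = normalize(flat)    # every value now lies between 0 and 1
print(flat)
print(scaled)
```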

Then enter the following code to take a look at what the data looks like now:

# the first training image (index 0 is an assumption), now a flat vector
X_train[0]

The output reveals an array of 784 values between zero and one.

The task of taking in various images of handwritten digits and determining what number they represent is classification. Before building the model, you’ll need to split the target variables into categories. In this case, you know that there are 10, but you can use the to_categorical function in Keras to determine that automatically. Enter the following code and execute it (the output should display 10):

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
num_classes = y_test.shape[1]
print(num_classes)
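
Conceptually, splitting labels into categories means turning each digit into a vector with a single 1 in the position of its class, often called one-hot encoding. The sketch below mimics that idea in plain Python. The one_hot helper is hypothetical and only approximates what to_categorical does.

```python
def one_hot(labels, num_classes=None):
    """Encode integer class labels as lists holding a single 1.0."""
    if num_classes is None:
        # Infer the number of classes from the largest label seen
        num_classes = max(labels) + 1
    return [[1.0 if position == label else 0.0 for position in range(num_classes)]
            for label in labels]

encoded = one_hot([3, 0, 9])
for row in encoded:
    print(row)
```

Because the largest label here is 9, each encoded row has 10 positions, one per digit.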

Build, Train and Test the Neural Network

Now that the data has been shaped and prepared, it’s time to build out the neural network using Keras. Enter the following code to create a function that creates a sequential neural network with three layers with an input layer of num_pixels (or 784) neurons:

def classification_model():
  model = Sequential()
  model.add(Dense(num_pixels, activation='relu', input_shape=(num_pixels,)))
  model.add(Dense(100, activation='relu'))
  model.add(Dense(num_classes, activation='softmax'))
  model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
  return model

Compare this code to the code from my last column’s “from scratch” methods. You may notice new terms like “relu” or “softmax” referenced in the activation functions. Up until now, I’ve only explored the Sigmoid activation function, but there are several kinds of activation functions. For now, keep in mind that every activation function transforms a neuron’s weighted sum before passing it on: Sigmoid and softmax output values between 0 and 1, while relu outputs zero for negative inputs and passes positive inputs through unchanged.
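
For a feel of what these names mean, here’s a plain-Python sketch of the three activation functions, plus the cross entropy loss that pairs with softmax in the model above. These are textbook definitions written for illustration, not the Keras implementations, and the sample scores are made up.

```python
import math

def sigmoid(x):
    # Squeezes any input into the range 0..1
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs, unchanged for positive inputs
    return max(0.0, x)

def softmax(values):
    # Turns a list of scores into probabilities that sum to 1
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot_target, probabilities):
    # Penalizes a low probability assigned to the correct class
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, probabilities) if t > 0)

probabilities = softmax([1.0, 2.0, 3.0])  # the highest score is for class 2
loss_if_class_2 = categorical_cross_entropy([0, 0, 1], probabilities)
loss_if_class_0 = categorical_cross_entropy([1, 0, 0], probabilities)
print(sigmoid(0.0), relu(-2.0), relu(3.5))
print(probabilities, loss_if_class_2, loss_if_class_0)
```

The loss is small when the correct class receives the highest probability and larger when it doesn’t.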

With all of the infrastructure in place, it’s time to build, train and score the model. Enter the following code into a blank cell and execute it:

model = classification_model()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=2)
scores = model.evaluate(X_test, y_test, verbose=0)

As the neural network runs, note that the loss value drops with each iteration. Accordingly, the accuracy also improves. Also, take note of how long each epoch takes to execute. Once completed, enter the following code to see the accuracy and error percentages:

print('Model Accuracy: {} \n Error: {}'.format(scores[1], 1 - scores[1]))

The output reveals an accuracy of greater than 98 percent and an error of 1.97 percent. Exact figures vary slightly from run to run because the weights are initialized randomly.
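
The accuracy figure is simply the fraction of test images whose most probable predicted class matches the correct label. Here’s a rough sketch of that calculation. The accuracy helper, the three-class probabilities and the labels are all made up for illustration.

```python
def accuracy(predicted_probabilities, true_labels):
    """Fraction of samples whose most probable class equals the true label."""
    correct = 0
    for probabilities, label in zip(predicted_probabilities, true_labels):
        # Pick the class position holding the highest probability
        predicted = max(range(len(probabilities)), key=lambda i: probabilities[i])
        if predicted == label:
            correct += 1
    return correct / len(true_labels)

# Hypothetical softmax outputs for four samples and three classes
sample_predictions = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
sample_labels = [1, 0, 2, 1]  # the last sample is misclassified

sample_accuracy = accuracy(sample_predictions, sample_labels)
sample_error = 1 - sample_accuracy
print(sample_accuracy, sample_error)
```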

Persisting the Model

Now that the model has been trained to a high degree of accuracy, you can save the model for future use to avoid having to train it again. Fortunately, Keras makes this easy. Enter the following code into a new cell and execute it:

model.save('MNIST_classification_model.h5')

This creates a binary file that’s about 8MB in size and contains the optimum values for weights and biases (the network has roughly 695,000 of them, and Keras also saves the optimizer’s state by default). Loading the model is also easy with Keras, like so:

from keras.models import load_model
pretrained_model = load_model('MNIST_classification_model.h5')

This h5 file contains the model and can be deployed along with code to reshape and prepare the input image data. In other words, the lengthy process of training a model needs only to be done once. Referencing the predefined model doesn’t require the computationally expensive process of training and, in the final production system, the neural network can be implemented rapidly.

Wrapping Up

Neural networks can solve problems that have confounded traditional algorithms for decades. As we’ve seen, their simple structure hides their true complexity. Neural networks work by propagating forward inputs, weights and biases. However, it’s the reverse process of backpropagation where the network actually learns by determining the exact changes to make to weights and biases to produce an accurate result.

Learning, in the machine sense, is about minimizing the difference between the actual result and the correct result. This process is tedious and compute-expensive, as evidenced by the time it takes to run through one epoch. Fortunately, this training needs only to be done once and not each time the model is needed. Additionally, I explored using Keras to build out this neural network. While it is possible to write the code needed to build out neural networks from scratch, it’s far simpler to use existing libraries like Keras, which take care of the minute details for you.

Frank La Vigne works at Microsoft as an AI Technology Solutions professional where he helps companies achieve more by getting the most out of their data with analytics and AI. He also co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).

Thanks to the following technical experts for reviewing this article: Andy Leonard
