March 2019

Volume 34 Number 3

# Neural Regression Using PyTorch

The goal of a regression problem is to predict a single numeric value. For example, you might want to predict the price of a house based on its square footage, age, ZIP code and so on. In this article I show how to create a neural regression model using the PyTorch code library. The best way to understand where this article is headed is to take a look at the demo program in Figure 1.

Figure 1 Neural Regression Using a PyTorch Demo Run

The demo program creates a prediction model based on the Boston Housing dataset, where the goal is to predict the median house price in one of 506 towns close to Boston. The data comes from the early 1970s. Each data item has 13 predictor variables, such as crime index of the town, average number of rooms per house in the town and so on. There’s only one output value because the goal is to predict a single numeric value.

The demo loads 404 training items and 102 test items into memory, and then creates a 13-(10-10)-1 neural network. The neural network has two hidden processing layers, each of which has 10 nodes. The number of input and output nodes is determined by the data, but the number of hidden layers and the number of nodes in each are free parameters that must be determined by trial and error.

The demo trains the neural network, meaning the values of the weights and biases that define the behavior of the neural network are computed using the training data, which has known correct input and output values. After training, the demo computes the accuracy of the model on the test data (75.49 percent, 77 out of 102 correct). The test accuracy is a rough measure of how well you’d expect the model to do on new, previously unseen data.

The demo concludes by making a prediction for the first test town. The 13 raw input values are (0.09266, 34.0, . . 8.67). When the neural regression model was trained, normalized data (scaled so all values are between 0.0 and 1.0) was used, so when making a prediction the demo had to use normalized data, which is (0.00097, 0.34, . . 0.191501). The model predicts that the median house price is \$24,870.07, quite close to the actual median price of \$26,400.

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with machine learning. The complete demo code is presented in this article. The source code and the two data files used by the demo are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

## Installing PyTorch

Installing PyTorch involves two main steps. First you install Python and several required auxiliary packages, such as NumPy and SciPy, then you install PyTorch as an add-on Python package.

Although it’s possible to install Python and the packages required to run PyTorch separately, it’s much better to install a Python distribution. For my demo, I installed the Anaconda3 5.2.0 distribution (which contains Python 3.6.5) and PyTorch 1.0.0. If you’re new to Python, be aware that installing and managing add-on package dependencies is non-trivial.

After installing Python via the Anaconda distribution, the PyTorch package can be installed using the pip utility with a .whl (“wheel”) file. PyTorch comes in a CPU-only version and in a GPU version. I used the CPU-only version.

## Understanding the Data

The Boston Housing dataset comes from a research paper written in 1978 that studied air pollution. You can find different versions of the dataset in many locations on the Internet. The first data item is:

``````
0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20
4.0900, 1, 296.0, 15.30, 396.90, 4.98, 24.00
``````

Each data item has 14 values and represents one of 506 towns near Boston. The first 13 numbers are the values of predictor variables and the last value is the median house price in the town (divided by 1,000). Briefly, the 13 predictor variables are: crime rate in the town, large lot percentage, percentage zoned for industry, adjacency to Charles River, pollution, average number of rooms per house, house age information, distance to Boston, accessibility to highways, tax rate, pupil-teacher ratio, proportion of Black residents, and percentage of low-status residents.

Because there are 14 variables, it’s not possible to visualize the dataset, but you can get a rough idea of the data from the graph in Figure 2. The graph shows median house price as a function of the percentage of town zoned for industry for the 102 items in the test dataset.

Figure 2 Partial Boston Area House Dataset

When working with neural networks, you must encode non-numeric data and you should normalize numeric data so that large values, such as a pupil-teacher ratio of 20.0, don’t overwhelm small values, such as a pollution reading of 0.538. The Charles River variable is a categorical value stored either as 0 (town is not adjacent) or 1 (adjacent). Those values were re-encoded as -1 and +1. The other 12 predictor variables are numeric. For each variable, I computed the min value and the max value, and then for every value x, normalized it as (x - min) / (max - min). After min-max normalization, all values will be between 0.0 and 1.0.
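As a minimal sketch of the preprocessing just described (not the script that was actually used to prepare the demo data), the following NumPy code min-max normalizes each numeric column and re-encodes a 0/1 column as -1/+1. The function name min_max_normalize, the three-row array and its column layout are assumptions made purely for illustration.

```python
import numpy as np

def min_max_normalize(data, skip_cols=()):
  # Scale each column to [0.0, 1.0] using (x - min) / (max - min).
  # Columns listed in skip_cols are copied through unchanged.
  # Assumes every normalized column has max > min.
  data = np.asarray(data, dtype=np.float32)
  result = data.copy()
  mins = data.min(axis=0)
  maxs = data.max(axis=0)
  for j in range(data.shape[1]):
    if j in skip_cols:
      continue
    result[:, j] = (data[:, j] - mins[j]) / (maxs[j] - mins[j])
  return result, mins, maxs

# Hypothetical mini-dataset: pupil-teacher ratio, pollution, river flag
raw = np.array([[15.3, 0.538, 0],
                [17.8, 0.469, 1],
                [20.0, 0.871, 0]], dtype=np.float32)

norm, mins, maxs = min_max_normalize(raw, skip_cols=(2,))
norm[:, 2] = np.where(raw[:, 2] == 0, -1.0, 1.0)  # re-encode 0/1 as -1/+1
print(norm)
```

The function also returns the per-column min and max values, because the same values are needed later to normalize any new item you want a prediction for.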

The median house values in the raw data were already normalized by dividing by 1,000, so the values ranged from 5.0 to 50.0, with most at about 25.0. I applied an additional normalization by dividing the prices by 10 so that all median house prices were between 0.5 and 5.0, with most being around 2.5.
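The two scalings are easy to mix up, so here is a small worked example of the arithmetic. The helper names to_training_scale and to_dollars are hypothetical and are not part of the demo program; the value 24.00 is the price field of the first data item shown earlier.

```python
def to_training_scale(raw_value):
  # raw_value is the median price in thousands of dollars, as stored in the data
  return raw_value / 10.0

def to_dollars(scaled_value):
  # Undo both scalings: x10 for the extra normalization, x1,000 for the raw units
  return scaled_value * 10000.0

scaled = to_training_scale(24.00)   # value the network is trained to output
dollars = to_dollars(scaled)        # back to a dollar amount
print("%0.2f  $%0.2f" % (scaled, dollars))
```

This is why the demo multiplies the network output by 10,000 when it displays a predicted price.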

## The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. I indent two spaces rather than the usual four spaces to save space. And note that Python uses the ‘\’ character for line continuation. I used Notepad to edit my program, but many of my colleagues prefer Visual Studio or VS Code, both of which have excellent support for Python.

Figure 3 The Boston Housing Demo Program

``````
# boston_dnn.py
# Boston Area House Price dataset regression
# Anaconda3 5.2.0 (Python 3.6.5), PyTorch 1.0.0
import numpy as np
import torch as T  # non-standard alias
# ------------------------------------------------------------
def accuracy(model, data_x, data_y, pct_close):
  # percentage of predictions within pct_close of the actual value
  n_items = len(data_y)
  X = T.Tensor(data_x)  # 2-d Tensor
  Y = T.Tensor(data_y)  # actual as 1-d Tensor
  oupt = model(X)       # all predicted as 2-d Tensor
  pred = oupt.view(n_items)  # all predicted as 1-d
  n_correct = T.sum((T.abs(pred - Y) < T.abs(pct_close * Y)))
  result = (n_correct.item() * 100.0 / n_items)  # scalar
  return result
# ------------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(13, 10)  # 13-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)
    T.nn.init.xavier_uniform_(self.hid1.weight)  # glorot
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)
  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)  # no activation, aka Identity()
    return z
# ------------------------------------------------------------
def main():
  # 0. Get started
  print("\nBoston regression using PyTorch 1.0 \n")
  T.manual_seed(1);  np.random.seed(1)
  # 1. Load data
  train_file = ".\\Data\\boston_train.txt"
  test_file = ".\\Data\\boston_test.txt"
  train_x = np.loadtxt(train_file, delimiter="\t",
    usecols=range(0,13), dtype=np.float32)
  train_y = np.loadtxt(train_file, delimiter="\t",
    usecols=[13], dtype=np.float32)
  test_x = np.loadtxt(test_file, delimiter="\t",
    usecols=range(0,13), dtype=np.float32)
  test_y = np.loadtxt(test_file, delimiter="\t",
    usecols=[13], dtype=np.float32)
  # 2. Create model
  print("Creating 13-(10-10)-1 DNN regression model \n")
  net = Net()  # all work done above
  # 3. Train model
  net = net.train()  # set training mode
  bat_size = 10
  loss_func = T.nn.MSELoss()  # mean squared error
  optimizer = T.optim.SGD(net.parameters(), lr=0.01)
  n_items = len(train_x)
  batches_per_epoch = n_items // bat_size
  max_batches = 1000 * batches_per_epoch
  print("Starting training")
  for b in range(max_batches):
    curr_bat = np.random.choice(n_items, bat_size,
      replace=False)
    X = T.Tensor(train_x[curr_bat])
    Y = T.Tensor(train_y[curr_bat]).view(bat_size,1)
    optimizer.zero_grad()  # clear gradients from previous batch
    oupt = net(X)
    loss_obj = loss_func(oupt, Y)
    loss_obj.backward()    # compute gradients
    optimizer.step()       # update weights and biases
    if b % (max_batches // 10) == 0:
      print("batch = %6d" % b, end="")
      print("  batch loss = %7.4f" % loss_obj.item(), end="")
      net = net.eval()
      acc = accuracy(net, train_x, train_y, 0.15)
      net = net.train()
      print("  accuracy = %0.2f%%" % acc)
  print("Training complete \n")
  # 4. Evaluate model
  net = net.eval()  # set eval mode
  acc = accuracy(net, test_x, test_y, 0.15)
  print("Accuracy on test data = %0.2f%%" % acc)
  # 5. Save model - TODO
  # 6. Use model
  raw_inpt = np.array([[0.09266, 34, 6.09, 0, 0.433, 6.495, 18.4,
    5.4917, 7, 329, 16.1, 383.61, 8.67]], dtype=np.float32)
  norm_inpt = np.array([[0.000970, 0.340000, 0.198148, -1,
    0.098765, 0.562177, 0.159629, 0.396666, 0.260870, 0.270992,
    0.372340, 0.966488, 0.191501]], dtype=np.float32)
  X = T.Tensor(norm_inpt)
  y = net(X)
  print("For a town with raw input values: ")
  for (idx,val) in enumerate(raw_inpt[0]):
    if idx % 5 == 0: print("")
    print("%11.6f " % val, end="")
  print("\n\nPredicted median house price = $%0.2f" %
    (y.item()*10000))
if __name__=="__main__":
  main()
``````

The demo imports the entire PyTorch package and assigns it an alias of T. An alternative is to import just the modules and functions needed.

The demo defines a helper function called accuracy. When using a regression model, there’s no inherent definition of the accuracy of a prediction. You must define how close a predicted value must be to a target value in order to be counted as a correct prediction. The demo program counts a predicted median house price as correct if it’s within 15 percent of the true value.
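To make the “within 15 percent” rule concrete, here is a NumPy-only sketch of the same test that the demo’s accuracy function applies to tensors. The function name accuracy_np and the four predicted and actual values are made up for illustration.

```python
import numpy as np

def accuracy_np(pred, actual, pct_close):
  # Percentage of predictions whose absolute error is less than
  # pct_close times the absolute actual value
  pred = np.asarray(pred, dtype=np.float32)
  actual = np.asarray(actual, dtype=np.float32)
  n_correct = np.sum(np.abs(pred - actual) < np.abs(pct_close * actual))
  return n_correct * 100.0 / len(actual)

actual = np.array([2.40, 2.16, 3.47, 1.50])  # made-up scaled prices
pred = np.array([2.49, 2.60, 3.30, 1.70])    # made-up predictions
acc = accuracy_np(pred, actual, 0.15)
print("accuracy = %0.2f%%" % acc)  # accuracy = 75.00%
```

Only the second prediction misses: it is off by 0.44, but 15 percent of 2.16 is 0.324. The other three fall inside their tolerance, giving 3 out of 4 correct.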

All the control logic for the demo program is contained in a single main function. Program execution begins by setting the global NumPy and PyTorch random seeds so results will be reproducible. The demo then loads the training and test data into memory using the NumPy loadtxt function:

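``````
train_x = np.loadtxt(train_file, delimiter="\t",
  usecols=range(0,13), dtype=np.float32)
train_y = np.loadtxt(train_file, delimiter="\t",
  usecols=[13], dtype=np.float32)
test_x = np.loadtxt(test_file, delimiter="\t",
  usecols=range(0,13), dtype=np.float32)
test_y = np.loadtxt(test_file, delimiter="\t",
  usecols=[13], dtype=np.float32)
``````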

The code assumes that the data is located in a subdirectory named Data. The demo data was preprocessed by splitting it into training and test sets. Data wrangling isn’t conceptually difficult, but it’s almost always quite time-consuming and annoying. Many of my colleagues like to use the pandas (Python data analysis) package to manipulate data.
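The article doesn’t show how the 506 items were divided into 404 training and 102 test items. One plausible way to do it, sketched here with NumPy on placeholder data, is to shuffle the row indices once and slice. The random placeholder array and the seed value are assumptions; with real data you’d load the normalized 506-by-14 matrix instead.

```python
import numpy as np

rng = np.random.RandomState(1)   # fixed seed so the split is repeatable
n_total, n_test = 506, 102

# Placeholder standing in for the normalized 506 x 14 data matrix
all_data = rng.rand(n_total, 14).astype(np.float32)

indices = rng.permutation(n_total)        # shuffled row indices 0..505
test_rows = all_data[indices[:n_test]]    # first 102 shuffled rows
train_rows = all_data[indices[n_test:]]   # remaining 404 rows
print(train_rows.shape, test_rows.shape)  # (404, 14) (102, 14)
```

Each array could then be written to a tab-delimited file with np.savetxt so that it can be read back by the loadtxt calls shown above.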

## Defining the Neural Network

The demo defines the 13-(10-10)-1 neural network in a program-defined class named Net that inherits from the nn.Module class. You can think of the Python __init__ function as a class constructor. Notice that you don’t explicitly define an input layer because input values are fed directly to the first hidden layer.

The network has (13 * 10) + (10 * 10) + (10 * 1) = 240 weights. Each weight is initialized to a small random value using the Xavier Uniform algorithm. The network has 10 + 10 + 1 = 21 biases. Each bias value is initialized to zero.
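The same counts can be computed for any fully connected architecture. This short helper, whose name count_params is hypothetical, multiplies adjacent layer sizes to count weights and sums the non-input layer sizes to count biases.

```python
def count_params(layer_sizes):
  # layer_sizes lists the node count of every layer, input layer first
  pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
  n_weights = sum(n_in * n_out for (n_in, n_out) in pairs)
  n_biases = sum(n_out for (_, n_out) in pairs)
  return n_weights, n_biases

n_weights, n_biases = count_params([13, 10, 10, 1])
print(n_weights, n_biases)  # 240 21
```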

The Net class forward function defines how the layers compute output. The demo uses tanh (hyperbolic tangent) activation on the two hidden layers, and no activation on the output layer:

``````
def forward(self, x):
  z = T.tanh(self.hid1(x))
  z = T.tanh(self.hid2(z))
  z = self.oupt(z)
  return z
``````

For hidden layer activation, the main alternative is rectified linear unit (ReLU) activation, but there are many other functions.

Because PyTorch works at a relatively low level of abstraction, there are many alternative design patterns you can use. For example, instead of defining a class Net with the __init__ and forward functions, and then instantiating with net = Net(), you can use the Sequential container class, like so:

``````
net = T.nn.Sequential(
  T.nn.Linear(13,10),
  T.nn.Tanh(),
  T.nn.Linear(10,10),
  T.nn.Tanh(),
  T.nn.Linear(10,1))
``````

The Sequential approach is much simpler, but notice that the layers are created with PyTorch’s default weight and bias initialization rather than the Xavier Uniform and zero initialization used in the Net class. The tremendous flexibility you get when using PyTorch is an advantage once you become familiar with the library.
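If you want a specific initialization with a Sequential network, one option is to re-initialize the layers after the network has been constructed. The following is a sketch of that idea, not part of the demo program; the helper name init_linear is hypothetical, and it relies on the apply method of nn.Module, which calls a function on every sub-module.

```python
import torch as T

def init_linear(layer):
  # apply() calls this on every sub-module; only Linear layers are changed
  if isinstance(layer, T.nn.Linear):
    T.nn.init.xavier_uniform_(layer.weight)
    T.nn.init.zeros_(layer.bias)

net = T.nn.Sequential(
  T.nn.Linear(13, 10),
  T.nn.Tanh(),
  T.nn.Linear(10, 10),
  T.nn.Tanh(),
  T.nn.Linear(10, 1))
net.apply(init_linear)  # overwrite the default initialization

print(net[0].bias)  # all zeros after re-initialization
```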

## Training the Model

Training the model begins with these seven statements:

``````
net = net.train()  # Set training mode
bat_size = 10
loss_func = T.nn.MSELoss()  # Mean squared error
optimizer = T.optim.SGD(net.parameters(), lr=0.01)
n_items = len(train_x)
batches_per_epoch = n_items // bat_size
max_batches = 1000 * batches_per_epoch
``````

A PyTorch network has two modes: train and eval. The default mode is train, but in my opinion it’s a good practice to explicitly set the mode. The batch (often called mini-batch) size is a hyperparameter. For a regression problem, mean squared error is the most common loss function. The stochastic gradient descent (SGD) algorithm is the most rudimentary optimization technique, and in many situations the Adam algorithm gives better results.

The demo program uses a simple approach for batching training items. For the demo, there are about 400 training items, so if the batch size is 10, on average visiting each training item once (this is usually called an epoch in machine learning terminology) will require 400 / 10 = 40 batches. Therefore, to train the equivalent of 1,000 epochs, the demo program needs 1000 * 40 = 40,000 batches.
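The arithmetic, and the difference between the demo’s random batches and a strict once-per-epoch pass, can be sketched with NumPy alone. The per-epoch shuffling shown in the second half is a common alternative and is not what the demo does; the seed value is arbitrary.

```python
import numpy as np

n_items, bat_size = 404, 10
batches_per_epoch = n_items // bat_size   # 404 // 10 = 40
max_batches = 1000 * batches_per_epoch    # 40,000 batches in total

rng = np.random.RandomState(1)

# Demo-style batch: 10 distinct random indices drawn from all 404 items,
# so over 40 batches each item is visited once only on average
curr_bat = rng.choice(n_items, bat_size, replace=False)

# Alternative (assumption, not used by the demo): shuffle once per epoch
# and walk through the shuffled indices so every item is visited exactly once
perm = rng.permutation(n_items)
epoch_batches = [perm[i:i+bat_size] for i in range(0, n_items, bat_size)]

print(batches_per_epoch, max_batches, len(epoch_batches))  # 40 40000 41
```

The shuffled version produces 41 batches because the last batch holds the four leftover items.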

The core training statements are:

``````
for b in range(max_batches):
  curr_bat = np.random.choice(n_items, bat_size,
    replace=False)
  X = T.Tensor(train_x[curr_bat])
  Y = T.Tensor(train_y[curr_bat]).view(bat_size,1)
  optimizer.zero_grad()  # Clear gradients from previous batch
  oupt = net(X)
  loss_obj = loss_func(oupt, Y)
  loss_obj.backward()  # Compute gradients
  optimizer.step()     # Update weights and biases
``````

The choice function selects 10 random indices from the 404 available training items. The items are converted from NumPy arrays to PyTorch tensors. You can think of a tensor as a multidimensional array that can be efficiently processed by a GPU (even though the demo doesn’t take advantage of a GPU). The oddly named view function reshapes the one-dimensional target values into a two-dimensional tensor. Converting NumPy arrays to PyTorch tensors, and dealing with array and tensor shapes, is a major challenge when working with PyTorch.
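To see the conversion, the view reshape and a full update step working together, here is a self-contained sketch that trains on synthetic data rather than the Boston data. The single Linear layer, the made-up target formula, the learning rate of 0.1 and the 500 batches are all assumptions chosen so the example runs quickly; only the pattern inside the loop mirrors the demo.

```python
import numpy as np
import torch as T

T.manual_seed(1);  np.random.seed(1)

# Synthetic stand-in data (not the Boston data): 40 items, 3 predictors
train_x = np.random.rand(40, 3).astype(np.float32)
coefs = np.array([0.5, -1.0, 2.0], dtype=np.float32)
train_y = train_x @ coefs      # 1-d array of 40 target values

net = T.nn.Linear(3, 1)        # stand-in for the 13-(10-10)-1 network
loss_func = T.nn.MSELoss()
optimizer = T.optim.SGD(net.parameters(), lr=0.1)
n_items, bat_size = len(train_x), 10

losses = []
for b in range(500):
  curr_bat = np.random.choice(n_items, bat_size, replace=False)
  X = T.Tensor(train_x[curr_bat])                    # shape [10, 3]
  Y = T.Tensor(train_y[curr_bat]).view(bat_size, 1)  # [10] -> [10, 1]
  optimizer.zero_grad()          # clear gradients from the previous batch
  oupt = net(X)                  # shape [10, 1], same shape as Y
  loss_obj = loss_func(oupt, Y)
  loss_obj.backward()            # compute gradients
  optimizer.step()               # update weights and biases
  losses.append(loss_obj.item())

print("first batch loss = %0.4f  last batch loss = %0.4f" %
  (losses[0], losses[-1]))
```

The view call matters because the network output has shape [10, 1]. If the targets were left with shape [10], the two tensors would not line up element by element when the loss is computed.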

Once every 4,000 batches the demo program displays the value of the mean squared error loss for the current batch of 10 training items, and the prediction accuracy of the model, using the current weights and biases on the entire 404-item training dataset:

``````
if b % (max_batches // 10) == 0:
  print("batch = %6d" % b, end="")
  print("  batch loss = %7.4f" % loss_obj.item(), end="")
  net = net.eval()
  acc = accuracy(net, train_x, train_y, 0.15)
  net = net.train()
  print("  accuracy = %0.2f%%" % acc)
``````

The “//” operator is integer division in Python. Before calling the program-defined accuracy function, the demo sets the network into eval mode. Technically, this isn’t necessary because train and eval modes only give different results if the network uses dropout or batch normalization layers.
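The demo network has neither kind of layer, but a small experiment with a standalone Dropout layer shows why the mode matters when one is present. This sketch is separate from the demo program; the drop probability of 0.5 and the all-ones input are arbitrary choices.

```python
import torch as T

T.manual_seed(1)
drop = T.nn.Dropout(p=0.5)
x = T.ones(1, 8)

drop.train()
out_train = drop(x)  # each value is either zeroed or scaled by 1/(1-p) = 2.0

drop.eval()
out_eval = drop(x)   # dropout is disabled, so the output equals the input

print(out_train)
print(out_eval)
```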

## Evaluating and Using the Trained Model

After training completes, the demo program evaluates the prediction accuracy of the model on the test dataset:

``````
net = net.eval()  # set eval mode
acc = accuracy(net, test_x, test_y, 0.15)
print("Accuracy on test data = %0.2f%%" % acc)
``````

The eval function returns a reference to the model on which it’s applied; it could have been called without the assignment statement.

In most situations, after training a model you want to save the model for later use. Saving a trained PyTorch model is a bit outside the scope of this article, but you can find several examples in the PyTorch documentation.
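For orientation only, one commonly used pattern is to save the model’s state dictionary (its weights and biases) and later load it into a freshly created network with the same architecture. The sketch below uses an untrained stand-in network and a hypothetical file name in the system temp directory; it is one possible approach, not the demo’s.

```python
import os
import tempfile
import torch as T

def make_net():
  # Stand-in architecture; saving and loading need matching layer shapes
  return T.nn.Sequential(
    T.nn.Linear(13, 10), T.nn.Tanh(), T.nn.Linear(10, 1))

net = make_net()
path = os.path.join(tempfile.gettempdir(), "boston_model.pt")  # hypothetical
T.save(net.state_dict(), path)        # save weights and biases only

net2 = make_net()                     # new network, different random weights
net2.load_state_dict(T.load(path))    # overwrite them with the saved values
net2 = net2.eval()

x = T.ones(1, 13)
same = T.equal(net(x), net2(x))       # both networks now give the same output
os.remove(path)                       # clean up the temporary file
print(same)
```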

The whole point of training a regression model is to use it to make a prediction. The demo program makes a prediction using the first data item from the 102 test items:

``````
raw_inpt = np.array([[0.09266, 34, 6.09, 0, 0.433, 6.495, 18.4,
  5.4917, 7, 329, 16.1, 383.61, 8.67]], dtype=np.float32)
norm_inpt = np.array([[0.000970, 0.340000, 0.198148, -1,
  0.098765, 0.562177, 0.159629, 0.396666, 0.260870, 0.270992,
  0.372340, 0.966488, 0.191501]], dtype=np.float32)
``````

When you have new data, you must remember to normalize the predictor values in the same way that the training data was normalized. For min-max normalization, that means you need to save the min and max value for every variable that was normalized.
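As an illustration, the following sketch normalizes the first two raw values of the test item using saved min and max values. The min and max numbers are assumed values for the crime rate and large lot columns, chosen for illustration; the function name normalize_new is hypothetical.

```python
import numpy as np

# Assumed saved statistics for two columns: crime rate, large lot percentage
mins = np.array([0.00632, 0.0], dtype=np.float32)
maxs = np.array([88.9762, 100.0], dtype=np.float32)

def normalize_new(x, mins, maxs):
  # Apply the training-time min-max scaling to a new raw item
  x = np.asarray(x, dtype=np.float32)
  return (x - mins) / (maxs - mins)

new_item = np.array([0.09266, 34.0], dtype=np.float32)  # first two raw inputs
norm_item = normalize_new(new_item, mins, maxs)
print("%0.6f %0.6f" % (norm_item[0], norm_item[1]))
```

With these assumed statistics the results round to 0.000970 and 0.340000, which agree with the first two normalized inputs shown above. Recomputing min and max from the new data instead of reusing the saved values would give inputs on a different scale from the one the model was trained on.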

The demo concludes by making and displaying the prediction:

``````
...
  X = T.Tensor(norm_inpt)
  y = net(X)
  print("Predicted = $%0.2f" % (y.item()*10000))
if __name__=="__main__":
  main()
``````

The predicted value is returned as a tensor with a single value. The item function is used to access the value so it can be displayed.

## Wrapping Up

The PyTorch library is somewhat less mature than alternatives such as TensorFlow, Keras and CNTK, especially with regard to example code. But among my colleagues, the use of PyTorch is growing very quickly. I expect this trend to continue and high-quality examples to become increasingly available to you.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd