April 2019

Volume 34 Number 4

### [Test Run]

# Neural Anomaly Detection Using PyTorch

Anomaly detection, also called outlier detection, is the process of finding rare items in a dataset. Examples include identifying malicious events in a server log file and finding fraudulent online advertising.

A good way to see where this article is headed is to take a look at the demo program in **Figure 1**. The demo analyzes a 1,000-item subset of the well-known Modified National Institute of Standards and Technology (MNIST) dataset. Each data item is a 28x28 grayscale image (784 pixels) of a handwritten digit from zero to nine. The full MNIST dataset has 60,000 training images and 10,000 test images.

**Figure 1 MNIST Image Anomaly Detection Using PyTorch**

The demo program creates and trains a 784-100-50-100-784 deep neural autoencoder using the PyTorch code library. An autoencoder is a neural network that learns to predict its input. After training, the demo scans through 1,000 images and finds the one image that’s most anomalous, where most anomalous means highest reconstruction error. The most anomalous digit is a three that looks like it could be an eight instead.

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with machine learning, but doesn’t assume you know anything about autoencoders. All the demo code is presented in this article. The code and data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

## Installing PyTorch

PyTorch is a relatively low-level code library for creating neural networks. It’s roughly similar in terms of functionality to TensorFlow and CNTK. PyTorch is written in C++, but has a Python language API for easier programming.

Installing PyTorch involves two main steps. First, you install Python and several required auxiliary packages, such as NumPy and SciPy. Then you install PyTorch as a Python add-on package. Although it’s possible to install Python and the packages required to run PyTorch separately, it’s much better to install a Python distribution, which is a collection containing the base Python interpreter and additional packages that are compatible with each other. For my demo, I installed the Anaconda3 5.2.0 distribution, which contains Python 3.6.5.

After installing Anaconda, I went to the pytorch.org Web site and selected the options for the Windows OS, Pip installer, Python 3.6 and no CUDA GPU version. This gave me a URL that pointed to the corresponding .whl (pronounced “wheel”) file, which I downloaded to my local machine. If you’re new to the Python ecosystem, you can think of a Python .whl file as somewhat similar to a Windows .msi file. In my case, I downloaded PyTorch version 1.0.0. I opened a command shell, navigated to the directory holding the .whl file and entered the command:

```
pip install torch-1.0.0-cp36-cp36m-win_amd64.whl
```

## The Demo Program

The complete demo program, with a few minor edits to save space, is presented in **Figure 2**. I indent with two spaces rather than the usual four spaces to save space. Note that Python uses the “\” character for line continuation. I used Notepad to edit my program. Most of my colleagues prefer a more sophisticated editor, but I like the brutal simplicity of Notepad.

**Figure 2 The Anomaly Detection Demo Program**

```
# auto_anom_mnist.py
# PyTorch 1.0.0 Anaconda3 5.2.0 (Python 3.6.5)
# autoencoder anomaly detection on MNIST

import numpy as np
import torch as T
import matplotlib.pyplot as plt

# -----------------------------------------------------------

def display(raw_data_x, raw_data_y, idx):
  label = raw_data_y[idx]  # like '5'
  print("digit/label = ", str(label), "\n")
  pixels = np.array(raw_data_x[idx])  # target row of pixels
  pixels = pixels.reshape((28,28))
  plt.rcParams['toolbar'] = 'None'
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()

# -----------------------------------------------------------

class Batcher:
  def __init__(self, num_items, batch_size, seed=0):
    self.indices = np.arange(num_items)
    self.num_items = num_items
    self.batch_size = batch_size
    self.rnd = np.random.RandomState(seed)
    self.rnd.shuffle(self.indices)
    self.ptr = 0

  def __iter__(self):
    return self

  def __next__(self):
    if self.ptr + self.batch_size > self.num_items:
      self.rnd.shuffle(self.indices)
      self.ptr = 0
      raise StopIteration  # ugh.
    else:
      result = self.indices[self.ptr:self.ptr+self.batch_size]
      self.ptr += self.batch_size
      return result

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.layer1 = T.nn.Linear(784, 100)  # hidden 1
    self.layer2 = T.nn.Linear(100, 50)
    self.layer3 = T.nn.Linear(50, 100)
    self.layer4 = T.nn.Linear(100, 784)
    T.nn.init.xavier_uniform_(self.layer1.weight)  # glorot
    T.nn.init.zeros_(self.layer1.bias)
    T.nn.init.xavier_uniform_(self.layer2.weight)
    T.nn.init.zeros_(self.layer2.bias)
    T.nn.init.xavier_uniform_(self.layer3.weight)
    T.nn.init.zeros_(self.layer3.bias)
    T.nn.init.xavier_uniform_(self.layer4.weight)
    T.nn.init.zeros_(self.layer4.bias)

  def forward(self, x):
    z = T.tanh(self.layer1(x))
    z = T.tanh(self.layer2(z))
    z = T.tanh(self.layer3(z))
    z = T.tanh(self.layer4(z))  # consider none or sigmoid
    return z

# -----------------------------------------------------------

def main():
  # 0. get started
  print("Begin autoencoder for MNIST anomaly detection")
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load data
  print("Loading MNIST subset data into memory ")
  data_file = ".\\Data\\mnist_pytorch_1000.txt"
  data_x = np.loadtxt(data_file, delimiter=" ",
    usecols=range(2,786), dtype=np.float32)
  labels = np.loadtxt(data_file, delimiter=" ",
    usecols=[0], dtype=np.float32)
  norm_x = data_x / 255

  # 2. create autoencoder model
  net = Net()

  # 3. train autoencoder model
  net = net.train()  # explicitly set
  bat_size = 40
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(net.parameters(), lr=0.01)
  batcher = Batcher(num_items=len(norm_x),
    batch_size=bat_size, seed=1)
  max_epochs = 100

  print("Starting training")
  for epoch in range(0, max_epochs):
    if epoch > 0 and epoch % (max_epochs/10) == 0:
      print("epoch = %6d" % epoch, end="")
      print("  prev batch loss = %7.4f" % loss_obj.item())
    for curr_bat in batcher:
      X = T.Tensor(norm_x[curr_bat])
      optimizer.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, X)  # note X not Y
      loss_obj.backward()
      optimizer.step()
  print("Training complete")

  # 4. analyze - find item(s) with large(st) error
  net = net.eval()  # not needed - no dropout
  X = T.Tensor(norm_x)  # all input items as Tensors
  Y = net(X)  # all outputs as Tensors
  N = len(data_x)
  max_se = 0.0; max_ix = 0
  for i in range(N):
    curr_se = T.sum((X[i]-Y[i])*(X[i]-Y[i]))
    if curr_se.item() > max_se:
      max_se = curr_se.item()
      max_ix = i

  raw_data_x = data_x.astype(int)
  raw_data_y = labels.astype(int)
  print("Highest reconstruction error is index ", max_ix)
  display(raw_data_x, raw_data_y, max_ix)
  print("End autoencoder anomaly detection demo ")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()
```

The demo program starts by importing the NumPy, PyTorch and Matplotlib packages. The Matplotlib package is used to visually display the most anomalous digit that’s found by the model. An alternative to importing the entire PyTorch package is to import just the necessary modules, for example, import torch.optim as opt.

## Loading Data into Memory

Working with the raw MNIST data is rather difficult because it’s saved in a proprietary, binary format. I wrote a utility program to extract the first 1,000 items from the 60,000 training items. I saved the data as mnist_pytorch_1000.txt in a Data subdirectory.

The resulting data looks like this:

```
7 = 0 255 67 . . 123
2 = 113 28 0 . . 206
...
9 = 0 21 110 . . 254
```

Each line represents one digit. The first value on each line is the digit. The second value is an arbitrary equal-sign character just for readability. The next 28x28 = 784 values are grayscale pixel values between zero and 255. All values are separated by a single blank-space character. **Figure 3** shows the data item at index [30] in the data file, which is a typical “3” digit.

**Figure 3 A Typical MNIST Digit**
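The extraction utility itself isn't shown in the article. A minimal sketch of such a converter, assuming the raw MNIST files are in the published IDX format (the function name and file paths here are hypothetical, not the article's utility):

```python
# mnist_to_text.py -- hypothetical converter sketch (not the article's utility)
import numpy as np

def make_text_subset(img_file, lbl_file, out_file, n_items):
  # IDX format: image files have a 16-byte header, label files an 8-byte header
  with open(img_file, "rb") as f:
    f.read(16)  # skip magic number, item count, row count, col count
    pixels = np.frombuffer(f.read(n_items * 784), dtype=np.uint8)
  with open(lbl_file, "rb") as f:
    f.read(8)   # skip magic number, item count
    labels = np.frombuffer(f.read(n_items), dtype=np.uint8)
  pixels = pixels.reshape(n_items, 784)
  with open(out_file, "w") as f:
    for i in range(n_items):
      # label, "=" separator, then 784 space-delimited pixel values
      f.write(str(labels[i]) + " = " +
        " ".join(str(p) for p in pixels[i]) + "\n")
```

Calling make_text_subset on the raw training files with n_items=1000 would produce a file in the format shown above.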

The dataset is loaded into memory with these statements:

```
data_file = ".\\Data\\mnist_pytorch_1000.txt"
data_x = np.loadtxt(data_file, delimiter=" ",
  usecols=range(2,786), dtype=np.float32)
labels = np.loadtxt(data_file, delimiter=" ",
  usecols=[0], dtype=np.float32)
norm_x = data_x / 255
```

Notice the digit/label is in column zero and the 784 pixel values are in columns two to 785. After all 1,000 images are loaded into memory, a normalized version of the data is created by dividing each pixel value by 255 so that the scaled pixel values are all between 0.0 and 1.0.

## Defining the Autoencoder Model

The demo program defines a 784-100-50-100-784 autoencoder. The number of nodes in the input and output layers (784) is determined by the data, but the number of hidden layers and the number of nodes in each layer are hyperparameters that must be determined by trial and error.

The demo program uses a program-defined class, Net, to define the layer architecture and the input-output mechanism of the autoencoder. An alternative is to create the autoencoder directly by using the Sequential function, for example:

```
net = T.nn.Sequential(
  T.nn.Linear(784,100), T.nn.Tanh(),
  T.nn.Linear(100,50), T.nn.Tanh(),
  T.nn.Linear(50,100), T.nn.Tanh(),
  T.nn.Linear(100,784), T.nn.Tanh())
```

The weight initialization algorithm (Glorot uniform), the hidden layer activation function (tanh) and the output layer activation function (tanh) are hyperparameters. Because all input and output values are between 0.0 and 1.0 for this problem, logistic sigmoid is a good alternative to explore for output activation.
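As a sketch of that sigmoid-output alternative, only the last line of the forward method changes (the NetSigmoid class name is hypothetical, and the explicit weight initialization is omitted here for brevity):

```python
import torch as T

class NetSigmoid(T.nn.Module):
  def __init__(self):
    super(NetSigmoid, self).__init__()
    self.layer1 = T.nn.Linear(784, 100)
    self.layer2 = T.nn.Linear(100, 50)
    self.layer3 = T.nn.Linear(50, 100)
    self.layer4 = T.nn.Linear(100, 784)

  def forward(self, x):
    z = T.tanh(self.layer1(x))
    z = T.tanh(self.layer2(z))
    z = T.tanh(self.layer3(z))
    z = T.sigmoid(self.layer4(z))  # outputs in (0.0, 1.0), matching the data
    return z
```

Because sigmoid output values are confined to (0.0, 1.0), they can never overshoot the normalized pixel range the way tanh outputs can undershoot below 0.0.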

## Training and Evaluating the Autoencoder Model

The demo program prepares training with these statements:

```
net = net.train()  # explicitly set
bat_size = 40
loss_func = T.nn.MSELoss()
optimizer = T.optim.Adam(net.parameters(), lr=0.01)
batcher = Batcher(num_items=len(norm_x),
  batch_size=bat_size, seed=1)
max_epochs = 100
```

Because the demo autoencoder doesn’t use dropout or batch normalization, it isn’t necessary to explicitly set the network into train mode, but in my opinion it’s good style to do so. The batch size (40), training optimization algorithm (Adam), initial learning rate (0.01) and maximum number of epochs (100) are all hyperparameters. If you’re new to neural machine learning, you might be thinking, “Neural networks sure have a lot of hyperparameters,” and you’d be correct.

The program-defined Batcher object serves up the indices of 40 random data items at a time until all 1,000 items have been processed (one epoch). An alternative approach is to use the built-in Dataset and DataLoader objects in the torch.utils.data module.
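A rough equivalent of the Batcher using those built-in objects might look like this sketch (the random stand-in data is an assumption; in the demo, norm_x would hold the normalized images):

```python
import numpy as np
import torch as T
from torch.utils.data import TensorDataset, DataLoader

norm_x = np.random.rand(1000, 784).astype(np.float32)  # stand-in data
ds = TensorDataset(T.tensor(norm_x))
loader = DataLoader(ds, batch_size=40, shuffle=True)  # reshuffles each epoch
for (X,) in loader:
  pass  # each X is a 40 x 784 batch, ready for oupt = net(X)
```

The DataLoader approach scales better to large datasets (it supports multi-process loading), but the hand-rolled Batcher makes the batching mechanism completely transparent.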

The structure of the training process is:

```
for epoch in range(0, max_epochs):
  # print loss every 10 epochs
  for curr_bat in batcher:
    X = T.Tensor(norm_x[curr_bat])
    optimizer.zero_grad()
    oupt = net(X)
    loss_obj = loss_func(oupt, X)
    loss_obj.backward()
    optimizer.step()
```

Each batch of items is created using the Tensor constructor, which uses torch.float32 as the default data type. Notice the loss_func function compares computed outputs to the inputs, which has the effect of training the network to predict its input values.

After training, you’ll usually want to save the model, but that’s a bit outside the scope of this article. The PyTorch documentation has good examples that show how to save a trained model in several different ways.
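For instance, the approach the PyTorch documentation generally recommends is to save just the learned parameters via state_dict (the file name here is arbitrary, and the single Linear layer is a stand-in for the trained autoencoder):

```python
import torch as T

net = T.nn.Linear(784, 100)  # stand-in for the trained autoencoder
T.save(net.state_dict(), "autoencoder.pth")      # save parameters only
net2 = T.nn.Linear(784, 100)                     # re-create the same shape
net2.load_state_dict(T.load("autoencoder.pth"))  # then load weights into it
net2.eval()
```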

When working with autoencoders, in most situations (including this example) there’s no inherent definition of model accuracy. You must determine how close computed output values must be to the associated input values in order to be counted as a correct prediction, and then write a program-defined function to compute your accuracy metric.
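One possible program-defined metric (an assumption for illustration, not from the article) is to count an item as correct when every reconstructed pixel is within some epsilon of its input value. A sketch:

```python
import numpy as np

def accuracy(inputs, outputs, eps=0.10):
  # an item is "correct" if all 784 reconstructed pixels are within eps
  n_correct = 0
  for x, y in zip(inputs, outputs):
    if np.all(np.abs(x - y) <= eps):
      n_correct += 1
  return n_correct / len(inputs)
```

The choice of eps is itself a free parameter: a tight value makes almost every item "wrong," while a loose value makes the metric meaningless, so it must be tuned to the problem at hand.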

## Using the Autoencoder Model to Find Anomalous Data

After the autoencoder model has been trained, the idea is to find data items that are difficult to correctly predict or, equivalently, items that are difficult to reconstruct. The demo code scans through all 1,000 data items and calculates the squared difference between the normalized input values and the computed output values like this:

```
net = net.eval()  # not needed - no dropout
X = T.Tensor(norm_x)  # all input items as Tensors
Y = net(X)  # all outputs as Tensors
N = len(data_x)
max_se = 0.0; max_ix = 0
for i in range(N):
  curr_se = T.sum((X[i]-Y[i])*(X[i]-Y[i]))
  if curr_se.item() > max_se:
    max_se = curr_se.item()
    max_ix = i
```

The maximum squared error (max_se) is calculated and the index of the associated image (max_ix) is saved. An alternative to finding the single item with the largest reconstruction error is to save all squared errors, sort them and return the top-n items where the value of n will depend on the particular problem you’re investigating.
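A sketch of that top-n variant: compute all per-item squared errors in one vectorized pass, then sort. The random stand-in data and the Y = X * 0.9 line are assumptions replacing the trained model here; in the demo, Y = net(X) would supply the outputs.

```python
import numpy as np
import torch as T

norm_x = np.random.rand(100, 784).astype(np.float32)  # stand-in data
X = T.Tensor(norm_x)
Y = X * 0.9  # stand-in for Y = net(X)
sq_errs = T.sum((X - Y) * (X - Y), dim=1).detach().numpy()  # per-item error
top_n = np.argsort(sq_errs)[::-1][:5]  # indices of the 5 largest errors
```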

After the single-most-anomalous data item has been found, it’s shown using the program-defined display function:

```
raw_data_x = data_x.astype(int)
raw_data_y = labels.astype(int)
print("Highest reconstruction error is index ", max_ix)
display(raw_data_x, raw_data_y, max_ix)
```

The pixel and label values are converted from type float32 to int mostly as a matter of principle because the Matplotlib imshow function inside the program-defined display function can accept either data type.

## Wrapping Up

Anomaly detection using a deep neural autoencoder, as presented in this article, is not a well-investigated technique. A big advantage of using a neural autoencoder compared to most standard clustering techniques is that neural techniques can handle non-numeric data by encoding that data. Most clustering techniques depend on a numeric measure, such as Euclidean distance, which means the source data must be strictly numeric.

A related but also little-explored technique for anomaly detection is to create an autoencoder for the dataset under investigation. Then, instead of using reconstruction error to find anomalous data, you can cluster the data using a standard algorithm such as k-means because the innermost hidden layer nodes hold a strictly numeric representation of each data item. After clustering, you can look for clusters that have very few data items, or look for data items within clusters that are most distant from their cluster centroid. This approach has characteristics that resemble neural word embedding, where words are converted to numeric vectors that can then be used to compute a distance measure between words.
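A sketch of extracting that inner representation (the encode method is hypothetical; the article's Net class would need it added, and the explicit weight initialization is omitted for brevity). The resulting 50-value codes could then be passed to any k-means implementation, such as sklearn.cluster.KMeans:

```python
import torch as T

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.layer1 = T.nn.Linear(784, 100)
    self.layer2 = T.nn.Linear(100, 50)
    self.layer3 = T.nn.Linear(50, 100)
    self.layer4 = T.nn.Linear(100, 784)

  def encode(self, x):
    # run only the first half of the network: the 50-node inner layer
    z = T.tanh(self.layer1(x))
    return T.tanh(self.layer2(z))

net = Net()  # in practice, a trained autoencoder
X = T.rand(200, 784)  # stand-in for the normalized images
codes = net.encode(X).detach().numpy()  # 200 x 50, strictly numeric
# codes can now be clustered with a standard algorithm such as k-means
```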

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products, including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd