December 2018

Volume 33 Number 12

[Test Run]

Autoencoders for Visualization Using CNTK

By James McCaffrey

Suppose you have data that describes a set of people. Each person is represented by an age value and a height value. If you wanted to graph your data you could put age on the x-axis, height on the y-axis and use colored dots (say, blue for male and pink for female). The data has just two dimensions so there’s no problem.

But what if your data has six dimensions, such as age, height, weight, income, debt, ZIP code? To graph such data, one approach is to condense the six-dimensional data down to two dimensions using a neural network autoencoder. You’ll lose some information, but you’ll be able to construct a 2D graph.

The best way to see where this article is headed is to take a look at the demo program in Figure 1. The demo data has 1,797 items. Each item has 64 dimensions and falls into one of 10 classes. The demo creates a Microsoft Cognitive Toolkit (CNTK) neural network autoencoder to condense each item down to two dimensions, labeled as component1 and component2, and then graphs the result. Interesting patterns in the data emerge. For example, data items that are class 0 (black dots) are quite different from class 7 items (orange dots). But class 8 items (red dots) have quite a bit of similarity with class 5 items (light green dots).

Figure 1 Autoencoder Visualization Using CNTK Demo Run

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with machine learning, but doesn’t assume you know anything about autoencoders. All of the demo code is presented in this article. The complete source code and the data file used by the demo program are also available in the download that accompanies this article. All normal error checking has been removed to keep the main ideas as clear as possible.

Understanding the Data

The demo data looks like:

0,0,0,1,11,0,0,0,0,0,0,7,8, ... 16,4,0,0,4
0,0,9,14,8,1,0,0,0,0,12,14, ... 5,11,1,0,8

There are 1,797 data items, one per line, and each line has 65 comma-delimited values. Each line of data represents an 8x8 crude handwritten digit. The first 64 values on a line are grayscale pixel values between 0 and 16. The last value on a line is the digit value, so the first item has pixel values representing a “4” and the second item has pixel values for an “8.”
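To make the layout concrete, here is a minimal sketch that splits one line into its 64 pixel values and its label, then prints the pixels as an 8x8 grid. The sample line is synthetic (it is not taken from the UCI file) and the helper name parse_digit_line is hypothetical.

```python
# Parse one 65-value line into an 8x8 pixel grid plus a class label.
# The sample line below is synthetic, made up for illustration only.

def parse_digit_line(line):
  values = [int(v) for v in line.strip().split(",")]
  if len(values) != 65:
    raise ValueError("expected 65 values, got %d" % len(values))
  pixels, label = values[:64], values[64]
  # regroup the flat list of 64 pixels into 8 rows of 8
  grid = [pixels[r * 8:(r + 1) * 8] for r in range(8)]
  return grid, label

# synthetic item: a vertical bar of intensity 16 in column 3, labeled 1
sample_line = ",".join(["0,0,0,16,0,0,0,0"] * 8) + ",1"
grid, label = parse_digit_line(sample_line)

for row in grid:
  print(" ".join("%2d" % p for p in row))
print("label =", label)
```

Printing the grid this way is a quick sanity check that the pixel values really do form a recognizable shape before any training is attempted.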

The goal of the demo autoencoder is to reduce the 64 dimensions of a data item down to just two values so the item can be plotted as a point on an x-y graph. Even though the demo data represents an image, autoencoders can work with any kind of high-dimensionality numeric data.

The demo data is called the UCI digits dataset and can be found at, or in the file download that accompanies this article.

Installing CNTK

Microsoft CNTK is a powerful open source neural network code library. CNTK is written in C++ for performance, but has a Python API for convenience and sanity. You don’t install CNTK as a standalone system. First you install a distribution of Python that contains the core Python interpreter and several hundred additional packages. Then you install CNTK as a Python add-on package.

Installing CNTK can be tricky. For the demo, I first installed the Anaconda3 4.1.1 distribution for Windows, using the nice self-extracting executable at The distribution contains Python 3.5.2 and several packages required by CNTK.

Next, I installed CNTK v2.4 by downloading the compatible CPU-only (my machine doesn’t have a GPU) CNTK .whl file to my local machine from Then I opened a command shell, navigated to the directory containing the CNTK .whl file and entered the command:

> pip install cntk-2.4-cp35-cp35m-win_amd64.whl

If you’re new to Python, you can loosely think of a .whl file as somewhat similar to an .msi installation file and pip as somewhat similar to the Windows Installer program. Python is quite brittle with regard to versioning, so you have to be very careful to install compatible versions of Anaconda Python and CNTK. The vast majority of CNTK installation problems I see are directly related to version incompatibilities.
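One way to catch a version mismatch before running pip is to compare the Python tag embedded in the .whl file name with the interpreter you are actually running. The sketch below assumes the standard wheel naming scheme and handles only simple CPython tags such as cp35. The helper name wheel_python_version is hypothetical.

```python
# Compare the Python tag in a wheel file name with the running interpreter.
# Handles only simple CPython tags such as "cp35"; other tags are not covered.
import sys

def wheel_python_version(whl_name):
  # wheel names end with: ...-pythontag-abitag-platformtag.whl
  parts = whl_name[:-len(".whl")].split("-")
  py_tag = parts[-3]                      # e.g. "cp35"
  digits = py_tag[2:]                     # e.g. "35"
  return int(digits[0]), int(digits[1:])  # e.g. (3, 5)

whl = "cntk-2.4-cp35-cp35m-win_amd64.whl"
wanted = wheel_python_version(whl)
running = (sys.version_info.major, sys.version_info.minor)
print("wheel built for Python %d.%d" % wanted)
print("running Python %d.%d" % running)
if wanted != running:
  print("version mismatch: pip would likely reject this wheel")
```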

The Demo Program

The structure of the demo program, with a few minor edits to save space, is presented in the listing in Figure 2. I indent with two spaces rather than the usual four spaces to save space. And note that Python uses the “\” character for line continuation. I used Notepad to edit my program. Most of my colleagues prefer a more sophisticated editor, but I appreciate the simplicity of Notepad.

Figure 2 The Autoencoder Demo Program Structure

# condense the UCI digits to two dimensions
# CNTK 2.4 Anaconda3 4.1.1 (Python 3.5.2)
import numpy as np
import cntk as C
import matplotlib.pyplot as plt
def main():
  # 0. get started
  print("Begin UCI digits autoencoder with CNTK demo ")
  # 1. load data into memory
  # 2. define autoencoder
  # 3. train model
  # 4. generate (x,y) pairs for each data item
  # 5. graph the data in 2D
  print("End autoencoder using CNTK demo ")
if __name__ == "__main__":
  main()

The demo program is named and it starts by importing the NumPy, CNTK and Pyplot packages. NumPy enables basic numeric operations in Python and Pyplot is used to display the scatter plot shown in Figure 1. After a startup message, program execution begins by setting the global NumPy random seed so results will be reproducible.

The demo loads the training data in memory using the NumPy loadtxt function:

data_file = ".\\Data\\digits_uci_test_1797.txt"
data_x = np.loadtxt(data_file, delimiter=",",
  usecols=range(0,64), dtype=np.float32)
labels = np.loadtxt(data_file, delimiter=",",
  usecols=[64], dtype=np.float32)
data_x = data_x / 16

The code assumes that the data is located in a subdirectory named Data. The loadtxt function has a lot of optional parameters. In this case, the function call specifies that the data is comma-delimited. The float32 datatype is the default for CNTK, so I could’ve omitted specifying it explicitly. The data_x object holds all 1,797 rows of the 64-pixel values and the labels object holds the corresponding 1,797 class labels, each a digit from 0 through 9. The data_x object is modified by dividing by 16 so that all pixel values are scaled between 0.0 and 1.0.
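The same loading pattern can be tried without the data file. The sketch below is an illustration under assumptions: it feeds loadtxt three synthetic rows through an in-memory StringIO object, and each row has four "pixels" plus a label rather than the 64 plus a label in the real file.

```python
# Load comma-delimited text with np.loadtxt and scale pixels to [0, 1].
# The three rows below are synthetic: 4 "pixels" plus a label per row.
import io
import numpy as np

text = "0,8,16,4,7\n16,0,2,12,3\n5,5,5,5,0\n"

# loadtxt accepts a file name or a file-like object; StringIO stands in
# for the data file, and is re-created for the second pass
data_x = np.loadtxt(io.StringIO(text), delimiter=",",
  usecols=range(0, 4), dtype=np.float32)
labels = np.loadtxt(io.StringIO(text), delimiter=",",
  usecols=[4], dtype=np.float32)
data_x = data_x / 16

print(data_x.shape, data_x.dtype)   # (3, 4) float32
print(labels)
print(data_x.min(), data_x.max())
```

Checking the shape, the dtype and the minimum and maximum after scaling is a cheap way to confirm the file was read the way you intended.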

Defining the Autoencoder

The demo creates a 64-32-2-32-64 neural network autoencoder model with these statements:

my_init = C.initializer.glorot_uniform(seed=1)
X = C.ops.input_variable(64, np.float32)  # inputs
layer1 = C.layers.Dense(32, init=my_init,
  activation=C.ops.sigmoid)(X)
layer2 = C.layers.Dense(2, init=my_init,
  activation=C.ops.sigmoid)(layer1)
layer3 = C.layers.Dense(32, init=my_init,
  activation=C.ops.sigmoid)(layer2)
layer4 = C.layers.Dense(64, init=my_init,
  activation=C.ops.sigmoid)(layer3)

The initializer object specifies the Glorot algorithm, which often works better than a uniform distribution on deep neural networks. The X object is set up to hold 64 input values. The next four layers create an autoencoder that crunches the 64 input values down to 32 values, and then crunches those down to just two values. An autoencoder is a specialized form of an encoder-decoder network. The diagram in Figure 3 illustrates autoencoder architecture.

Figure 3 Autoencoder Architecture

To keep the size of the diagram small, Figure 3 shows a 6-3-2-3-6 autoencoder rather than the 64-32-2-32-64 architecture of the demo autoencoder.
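To get a feel for the difference in scale between the two architectures, you can count the trainable values in each. This is a back-of-the-envelope sketch that assumes every layer is fully connected and has one bias per node. The function name count_parameters is hypothetical.

```python
# Count weights and biases in a fully connected network from its layer sizes.
# Assumes each layer has a full weight matrix plus one bias per output node.

def count_parameters(sizes):
  total = 0
  for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    total += n_in * n_out + n_out   # weight matrix plus bias vector
  return total

print(count_parameters([6, 3, 2, 3, 6]))        # 62
print(count_parameters([64, 32, 2, 32, 64]))    # 4354
```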

Notice that the output values are the same as the input values so the autoencoder learns to predict its own inputs. The decoder part of the autoencoder expands the two values in the middle layer back up to the original 64 values. The net result of all this is that each 64-dimensional data item is mapped down to just two values. To get at these values the demo program defines:

enc_dec = C.ops.alias(layer4)
encoder = C.ops.alias(layer2)

In words, the encoder-decoder accepts values in the X object and generates output in layer4 with 64 nodes. The encoder accepts values in the X object and generates output in layer2 with two nodes.
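The relationship between the two objects can be sketched in plain NumPy, without CNTK. The code below is an illustration only: the weights are random and untrained, the helper forward is hypothetical, and the point is just that the encoder is the first two layers of the same network that the encoder-decoder runs in full.

```python
# Forward pass through a 64-32-2-32-64 network with random, untrained
# weights, to show how the encoder and the encoder-decoder share layers.
import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(1)
sizes = [64, 32, 2, 32, 64]
weights = [rng.uniform(-0.1, 0.1, (a, b))
  for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x, n_layers):
  # apply the first n_layers dense-plus-sigmoid layers to x
  out = x
  for W, b in list(zip(weights, biases))[:n_layers]:
    out = sigmoid(out.dot(W) + b)
  return out

x = rng.uniform(0.0, 1.0, (5, 64))   # five synthetic 64-value items
encoded = forward(x, 2)              # stop after the two-node layer
decoded = forward(x, 4)              # run all four layers
print(encoded.shape, decoded.shape)  # (5, 2) (5, 64)
```

Because both paths use the same weight lists, training the full network also determines what the two-layer encoder produces.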

The demo uses sigmoid activation on all layers. This results in the values of all inner-layer nodes being between 0.0 and 1.0. There are many design options for autoencoders. You can use different numbers of inner layers and different numbers of nodes in each layer. You can use a different activation function, including the rarely used linear function. When using an autoencoder for dimensionality reduction for visualization, the quality of the resulting visualization is subjective, so your design choices are mostly a matter of trial and error.
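The choice of activation determines the range of the two plotted coordinates. The short sketch below compares three common choices on a few sample inputs, using only the standard math module. The helper names are for illustration.

```python
# Compare the output ranges of three activation functions.
import math

def sigmoid(z):
  return 1.0 / (1.0 + math.exp(-z))   # output in (0, 1)

def linear(z):
  return z                            # output unbounded

for z in [-4.0, 0.0, 4.0]:
  print("z = %5.1f  sigmoid = %0.4f  tanh = %7.4f  linear = %5.1f" %
    (z, sigmoid(z), math.tanh(z), linear(z)))
```

With sigmoid the points fall inside the unit square, with tanh they fall between -1.0 and +1.0 on each axis, and with linear activation the axes have no fixed bounds.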

Training the Autoencoder

Training is prepared using these seven statements:

Y = C.ops.input_variable(64, np.float32)  # targets
loss = C.squared_error(enc_dec, Y)
learner = C.adam(enc_dec.parameters, lr=0.005, momentum=0.90)
trainer = C.Trainer(enc_dec, (loss), [learner])
N = len(data_x)
bat_size = 2
max_epochs = 10000

The Y object is fed the same values as the X object, and the training-loss function compares the network’s output to those values. The demo specifies squared error for the loss function, but because each value is between 0.0 and 1.0 due to the use of the sigmoid activation function, cross-entropy error could’ve been used.
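To see how the two error measures behave, here is a sketch using textbook definitions on a tiny synthetic example. These are plain Python illustrations, not CNTK calls, and the way CNTK sums or averages its loss across nodes and items may differ from what is shown.

```python
# Textbook squared error and binary cross-entropy for one data item.
# Target and output values are synthetic; all outputs lie strictly in (0, 1).
import math

def squared_error(output, target):
  return sum((o - t) ** 2 for o, t in zip(output, target))

def binary_cross_entropy(output, target):
  return -sum(t * math.log(o) + (1.0 - t) * math.log(1.0 - o)
    for o, t in zip(output, target))

target = [0.0, 0.5, 1.0, 0.25]     # scaled pixel values
good = [0.1, 0.5, 0.9, 0.3]        # close reconstruction
poor = [0.9, 0.1, 0.2, 0.8]        # poor reconstruction

for name, out in [("good", good), ("poor", poor)]:
  print("%s: squared error = %0.4f  cross-entropy = %0.4f" %
    (name, squared_error(out, target), binary_cross_entropy(out, target)))
```

Both measures rank the close reconstruction ahead of the poor one. Cross-entropy is only defined here because sigmoid keeps every output strictly between 0.0 and 1.0.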

For deep neural networks, the Adam (adaptive moment estimation) algorithm often performs better than basic stochastic gradient descent. The learning rate value (0.005) and momentum value (0.90) are hyperparameters and must be determined by trial and error. The batch size (two) and maximum number of training iterations (10,000) are also hyperparameters.
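For readers who haven't seen Adam written out, the following is a minimal one-parameter sketch of the textbook algorithm, applied to a simple quadratic rather than to a neural network. The learning rate and momentum match the demo values. The second-moment decay (beta2) and epsilon are assumed textbook defaults, and CNTK's own implementation may differ in its details.

```python
# One-parameter textbook Adam, minimizing f(w) = (w - 3)^2.
# beta2 and eps are assumed defaults, not values taken from the demo.
import math

def adam_minimize(grad_fn, w, lr=0.005, beta1=0.90, beta2=0.999,
    eps=1.0e-8, steps=5000):
  m = 0.0   # running mean of gradients (the momentum term)
  v = 0.0   # running mean of squared gradients
  for t in range(1, steps + 1):
    g = grad_fn(w)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)   # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
  return w

# gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print("w after training = %0.4f" % w_final)
```

Dividing by the running root-mean-square of the gradient is what makes the step size adapt per parameter, which is the main difference from basic stochastic gradient descent.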

Training is performed like so:

for i in range(0, max_epochs):
  rows = np.random.choice(N, bat_size, replace=False)
  trainer.train_minibatch({ X: data_x[rows], Y: data_x[rows] })
  if i > 0 and i % int(max_epochs/10) == 0:
    mse = trainer.previous_minibatch_loss_average
    print("epoch = " + str(i) + " MSE = %0.4f " % mse)

On each training iteration, the random.choice function selects two rows from the 1,797 rows of data. This is a fairly crude approach because some rows may be selected more often than other rows. You could write a more sophisticated batching system, but for autoencoders the approach used by the demo is simple and effective.
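One possible form of a more sophisticated batching system is to shuffle the row indices once per pass and then walk through them in order, so every row is used exactly once before any row repeats. The sketch below shows the index bookkeeping only. The generator name epoch_batches is hypothetical and the training call is replaced by a counter.

```python
# Shuffle-then-slice batching: each row index appears once per pass.
import numpy as np

def epoch_batches(n_rows, bat_size, rng):
  # yield index arrays that together cover 0..n_rows-1 exactly once
  order = rng.permutation(n_rows)
  for start in range(0, n_rows, bat_size):
    yield order[start:start + bat_size]

rng = np.random.RandomState(1)
N = 1797
seen = np.zeros(N, dtype=np.int64)
n_batches = 0
for rows in epoch_batches(N, 2, rng):
  seen[rows] += 1   # in the demo this would be a train_minibatch call
  n_batches += 1

print(n_batches)                # 899 (the last batch has a single row)
print(seen.min(), seen.max())   # 1 1
```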

The demo monitors training by displaying the squared error loss on the current batch of two items, every 10,000 / 10 = 1,000 epochs. The idea is to make sure that error tends to decrease during training; however, because the batch size is so small, there’s quite a bit of fluctuation of the loss during training.
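If the fluctuation makes the trend hard to read, one option is to smooth the per-batch losses before displaying them. The sketch below applies an exponential moving average to a synthetic loss series. The series, the function names and the smoothing constant are all assumptions for illustration and are not part of the demo.

```python
# Smooth a noisy loss series with an exponential moving average.
import random

def smooth(values, alpha=0.1):
  # alpha near 0 means heavier smoothing
  result = []
  avg = values[0]
  for v in values:
    avg = alpha * v + (1.0 - alpha) * avg
    result.append(avg)
  return result

def roughness(xs):
  # mean absolute change between consecutive values
  return sum(abs(b - a) for a, b in zip(xs[:-1], xs[1:])) / (len(xs) - 1)

random.seed(1)
# synthetic losses: a decaying trend plus per-batch noise
raw = [2.0 * (0.99 ** i) + random.uniform(-0.3, 0.3) for i in range(300)]
smoothed = smooth(raw)

print("raw roughness      = %0.4f" % roughness(raw))
print("smoothed roughness = %0.4f" % roughness(smoothed))
```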

Using the Encoder

After training, the demo program generates (x, y) pairs for each of the 1,797 data items:

reduced = encoder.eval(data_x)

The return value is a matrix with 1,797 rows and two columns. Each row represents a reduced dimensionality version of the original data. The values in both columns are between 0.0 and 1.0 because the autoencoder uses sigmoid activation on all layers.

The demo program prepares the visualization with these statements:

print("Displaying 64-dim data in 2D: \n")
plt.scatter(x=reduced[:, 0], y=reduced[:, 1],
  c=labels, edgecolors='none', alpha=0.9,
  cmap=plt.cm.get_cmap('nipy_spectral', 10), s=20)

The x and y parameters of the scatter function are values for the x-axis and y-axis. The syntax reduced[:, 0] means all rows of the matrix but just the first column. The c parameter specifies colors. In this case the labels object, holding all of the 0 through 9 values for each data item, is passed to c.
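The column-slicing syntax is easy to try on a small array. In the sketch below a synthetic 3x2 matrix stands in for the 1,797x2 matrix returned by the encoder, and the label values are made up.

```python
# NumPy column slicing and label-based row selection on a tiny matrix.
import numpy as np

# synthetic stand-in for the (1797, 2) matrix returned by the encoder
reduced = np.array([[0.10, 0.90],
                    [0.40, 0.60],
                    [0.75, 0.25]], dtype=np.float32)
labels = np.array([4, 8, 0], dtype=np.float32)

xs = reduced[:, 0]    # all rows, first column
ys = reduced[:, 1]    # all rows, second column
print(xs.shape, ys.shape)   # (3,) (3,)

# a boolean mask selects the points that belong to one class
print(reduced[labels == 8])
```

The boolean-mask form is handy if you want to plot or inspect one digit class at a time.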

The alpha parameter specifies the transparency level of each marker dot. The cmap parameter accepts a color mapping. The values ‘nipy_spectral’ and 10 mean to fetch 10 values from a spectral color gradient that ranges from 0 = black, through 4 = green, to 9 = gray. The Pyplot library supports many different color maps. The s parameter is the size of the marker dots, measured in points squared (that is, the marker area).

After setting up the scatter plot, the demo program concludes like so:

  plt.xlabel('component 1')
  plt.ylabel('component 2')
  plt.colorbar()
  plt.show()
  print("End autoencoder using CNTK demo ")
if __name__ == "__main__":
  main()

The built-in colorbar function uses the 10 values from the nipy_spectral gradient. An alternative approach is to use the legend function.

Wrapping Up

There are several alternative approaches to dimensionality reduction for data visualization. Principal component analysis (PCA) is a classical statistics technique that has been used for decades. A newer technique, dating from 2008, is called t-distributed stochastic neighbor embedding (t-SNE) and it often works well, but only with relatively small datasets.
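For comparison, a bare-bones PCA reduction to two dimensions takes only a few lines of NumPy. This is a sketch of the standard center-then-SVD approach, run on random synthetic data standing in for data_x. The function name pca_2d is hypothetical.

```python
# PCA to two dimensions using singular value decomposition.
import numpy as np

def pca_2d(data):
  centered = data - data.mean(axis=0)
  # rows of vt are the principal directions, ordered by singular value
  u, s, vt = np.linalg.svd(centered, full_matrices=False)
  return centered.dot(vt[:2].T)

rng = np.random.RandomState(1)
data = rng.uniform(0.0, 1.0, (100, 64))   # synthetic stand-in for data_x
reduced = pca_2d(data)
print(reduced.shape)   # (100, 2)
```

Unlike the autoencoder, PCA is a linear projection with no hyperparameters to tune, which makes it a reasonable first baseline when judging an autoencoder visualization.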

Autoencoders can be used for purposes other than dimensionality reduction for data visualization. For example, you could use an autoencoder to remove noise from images or documents. The idea is to reduce data down to some sort of core components, removing noise in the process, and then expand the data back, resulting in cleaner data in some sense.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at

Thanks to the following Microsoft experts who reviewed this article: Chris Lee, Ricky Loynd   
