October 2018

Volume 33 Number 10

Test Run - Sentiment Analysis Using CNTK

By James McCaffrey

Imagine you have text data, such as a collection of e-mail messages or online product reviews, and you want to determine if the overall feeling is positive or negative (or possibly neutral). That’s the goal of sentiment analysis. In this article I show you how to create a sentiment analysis system using the Microsoft CNTK code library.

Take a look at the screenshot in Figure 1 to see where this article is headed. The demo program uses the IMDB movie review dataset, which has a total of 50,000 reviews. There are 25,000 reviews in the training set and 25,000 reviews in the test set. Each set has 12,500 positive reviews and 12,500 negative reviews.

Figure 1 Sentiment Analysis Using CNTK

The demo program uses a small subset of the IMDB dataset—only reviews that have 50 words or less. Behind the scenes, the demo uses the CNTK library to create a long short-term memory (LSTM) neural network and trains it for 400 iterations. After training, the model’s accuracy on the held-out test reviews is 60.12 percent—not very good because of the small dataset size. The demo concludes by making a prediction for a previously unseen review of “I like this movie.” The numeric prediction is (0.2740, 0.7259) and because the second value is larger than the first, the prediction is that this is a positive review.

This article assumes you have intermediate or better programming skill with a C-family language and a basic familiarity with neural networks, but doesn’t assume you know much about LSTM networks. The entire demo code is presented in the article, and the associated data files are available in the accompanying download.

Understanding the Data

The key to understanding the sentiment analysis demo is understanding the structure of the data files. Suppose you have just two reviews: “I like this movie” and “Movie is terrible.” A CNTK-format data file would look like:

0 |x 12:1 |y 0 1
0 |x 407:1
0 |x 13:1
0 |x 20:1
1 |x 20:1 |y 1 0
1 |x 9:1
1 |x 387:1

The first value on each line is a sequence number. The “|x” and “|y” tags indicate the start of input and output values, respectively. For output, a negative review is encoded as (1, 0) and a positive review is encoded as (0, 1). Each word has an integer index value: “i” = 12, “like” = 407, “this” = 13, “movie” = 20, “is” = 9 and “terrible” = 387.

The LSTM model expects input to be one-hot encoded. This is a vector consisting of all zeros except for a single one value at the index of the associated word. For example, if an entire vocabulary consists of just the four words “good,” “bad,” “ugly,” “neutral,” then “good” = (1, 0, 0, 0), “bad” = (0, 1, 0, 0), “ugly” = (0, 0, 1, 0) and “neutral” = (0, 0, 0, 1).
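
To make that concrete, here’s a minimal NumPy sketch of one-hot encoding for that four-word vocabulary (the helper function and variable names are illustrative, not part of the demo):

import numpy as np

vocab = ["good", "bad", "ugly", "neutral"]

def one_hot(word):
  # all zeros except a single 1 at the word's index
  vec = np.zeros(len(vocab), dtype=np.float32)
  vec[vocab.index(word)] = 1.0
  return vec

print(one_hot("bad"))  # [0. 1. 0. 0.]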

The IMDB movie review dataset can be found in several locations on the Internet. The dataset has 129,888 distinct words in the training set. Therefore, if you used a direct one-hot encoding approach, each word would be encoded as a huge vector with 129,887 0s and a single 1, which is extremely inefficient. CNTK supports a sparse format. For example, the entry 12:1 means, “place a 1 at index 12 and make all the other values 0.” The total vector length must be inferred or supplied elsewhere.

In principle, the index values for each word can be anything because they act only as unique IDs. However, the demo program encodes each word using a scheme that’s fairly common for natural language systems. Words are encoded according to frequency in the source corpus, where 4 is assigned to the most frequent word, 5 is assigned to the second most frequent word and so on.

A value of 0 is reserved for padding in situations where you want all input sequences to have the same length. A value of 1 is used for “start of sequence” in situations where data isn’t organized with an explicit delimiter (typically a newline). A value of 2 is used for “out of vocabulary” for words that are unknown because they didn’t occur in a training dataset. A value of 3 is reserved for custom usage.
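
For example, here’s a minimal sketch of building such a frequency-ranked index (the toy corpus and variable names are illustrative; the demo’s actual utility program works over all 25,000 training reviews):

from collections import Counter

reviews = ["i like this movie", "movie is terrible"]  # toy corpus
counts = Counter(w for r in reviews for w in r.split())
# IDs 0-3 are reserved, so the most frequent word gets ID 4
word_to_id = { w : i + 4 for i, (w, c) in
  enumerate(counts.most_common()) }
print(word_to_id)  # e.g. {'movie': 4, 'i': 5, 'like': 6, ... }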

Before starting to work on the LSTM model, I wrote a custom program to generate a file of training data and a file of test data. In very high-level pseudo-code, here’s what it does:

read all 50,000 training and test reviews into memory,
  removing punctuation and converting to lowercase.
create a vocabulary collection of all unique words
  in training data.
sort the collection by word frequency,
  1 = most frequent.
loop through each review
  if review has more than 50 words, skip it
  write output to train, test file in CNTK format.
end-loop.
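
Here’s a minimal sketch of the write step, assuming a frequency-ranked word-to-ID dictionary like the one sketched above (file handling and sequence numbering are simplified compared to the actual utility program):

def write_review(f, seq_id, review, label, w2id):
  # label is (1, 0) for negative, (0, 1) for positive
  words = review.lower().split()
  if len(words) > 50:  # skip reviews longer than 50 words
    return
  for i, w in enumerate(words):
    idx = w2id.get(w, 2)  # 2 = out-of-vocabulary
    line = "%d |x %d:1" % (seq_id, idx)
    if i == 0:  # the label goes only on the first line of a sequence
      line += " |y %d %d" % label
    f.write(line + "\n")

w2id = { "i" : 12, "like" : 407, "this" : 13,
  "movie" : 20, "is" : 9, "terrible" : 387 }
with open("tiny_reviews.txt", "w") as f:
  write_review(f, 0, "I like this movie", (0, 1), w2id)
  write_review(f, 1, "Movie is terrible", (1, 0), w2id)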

There are a lot of details to consider when preparing data for a sentiment analysis model. I removed all punctuation characters except for the single-quote character to retain words such as don’t and wouldn’t. All words in the reviews were converted to lowercase. I didn’t remove any stop words such as “and” and “the.” The final training file contains 620 reviews and the test file has 667 reviews. I also created a tiny file containing two reviews of “I like this movie” and “Movie is terrible.”

The IMDB movie review dataset is binary because reviews are either positive or negative. For binary classification, it’s common to encode negative as 0 and positive as 1. However, in order to make the demo program easily adaptable to a multi-class problem (for example, where a review can be positive, negative or neutral) I encoded negative as (1, 0) and positive as (0, 1). 

Understanding LSTM Networks

A sentence is a sequence of words. In most cases the meaning of a word is highly dependent on the preceding words. For example, suppose a review reads, “I wish I could say this was a great movie.” A naive approach that just looks at single words would see “great” and probably conclude the review is positive. LSTM networks have a memory, so they can handle sequence data such as sentences.

There are many variations of LSTM networks and closely related networks, such as gated recurrent units (GRUs). You can get an idea of how LSTMs work by examining the diagram in Figure 2. The diagram shows a simplified LSTM cell. As you’ll see shortly, an LSTM network consists of one or more LSTM cells plus additional plumbing.

Figure 2 A Simplified LSTM Cell

The x(t) is the input at time t, which for sentiment analysis is a word in a sentence. The h(t) is the output at time t. The c(t) is the cell state, or memory, at time t. In the diagram, the cell state is shown as having the same size as the output vector, but in most cases the cell state will be larger (so the LSTM would require some additional components to scale the vector sizes correctly). Notice that the output at time t depends on the input and the cell state, and that the output at time t-1 and cell state at time t-1 contribute to both cell state and cell output.
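
To make the cell mechanics concrete, here is a minimal NumPy sketch of a single LSTM time step (the weight shapes and random values are purely illustrative; CNTK’s LSTM layer handles all of this internally):

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

n_in, n_mem = 4, 3  # input size, cell-state/output size
rng = np.random.RandomState(1)
W = rng.randn(4 * n_mem, n_in)   # input weights for the four gates
U = rng.randn(4 * n_mem, n_mem)  # recurrent weights
b = np.zeros(4 * n_mem)

def lstm_step(x, h_prev, c_prev):
  z = W.dot(x) + U.dot(h_prev) + b
  f = sigmoid(z[0 : n_mem])              # forget gate
  i = sigmoid(z[n_mem : 2 * n_mem])      # input gate
  o = sigmoid(z[2 * n_mem : 3 * n_mem])  # output gate
  g = np.tanh(z[3 * n_mem : ])           # candidate memory
  c = (f * c_prev) + (i * g)             # new cell state c(t)
  h = o * np.tanh(c)                     # new output h(t)
  return h, c

h, c = np.zeros(n_mem), np.zeros(n_mem)
for x in rng.randn(5, n_in):  # feed a five-step sequence
  h, c = lstm_step(x, h, c)
print(h)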

Overall Program Structure

Installing CNTK involves two main steps. First, you install a Python distribution, then you install CNTK as an add-on package. A Python distribution consists of a specific version of the core Python interpreter plus several hundred useful packages that have been verified to work with the Python version and with each other. In the open source world, version incompatibilities can be a nightmare.

I used the Anaconda3 4.1.1 distribution, which includes Python 3.5.2. You can find this by doing an Internet search for “anaconda archive.” After installing Anaconda, I searched for “cntk python” and found a .whl installer file for CNTK 2.4 and downloaded it to my local machine. You can think of a .whl file as somewhat similar to a Windows .msi file. I installed CNTK 2.4 by launching a command shell and then using the pip utility:

> pip install C:\MyWheels\cntk-2.4-cp35-cp35m-win_amd64.whl

The structure of the demo program, with a few minor edits to save space, is shown in Figure 3. I use Notepad as my CNTK editor of choice, but most of my colleagues prefer something more sophisticated.

Figure 3 Overall Program Structure

# imdb_lstm.py
import numpy as np
import cntk as C
def create_reader(path, input_dim, output_dim,
  is_random, sweeps):
def main():
  # get started
  print("Begin IMDB demo")
  np.random.seed(1)
  # define LSTM model
  # train model
  # evaluate model
  # make a prediction
if __name__ == "__main__":
  main()

Program-defined function create_reader looks for the “|x” and “|y” tags in a CNTK-format data file. The definition is shown in Figure 4.

Figure 4 The create_reader Definition

def create_reader(path, input_dim, output_dim,
  is_random, sweeps):
  # x values are sparse one-hot word vectors; y values are dense two-value labels
  x_strm = C.io.StreamDef(field='x', shape=input_dim,
    is_sparse=True)
  y_strm = C.io.StreamDef(field='y',
    shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm,
    y_src=y_strm)
  # wrap the streams in a CTF-format deserializer and a mini-batch source
  deserial = C.io.CTFDeserializer(path, streams)
  mb_source = C.io.MinibatchSource(deserial,
    randomize=is_random, max_sweeps=sweeps)
  return mb_source

Notice that the x-input values are indicated as sparse, but the y-output values are not. The main function sets the global NumPy random number generator seed to 1 so that results are reproducible. Then the demo sets up the key variables:

train_file = ".\\Data\\imdb_sparse_train_50w.txt"
test_file = ".\\Data\\imdb_sparse_test_50w.txt"
input_dim = 129888 + 4
output_dim = 2
X = C.sequence.input_variable(shape=input_dim,
  is_sparse=True)
Y = C.ops.input_variable(output_dim)

Recall that the IMDB training dataset has 129,888 distinct words and that the data files use an offset of 4. If you’re performing sentiment analysis on your own data, you’ll need to record the number of distinct words when you create your training and test data files. Next, the demo sets up a data reader for the training data:

rdr = create_reader(train_file, input_dim,
  output_dim, is_random=True, sweeps=C.io.INFINITELY_REPEAT)
imdb_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}

It’s important to process training data in a random order. The sweeps parameter is set so that the training data can be traversed multiple times.

Defining the LSTM Model

The demo program creates the LSTM model with these statements:

my_init = C.initializer.glorot_uniform(seed=1)
lstm = C.layers.Sequential([
  C.layers.Embedding(50, init=my_init),
  C.layers.Recurrence(C.layers.LSTM(25)),
  C.sequence.last,
  C.layers.Dense(output_dim, init=my_init)])
model = lstm(X)

LSTM networks are often highly sensitive to how the model weights are initialized. The glorot_uniform algorithm is often used, but there are several alternatives, including random uniform and random normal.
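
For example, two common alternatives look like this (the scale values shown are illustrative, not tuned for this demo):

# alternatives to glorot_uniform (illustrative scale values)
uniform_init = C.initializer.uniform(scale=0.01, seed=1)  # random uniform
normal_init = C.initializer.normal(scale=0.01, seed=1)    # random normal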

The Embedding layer is critical. In theory, you can pass the raw index values that represent words directly to an LSTM network, but in practice you should convert each word into a vector of multiple values. In this case, the Embedding layer will convert each input word into a vector of 50 values.

The code sets up a single LSTM cell with a cell state/memory size of 25. The embedding vector size and the LSTM cell size are free parameters and good values must be determined by trial and error. The model’s last layer uses the Dense function so that the output is constrained to two values because the target output is either (1, 0) or (0, 1).

Training and Evaluating the LSTM Model

The demo prepares training with these statements:

max_iter = 400
batch_size = 10 * 40  # ~9 sequences
lr = C.learning_parameter_schedule_per_sample(0.1)
sgd = C.sgd(model.parameters, lr)
tr_loss = C.cross_entropy_with_softmax(model, Y)
tr_accu = C.classification_error(model, Y)
trainer = C.Trainer(model, (tr_loss,tr_accu), [sgd])

Because an average review length is about 40 words, setting the batch_size to 400 will fetch approximately nine or 10 reviews per training batch. For simplicity, the demo uses the basic stochastic gradient descent algorithm, which is rarely the best choice for LSTM networks. The Adam, AdaGrad and RMSprop algorithms often work better. Figure 5 shows how training is performed.

Figure 5 Training the LSTM Model

for i in range(max_iter):
  mb = rdr.next_minibatch(batch_size, input_map=imdb_map)
  trainer.train_minibatch(mb)
  if i % int(max_iter/5) == 0:
    print("i = %d" % i)
    num_seqs = mb[Y].num_sequences
    print("curr mini-batch has %d sequences" % num_seqs)
    curr_class_err = (1.0 - \
      trainer.previous_minibatch_evaluation_average) * 100
    print("accuracy curr mb = %0.2f%%" % curr_class_err)
    curr_loss = trainer.previous_minibatch_loss_average
    print("loss curr mb = %0.4f" % curr_loss)

A progress message is displayed every max_iter / 5 = 80 iterations to monitor loss and accuracy on the current batch of training items. After training completes, the demo evaluates the model by applying it to the test data. First, a new reader object is created:

rdr = create_reader(test_file, input_dim,
  output_dim, is_random=False, sweeps=1)
imdb_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}

Notice that there’s no need to traverse the test data in random order, and you need to traverse it only once. The classification accuracy is computed and displayed:

num_to_test = 500 * 25000
all_test = rdr.next_minibatch(num_to_test,
  input_map=imdb_map)
class_acc = (1.0 - trainer.test_minibatch(all_test)) * 100
print("Classification accuracy on all test items = \
 %0.2f%%" % class_acc)

The num_to_test variable is set to capture far more reviews than there are in the test data. Because the reader is configured to traverse the test data only once, the excess count will be ignored. One of the minor quirks of CNTK is that there’s a classification error function (percentage incorrect) rather than a classification accuracy function, so the demo subtracts from 1 to get an accuracy.

Making a Prediction

To make a sentiment prediction on a new, previously unseen review, the easiest approach is to set up a file that has the same structure as the training or test data. The demo sets up a reader for the file:

review_file = ".\\Data\\my_reviews_50w.txt"
rdr = create_reader(review_file, input_dim,
  output_dim, is_random=False, sweeps=1)
imdb_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}

Next, a single review is read into memory as a CNTK mini-batch object and output is computed using the eval method:

review_mb = rdr.next_minibatch(1, input_map=imdb_map)
model = C.ops.softmax(model)
predicted = model.eval(review_mb[X])
print(predicted)

Before calling eval, the model is modified using the softmax function. This coerces the two output values to sum to 1.0, which makes interpreting the result a bit easier.
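
Here’s a minimal sketch of converting the two output values into a friendly label (this assumes a single review in the mini-batch, as in the demo):

label = "positive" if np.argmax(predicted[0]) == 1 else "negative"
print("Predicted sentiment: " + label)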

Wrapping Up

As recently as about 24 months ago, creating a custom sentiment analysis model would have been considered an extreme leading-edge application of deep learning, and you likely would have had to use a canned solution such as Azure Cognitive Services. But the emergence of the CNTK library, as well as similar libraries such as Keras/TensorFlow, makes custom sentiment analysis feasible. To be sure, sentiment analysis is not an easy problem, but even so, the demo program presented in this article should give you all the information you need to get started on a production-quality system.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Joey Carson, Si-Qing Chen, Eunice Kim, Lucas Meyer

