Sentiment Analysis 1 - Data Loading with Pandas

In this example, we develop a binary classifier on manually generated Twitter data to detect the sentiment of each tweet. For example, "This is awesome!" is labeled positive and "I am sad" negative. The input is the tweet text; we use the NimbusML NGramFeaturizer to extract numeric features and feed them to an AveragedPerceptron classifier.

Loading Data

import time
import pandas as pd
import numpy as np
import os
from IPython.display import display
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer
from nimbusml.feature_extraction.text.extractor import Ngram
from nimbusml.linear_model import AveragedPerceptronBinaryClassifier
from nimbusml.decomposition import PcaTransformer
from nimbusml import Pipeline
# Load data from package
trainDataFile = get_dataset('gen_twittertrain').as_filepath()
testDataFile = get_dataset('gen_twittertest').as_filepath()
print("Train data file path: " + str(os.path.basename(trainDataFile)))
print("Test data file path: " + str(os.path.basename(testDataFile)))

trainData = pd.read_csv(trainDataFile, sep = "\t")
testData = pd.read_csv(testDataFile, sep = "\t")

trainData.head()
Train data file path: train-twitter.gen-sample.tsv
Test data file path: test-twitter.gen-sample.tsv
Sentiment Text Label
0 Negative Oh you are hurting me 0
1 Positive So long 1
2 Positive Ths sofa is comfortable 1
3 Negative The place suck. No? 0
4 Positive @fakeid "Chillin" I love it!! 1

We use the "Text" column as the input feature and the "Sentiment" column as the label column (after converting to numeric).

Preprocessing

The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. The word counts are then normalized using the term frequency-inverse document frequency (TF-IDF) method.
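As a rough illustration of the weighting (a toy sketch only, not NimbusML's exact implementation, which also adds character n-grams and its own normalization), the TF-IDF weight of a term in a document is its term frequency multiplied by the log of the inverse document frequency:

import math

# Toy TF-IDF over a three-document corpus: tf(t, d) * log(N / df(t))
corpus = ["this is awesome", "i am sad", "this is sad"]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)            # term frequency within the document
    df = sum(1 for d in docs if term in d)     # documents containing the term (assumed > 0)
    return tf * math.log(N / df)

print(tfidf("sad", docs[1]))                   # "sad" occurs in 2 of the 3 documents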

In NimbusML, the user can specify the input columns for each operator to operate on. If none are specified, all columns from the previous operator or from the original dataset are used. The column syntax of nimbusml is discussed in more detail in Tutorial 2.2.
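For example, an operator can be pointed at specific input columns with nimbusml's << operator, which maps an output column name to a list of input columns (a brief sketch of that syntax; the output name 'features' is arbitrary here, and Tutorial 2.2 is the authoritative reference):

# Featurize only the "Text" column and collect the result under "features"
featurizer_on_text = NGramFeaturizer(word_feature_extractor=Ngram()) << {'features': ['Text']}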

Since the output of the text featurizer has multiple columns, they are named "output_col_name.[word sequence]" for visualization, where each column holds the normalized count of the word sequence [word sequence]. In this example, we train the model with only one input column, "Text".

featurizer = NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'TfIdf'))

Then we can call .fit_transform() to train the featurizer.

text_transformed = featurizer.fit_transform(trainData["Text"].to_frame()) # Using one column as input
print(text_transformed.shape)
text_transformed.head(5)
(71, 1007)
Text.Char.<␂>|o|h Text.Char.o|h|<␠> Text.Char.h|<␠>|y Text.Char.<␠>|y|o Text.Char.y|o|u Text.Char.o|u|<␠> Text.Char.u|<␠>|a Text.Char.<␠>|a|r Text.Char.a|r|e Text.Char.r|e|<␠> ... Text.Word.upset Text.Word.vocation Text.Word.flight Text.Word.late Text.Word.again Text.Word.commute Text.Word.died Text.Word.cancer Text.Word.oh, Text.Word.finally
0 0.218218 0.218218 0.218218 0.218218 0.218218 0.218218 0.218218 0.218218 0.218218 0.218218 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1007 columns

We can see that all the columns are features generated from the original "Text" column.
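For instance, the character n-gram and word n-gram features can be told apart by the column-name prefixes visible in the table above; this is plain pandas, nothing NimbusML-specific:

# Split the generated feature names by their "Text.Char." / "Text.Word." prefixes
char_cols = [c for c in text_transformed.columns if c.startswith("Text.Char.")]
word_cols = [c for c in text_transformed.columns if c.startswith("Text.Word.")]
print(str(len(char_cols)) + " character n-gram features, " + str(len(word_cols)) + " word n-gram features")

Based on those features, we can train a binary classifier.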

Binary Classifier

The user can pass the transformed data directly to the binary classifier via .fit(X, y).

ag = AveragedPerceptronBinaryClassifier()
ag.fit(text_transformed, 1 * (trainData["Sentiment"] == "Positive"))

The user can also use a NimbusML Pipeline to train the featurizer and the learner together.

t0 = time.time()

ppl = Pipeline([
                NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'Tf')), 
                PcaTransformer(rank = 100),
                AveragedPerceptronBinaryClassifier(l2_regularization=0.4,
                                                   number_of_iterations=5),
               ])

ppl.fit(trainData["Text"], trainData["Label"]) #will replace with series if supported

print("Training time: "  + str(round(time.time() - t0, 2)))
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Training calibrator.
Elapsed time: 00:00:00.7693293
Training time: 1.0

Using the NimbusML pipeline, we can call ppl.test(test_X, test_Y) to evaluate the trained model on the test set.

metrics, scores = ppl.test(testData["Text"], testData["Label"], output_scores = True)
print("Performance metrics: ")
display(metrics)
print("Individual scores: ")

# Append origin text to the score
scores["OriginText"] = testData["Text"]
scores["Sentiment"] = testData["Sentiment"]

display(scores[0:5])
print("Total runtime: "  + str(round(time.time() - t0, 2)))
Performance metrics: 
AUC Accuracy Positive precision Positive recall Negative precision Negative recall Log-loss Log-loss reduction Test-set entropy (prior Log-Loss/instance) F1 Score AUPRC
0 0.621295 0.662791 0 0 0.662791 1 1.006758 -0.091782 0.922123 NaN 0.529486
Individual scores: 
PredictedLabel Score Probability OriginText Sentiment
0 0 -0.128473 0.135174 @faketwitterid I am sad Negative
1 0 -0.125328 0.149867 @wakeup_you It is a very simple twit I created Negative
2 0 -0.114177 0.212648 @anotherfakeid I would love to see the latest ... Positive
3 0 -0.164016 0.038579 Oh my ladygaga! I haven't played tennis for 2 ... Negative
4 0 -0.091164 0.394444 I am heading on a road trip and taking a few d... Positive
Total runtime: 4.03
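
Once trained, the pipeline can also score new, unseen text. A minimal sketch (the two example tweets are made up, and we assume Pipeline.predict accepts a pandas Series just as fit does above):

# Score a couple of hand-written tweets with the trained pipeline
newTweets = pd.Series(["This is awesome!", "I am sad"], name="Text")
predictions = ppl.predict(newTweets)  # PredictedLabel, Score and Probability, as in the scores above
display(predictions)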