Sentiment Analysis 1 - Data Loading with Pandas

Article
10/13/2022

In this example, we develop a binary classifier that detects the sentiment of each tweet in a manually generated Twitter dataset. For example, "This is awesome!" is positive and "I am sad" is negative. The input is the tweet text; we use the NimbusML NGramFeaturizer to extract numeric features and feed them to an AveragedPerceptronBinaryClassifier.

Loading Data

import time
import pandas as pd
import numpy as np
import os
from IPython.display import display
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer
from nimbusml.feature_extraction.text.extractor import Ngram
from nimbusml.linear_model import AveragedPerceptronBinaryClassifier
from nimbusml.decomposition import PcaTransformer
from nimbusml import Pipeline

# Load data from package
trainDataFile = get_dataset('gen_twittertrain').as_filepath()
testDataFile = get_dataset('gen_twittertest').as_filepath()
print("Train data file path: " + str(os.path.basename(trainDataFile)))
print("Test data file path: " + str(os.path.basename(testDataFile)))

trainData = pd.read_csv(trainDataFile, sep = "\t")
testData = pd.read_csv(testDataFile, sep = "\t")

trainData.head()

Train data file path: train-twitter.gen-sample.tsv
Test data file path: test-twitter.gen-sample.tsv

	Sentiment	Text	Label
0	Negative	Oh you are hurting me	0
1	Positive	So long	1
2	Positive	Ths sofa is comfortable	1
3	Negative	The place suck. No?	0
4	Positive	@fakeid "Chillin" I love it!!	1

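We use the "Text" column as the input feature and the "Sentiment" column as the label (after converting it to a numeric value). The "Label" column holds the numeric form used later in the pipeline (0 for Negative and 1 for Positive in the rows shown).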

Preprocessing

The NGramFeaturizer transform produces a bag of counts of n-grams (sequences of consecutive words) from a given corpus of text. The counts are then weighted using the term frequency-inverse document frequency (TF-IDF) method.
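To make the idea concrete, here is a minimal pure-Python sketch of word n-gram counting followed by textbook TF-IDF weighting (term count times log(N / document frequency)). The function names and the three-line toy corpus are hypothetical, and the exact formula and normalization NimbusML applies may differ; this is only meant to show the shape of the computation.

```python
import math
from collections import Counter


def word_ngrams(text, n=1):
    """Return the word n-grams of a lower-cased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(corpus, n=1):
    """Weight n-gram counts per document with tf * log(N / df)."""
    counts = [Counter(word_ngrams(doc, n)) for doc in corpus]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for doc_counts in counts:
        doc_freq.update(doc_counts.keys())
    total_docs = len(corpus)
    return [
        {gram: tf * math.log(total_docs / doc_freq[gram])
         for gram, tf in doc_counts.items()}
        for doc_counts in counts
    ]


# Hypothetical toy corpus, not the tutorial dataset.
corpus = ["this is awesome", "i am sad", "this is sad"]
weights = tfidf(corpus)
bigrams = word_ngrams("this is awesome", n=2)
print(bigrams)
print(weights[0])
```

In this toy corpus, "awesome" occurs in one document only, so it receives a larger weight than "this", which occurs in two.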

In NimbusML, the user can specify the input column names that each operator runs on. If none are specified, all the columns from the previous operator or the original dataset are used. Tutorial 2.2 discusses the NimbusML column syntax in more detail.

Because the text featurizer outputs multiple columns, each output column is named "output_col_name.[word sequence]" for display and holds the normalized count of that word sequence. In this example, we train the model with only one column, "Text".

featurizer = NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'TfIdf'))

Then we can call .fit_transform() to train the featurizer.

text_transformed = featurizer.fit_transform(trainData["Text"].to_frame()) # Using one column as input
print(text_transformed.shape)
text_transformed.head(5)

(71, 1007)

	Text.Char.<␂>|o|h	Text.Char.o|h|<␠>	Text.Char.h|<␠>|y	Text.Char.<␠>|y|o	Text.Char.y|o|u	Text.Char.o|u|<␠>	Text.Char.u|<␠>|a	Text.Char.<␠>|a|r	Text.Char.a|r|e	Text.Char.r|e|<␠>	...
0	0.218218	0.218218	0.218218	0.218218	0.218218	0.218218	0.218218	0.218218	0.218218	0.218218	...
1	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...
2	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...
3	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...
4	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...

5 rows × 1007 columns

We can see that all the columns are features generated from the original "Text" column. Based on those features, we can train a binary classifier.
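The column names in the output above follow a recognizable pattern: lower-cased character trigrams of the tweet, with a start-of-text marker and a visible space symbol. The sketch below rebuilds such names for the first training tweet in plain Python. It is an illustration of the naming pattern only, not NimbusML code; the helper name is hypothetical, and the end-of-text marker `<␃>` is an assumption, since only `<␂>` and `<␠>` appear in the displayed columns.

```python
import math

START, END, SPACE = "<␂>", "<␃>", "<␠>"  # END is assumed, see the note above


def char_trigram_columns(text, source_column="Text"):
    """Build 'Text.Char.a|b|c'-style names for every character trigram."""
    symbols = [START] + [SPACE if ch == " " else ch for ch in text.lower()] + [END]
    return [
        f"{source_column}.Char." + "|".join(symbols[i:i + 3])
        for i in range(len(symbols) - 2)
    ]


columns = char_trigram_columns("Oh you are hurting me")
print(columns[:3])
# ['Text.Char.<␂>|o|h', 'Text.Char.o|h|<␠>', 'Text.Char.h|<␠>|y']
print(len(columns), len(set(columns)))   # 21 21
print(round(1 / math.sqrt(len(columns)), 6))  # 0.218218
```

Under these assumptions the tweet yields 21 distinct trigrams. If each has a count of 1 and the row is scaled to unit length, every entry equals 1/sqrt(21), which rounds to 0.218218. That is consistent with the first row of the output, although the source does not state which normalization is applied.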

Binary Classifier

The transformed data can then be passed to the binary classifier using .fit(X, Y).

ag = AveragedPerceptronBinaryClassifier()
ag.fit(text_transformed, 1 * (trainData["Sentiment"] == "Positive"))
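For intuition about what such a learner does, the following is a minimal sketch of the general averaged perceptron idea in pure Python: update the weights on every mistake, and return the average of the weights seen over all steps. It is not NimbusML's implementation and omits the regularization, normalization, and calibration the library performs; the function names and the four-row toy dataset are hypothetical.

```python
def train_averaged_perceptron(rows, labels, epochs=5):
    """Train on 0/1 labels and return the averaged weights and bias."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    weight_sum, bias_sum, steps = [0.0] * n_features, 0.0, 0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            target = 1 if y == 1 else -1
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if target * score <= 0:  # misclassified: move toward the target
                weights = [w + target * xi for w, xi in zip(weights, x)]
                bias += target
            # Accumulate the current model after every example.
            weight_sum = [s + w for s, w in zip(weight_sum, weights)]
            bias_sum += bias
            steps += 1
    return [s / steps for s in weight_sum], bias_sum / steps


def predict(weights, bias, x):
    """Return 1 for a positive score, otherwise 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Hypothetical two-feature toy data: first feature high means positive.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
avg_weights, avg_bias = train_averaged_perceptron(X, y)
predictions = [predict(avg_weights, avg_bias, row) for row in X]
print(predictions)
```

Averaging tends to make the final model less sensitive to the order of the last few updates than a plain perceptron.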

The user can also use NimbusML pipeline to train the featurizer and the learner together.

t0 = time.time()

ppl = Pipeline([
    NGramFeaturizer(word_feature_extractor=Ngram(weighting='Tf')),
    PcaTransformer(rank=100),
    AveragedPerceptronBinaryClassifier(l2_regularization=0.4,
                                       number_of_iterations=5),
])

ppl.fit(trainData["Text"], trainData["Label"]) #will replace with series if supported

print("Training time: "  + str(round(time.time() - t0, 2)))

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Training calibrator.
Elapsed time: 00:00:00.7693293
Training time: 1.0

Using the NimbusML pipeline, we can call ppl.test(test_X, test_Y) to evaluate the trained model on the test data.

metrics, scores = ppl.test(testData["Text"], testData["Label"], output_scores = True) #replace with series 
print("Performance metrics: ")
display(metrics)
print("Individual scores: ")

# Append the original text and sentiment to the scores
scores["OriginText"] = testData["Text"]
scores["Sentiment"] = testData["Sentiment"]

display(scores[0:5])
print("Total runtime: "  + str(round(time.time() - t0, 2)))

Performance metrics:

	AUC	Accuracy	Positive precision	Positive recall	Negative precision	Negative recall	Log-loss	Log-loss reduction	Test-set entropy (prior Log-Loss/instance)	F1 Score	AUPRC
0	0.621295	0.662791	0	0	0.662791	1	1.006758	-0.091782	0.922123	NaN	0.529486

Individual scores:

	Score	Probability	OriginText	Sentiment
0	-0.128473	0.135174	@faketwitterid I am sad	Negative
1	-0.125328	0.149867	@wakeup_you It is a very simple twit I created	Negative
2	-0.114177	0.212648	@anotherfakeid I would love to see the latest ...	Positive
3	-0.164016	0.038579	Oh my ladygaga! I haven't played tennis for 2 ...	Negative
4	-0.091164	0.394444	I am heading on a road trip and taking a few d...	Positive

Total runtime: 4.03
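The metrics above are easier to interpret next to the confusion counts that could produce them. The sketch below assumes a test set of 29 positive and 57 negative tweets in which every tweet is predicted negative. These counts are not stated in the source; they are inferred because they reproduce the reported accuracy and test-set entropy, and because every displayed score is negative. The helper names, and the choice to report a ratio with a zero denominator as 0, are also assumptions.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall for both classes from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for 0/0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "positive_precision": ratio(tp, tp + fp),
        "positive_recall": ratio(tp, tp + fn),
        "negative_precision": ratio(tn, tn + fn),
        "negative_recall": ratio(tn, tn + fp),
    }


def prior_entropy(positives, negatives):
    """Entropy in bits of the label distribution (prior log-loss per instance)."""
    total = positives + negatives
    return -sum((c / total) * math.log2(c / total)
                for c in (positives, negatives) if c)


# Inferred counts (assumption): 29 positives, 57 negatives, all predicted negative.
m = binary_metrics(tp=0, fp=0, tn=57, fn=29)
entropy = prior_entropy(29, 57)
log_loss = 1.006758  # taken from the metrics table
reduction = (entropy - log_loss) / entropy

print(m)
print(round(entropy, 4), round(reduction, 4))
```

Read this way, the accuracy of about 0.66 is just the share of negative tweets, and the negative log-loss reduction indicates that the predicted probabilities are less informative than always predicting the class prior. With positive precision and recall both at 0, the F1 score is undefined, which matches the NaN in the table.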
