Sentiment Analysis 1 - Data Loading with Pandas
In this example, we develop a binary classifier that detects the sentiment of each tweet, using manually generated Twitter data. For example, "This is awesome!" is positive and "I am sad" is negative. The input data is the tweet text; we use the NimbusML NGramFeaturizer to extract numeric features and feed them to an AveragedPerceptron classifier.
Loading Data
import time
import pandas as pd
import numpy as np
import os
from IPython.display import display
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.text import NGramFeaturizer
from nimbusml.feature_extraction.text.extractor import Ngram
from nimbusml.linear_model import AveragedPerceptronBinaryClassifier
from nimbusml.decomposition import PcaTransformer
from nimbusml import Pipeline
# Load data from package
trainDataFile = get_dataset('gen_twittertrain').as_filepath()
testDataFile = get_dataset('gen_twittertest').as_filepath()
print("Train data file path: " + str(os.path.basename(trainDataFile)))
print("Test data file path: " + str(os.path.basename(testDataFile)))
trainData = pd.read_csv(trainDataFile, sep = "\t")
testData = pd.read_csv(testDataFile, sep = "\t")
trainData.head()
Train data file path: train-twitter.gen-sample.tsv
Test data file path: test-twitter.gen-sample.tsv
| | Sentiment | Text | Label |
---|---|---|---|
0 | Negative | Oh you are hurting me | 0 |
1 | Positive | So long | 1 |
2 | Positive | Ths sofa is comfortable | 1 |
3 | Negative | The place suck. No? | 0 |
4 | Positive | @fakeid "Chillin" I love it!! | 1 |
We use the "Text" column as the input feature and the "Sentiment" column as the label column (after converting it to numeric). The "Label" column already holds this numeric encoding: 0 for Negative and 1 for Positive.
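As a small illustration of the label conversion, the sketch below rebuilds the five rows shown above as a hand-typed DataFrame (the variable names `sample` and `numeric_label` are made up for this example) and derives a 0/1 label from "Sentiment". It is only meant to show that the derived values agree with the "Label" column for these rows.

```python
import pandas as pd

# Hand-typed copy of the five rows printed by trainData.head()
sample = pd.DataFrame({
    "Sentiment": ["Negative", "Positive", "Positive", "Negative", "Positive"],
    "Text": [
        "Oh you are hurting me",
        "So long",
        "Ths sofa is comfortable",
        "The place suck. No?",
        '@fakeid "Chillin" I love it!!',
    ],
    "Label": [0, 1, 1, 0, 1],
})

# Boolean comparison, then cast to integers: Positive -> 1, Negative -> 0
numeric_label = (sample["Sentiment"] == "Positive").astype(int)

print(numeric_label.tolist())    # [0, 1, 1, 0, 1]
print(sample["Label"].tolist())  # [0, 1, 1, 0, 1]
```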
Preprocessing
The NGramFeaturizer transform produces a bag of counts of sequences of consecutive words, called n-grams, from a given corpus of text. The word counts are then normalized using the term frequency-inverse document frequency (TF-IDF) method.
In NimbusML, the user can specify the names of the input columns that each operator runs on. If none are specified, all the columns from the previous operator or the original dataset are used. The NimbusML column syntax is discussed in more detail in Tutorial 2.2.
Because the text featurizer produces multiple output columns, each one is named "output_col_name.[word sequence]" for visualization and holds the normalized count of [word sequence]. In this example, we train the model with only one column, "Text".
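To make the idea of n-gram counts and TF-IDF weighting concrete, here is a minimal pure-Python sketch. It is not NimbusML's implementation: the helper names `word_ngrams` and `tfidf` are hypothetical, and the weighting variant (raw count times log(N / document frequency)) is an assumption, since several TF-IDF variants exist and the one used by NGramFeaturizer may differ.

```python
import math
from collections import Counter

def word_ngrams(text, n=1):
    """Return the n-grams of a text as tuples of lowercased words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def tfidf(corpus, n=1):
    """Weight every n-gram of every document.

    Assumed variant: tf = raw count, idf = log(N / document frequency).
    """
    counts = [Counter(word_ngrams(doc, n)) for doc in corpus]
    # Iterating a Counter yields each distinct n-gram once per document
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(corpus)
    return [
        {gram: tf * math.log(n_docs / doc_freq[gram]) for gram, tf in c.items()}
        for c in counts
    ]

corpus = ["I love this sofa", "I am sad", "I love it"]
weights = tfidf(corpus)

print(word_ngrams("So long", n=2))  # [('so', 'long')]
print(weights[0][("i",)])           # 0.0, because "i" occurs in every document
print(weights[1][("sad",)] > weights[0][("love",)])  # True, rarer words weigh more
```

Under this variant, a word that appears in every document gets a weight of zero, while rarer words get larger weights.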
featurizer = NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'TfIdf'))
Then we can call .fit_transform() to train the featurizer and transform the training data.
text_transformed = featurizer.fit_transform(trainData["Text"].to_frame()) # Using one column as input
print(text_transformed.shape)
text_transformed.head(5)
(71, 1007)
| | Text.Char.<␂>\|o\|h | Text.Char.o\|h\|<␠> | Text.Char.h\|<␠>\|y | Text.Char.<␠>\|y\|o | Text.Char.y\|o\|u | Text.Char.o\|u\|<␠> | Text.Char.u\|<␠>\|a | Text.Char.<␠>\|a\|r | Text.Char.a\|r\|e | Text.Char.r\|e\|<␠> | ... | Text.Word.upset | Text.Word.vocation | Text.Word.flight | Text.Word.late | Text.Word.again | Text.Word.commute | Text.Word.died | Text.Word.cancer | Text.Word.oh, | Text.Word.finally |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | 0.218218 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 1007 columns
We can see that all the columns are features generated from the original "Text" column. Based on those features, we can train a binary classifier.
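The value 0.218218 in the first row can be reproduced by hand. Column names such as "Text.Char.o|h|<␠>" suggest character trigrams over the lowercased text with a start-of-text marker. The sketch below assumes a matching end-of-text marker and L2 normalization of the trigram counts. These are assumptions that happen to agree with the printed output, not a statement of how NimbusML computes the features; the function name `char_trigram_features` is hypothetical.

```python
import math
from collections import Counter

START, END = "\x02", "\x03"  # assumed start/end-of-text markers

def char_trigram_features(text):
    """Count character trigrams of the lowercased, padded text, then L2-normalize."""
    padded = START + text.lower() + END
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    counts = Counter(grams)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}

features = char_trigram_features("Oh you are hurting me")

print(len(features))                     # 21 distinct trigrams
print(round(features[START + "oh"], 6))  # 0.218218
```

With 21 distinct trigrams that each occur once, every entry equals 1/sqrt(21), which matches the first row of the table.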
Binary Classifier
The transformed data can then be used as input to the binary classifier by calling .fit(X, Y).
ag = AveragedPerceptronBinaryClassifier()
ag.fit(text_transformed, 1 * (trainData["Sentiment"] == "Positive"))
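For readers unfamiliar with the learner, the following is a minimal pure-Python sketch of the averaged perceptron idea: update the weights on each mistake and return the average of all weight vectors seen during training. It omits regularization, normalization, and calibration, so it should not be read as the NimbusML implementation; the function names and the toy data are made up for illustration.

```python
def train_averaged_perceptron(rows, labels, epochs=5):
    """Train on dense feature rows with 0/1 labels; return averaged (weights, bias)."""
    n_features = len(rows[0])
    w, b = [0.0] * n_features, 0.0
    w_sum, b_sum, steps = [0.0] * n_features, 0.0, 0
    for _ in range(epochs):
        for x, label in zip(rows, labels):
            y = 1 if label == 1 else -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # mistake: move the weights toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
            # Accumulate the current weights after every example
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            b_sum += b
            steps += 1
    return [s / steps for s in w_sum], b_sum / steps

def predict(model, x):
    """Return 1 when the averaged linear score is positive, else 0."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy, linearly separable data (hypothetical two-feature rows)
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]

model = train_averaged_perceptron(rows, labels)
print([predict(model, x) for x in rows])  # [1, 1, 0, 0]
```

Averaging tends to make the final model less sensitive to the last few updates than a plain perceptron.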
The user can also use a NimbusML pipeline to train the featurizer and the learner together.
t0 = time.time()
ppl = Pipeline([
    NGramFeaturizer(word_feature_extractor=Ngram(weighting = 'Tf')),
    PcaTransformer(rank = 100),
    AveragedPerceptronBinaryClassifier(l2_regularization=0.4,
                                       number_of_iterations=5),
])
ppl.fit(trainData["Text"], trainData["Label"]) #will replace with series if supported
print("Training time: " + str(round(time.time() - t0, 2)))
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Training calibrator.
Elapsed time: 00:00:00.7693293
Training time: 1.0
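Using the NimbusML pipeline, we can call ppl.test(test_X, test_Y) to evaluate the model on the test data. With output_scores = True, it returns both the performance metrics and the per-row scores.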
metrics, scores = ppl.test(testData["Text"], testData["Label"], output_scores = True) #replace with series
print("Performance metrics: ")
display(metrics)
print("Individual scores: ")
# Append the original text and sentiment to the scores
scores["OriginText"] = testData["Text"]
scores["Sentiment"] = testData["Sentiment"]
display(scores[0:5])
print("Total runtime: " + str(round(time.time() - t0, 2)))
Performance metrics:
| | AUC | Accuracy | Positive precision | Positive recall | Negative precision | Negative recall | Log-loss | Log-loss reduction | Test-set entropy (prior Log-Loss/instance) | F1 Score | AUPRC |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.621295 | 0.662791 | 0 | 0 | 0.662791 | 1 | 1.006758 | -0.091782 | 0.922123 | NaN | 0.529486 |
Individual scores:
| | PredictedLabel | Score | Probability | OriginText | Sentiment |
---|---|---|---|---|---|
0 | 0 | -0.128473 | 0.135174 | @faketwitterid I am sad | Negative |
1 | 0 | -0.125328 | 0.149867 | @wakeup_you It is a very simple twit I created | Negative |
2 | 0 | -0.114177 | 0.212648 | @anotherfakeid I would love to see the latest ... | Positive |
3 | 0 | -0.164016 | 0.038579 | Oh my ladygaga! I haven't played tennis for 2 ... | Negative |
4 | 0 | -0.091164 | 0.394444 | I am heading on a road trip and taking a few d... | Positive |
Total runtime: 4.03
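The metrics above are consistent with a model that predicts the negative class for every test row: positive recall is 0, negative recall is 1, and accuracy equals negative precision. The sketch below recomputes several of the reported numbers under that reading. The class counts (57 negative, 29 positive) are hypothetical values chosen because they reproduce the reported accuracy and test-set entropy; they are not taken from the dataset itself. Reporting an undefined 0/0 precision as 0 is likewise an assumed convention that mirrors the table.

```python
import math

def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0/0 shown as 0 (assumed)
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else float("nan"))
    return {"accuracy": (tp + tn) / len(pairs),
            "positive_precision": precision,
            "positive_recall": recall,
            "f1": f1}

def prior_entropy(y_true):
    """Entropy in bits of the label distribution (assumes both classes occur)."""
    p = sum(y_true) / len(y_true)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical class counts, with every row predicted as negative
y_true = [0] * 57 + [1] * 29
y_pred = [0] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
entropy = prior_entropy(y_true)
reported_log_loss = 1.006758  # copied from the metrics table
log_loss_reduction = (entropy - reported_log_loss) / entropy

print(round(metrics["accuracy"], 6))  # 0.662791
print(metrics["f1"])                  # nan
print(round(entropy, 4))              # 0.9221
print(round(log_loss_reduction, 4))   # -0.0918
```

A negative log-loss reduction means the predicted probabilities are less informative than always predicting the class prior, which suggests this pipeline configuration leaves room for tuning.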