Tutorial: Use ML.NET in a sentiment analysis binary classification scenario

This sample tutorial illustrates using ML.NET to create a sentiment classifier to predict either positive or negative sentiment via a .NET Core console application using C# in Visual Studio 2017. In the world of machine learning, this type of prediction is known as binary classification.

Note

This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit the ML.NET introduction.

This tutorial and related sample are currently using ML.NET version 0.11. For more information, see the release notes at the dotnet/machinelearning GitHub repo.

In this tutorial, you learn how to:

  • Understand the problem
  • Select the appropriate machine learning algorithm
  • Prepare your data
  • Transform the data
  • Train the model
  • Evaluate the model
  • Predict with the trained model
  • Deploy and Predict with a loaded model

Sentiment analysis sample overview

The sample is a console app that uses ML.NET to train a model that classifies and predicts sentiment as either positive or negative. The Yelp sentiment dataset comes from the University of California, Irvine (UCI) and is split into a train dataset and a test dataset. The sample evaluates the model with the test dataset for quality analysis.

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

Machine learning workflow

This tutorial follows a machine learning workflow that enables the process to move in an orderly fashion.

The workflow phases are as follows:

  1. Understand the problem
  2. Prepare your data
    • Load the data
    • Extract features (Transform your data)
  3. Build and train
    • Train the model
    • Evaluate the model
  4. Deploy Model
    • Use the Model to predict

Understand the problem

You first need to understand the problem, so you can break it down to parts that can support building and training the model. Breaking the problem down allows you to predict and evaluate the results.

The problem for this tutorial is to understand incoming website comment sentiment to take the appropriate action.

You can break down the problem to the sentiment text and sentiment value for the data you want to train the model with, and a predicted sentiment value that you can evaluate and then use operationally.

You then need to determine the sentiment, which helps you with the machine learning task selection.

Select the appropriate machine learning algorithm

With this problem, you know the following facts:

Training data: website comments can be positive (1) or negative (0) (sentiment).

Predict the sentiment of a new website comment, either positive or negative, such as in the following examples:

  • I love the wait staff here. They rock.
  • This place has the worst soup.

The classification machine learning algorithm is best suited for this scenario.

About the classification algorithm

Classification is a machine learning algorithm that uses data to determine the category, type, or class of an item or row of data. For example, you can use classification to:

  • Identify sentiment as positive or negative.
  • Classify email as spam, junk, or good.
  • Determine whether a patient's lab sample is cancerous.
  • Categorize customers by their propensity to respond to a sales campaign.

Classification algorithms are frequently one of the following types:

  • Binary: either A or B.
  • Multiclass: multiple categories that can be predicted by using a single model.

Because the website comments need to be classified as either positive or negative, you use the Binary Classification algorithm.

Create a console application

  1. Open Visual Studio 2017. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "SentimentAnalysis" and then select the OK button.

  2. Create a directory named Data in your project to save your data set files:

    In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and hit Enter.

  3. Install the Microsoft.ML NuGet Package:

    In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, select that package in the list, and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare your data

  1. Download the UCI Sentiment Labeled Sentences dataset zip file (see citations in the following note), and unzip it.

  2. Copy the yelp_labelled.txt file into the Data directory you created.

Note

The datasets this tutorial uses are from 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015, and are hosted at the UCI Machine Learning Repository - Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

  3. In Solution Explorer, right-click the yelp_labelled.txt file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using statements to the top of the Program.cs file:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.Data.DataView;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms.Text;

You need to create two global fields to hold the recently downloaded dataset file path and the saved model file path:

  • _dataPath has the path to the dataset used to train the model.
  • _modelPath has the path where the trained model is saved.

Add the following code to the line right above the Main method to specify those paths:

static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "yelp_labelled.txt");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");

You need to create some classes for your input data and predictions. Add a new class to your project:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to SentimentData.cs. Then, select the Add button.

    The SentimentData.cs file opens in the code editor. Add the following using statement to the top of SentimentData.cs:

using Microsoft.ML.Data;

Remove the existing class definition and add the following code, which defines two classes, SentimentData and SentimentPrediction, to the SentimentData.cs file:

public class SentimentData
{
    [LoadColumn(0)]
    public string SentimentText;

    [LoadColumn(1), ColumnName("Label")]
    public bool Sentiment;
}

public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    // [ColumnName("Probability")]
    public float Probability { get; set; }

    // [ColumnName("Score")]
    public float Score { get; set; }
}

The input dataset class, SentimentData, has a string for the comment (SentimentText) and a bool (Sentiment) that holds the sentiment value, either positive or negative. Both fields have LoadColumnAttribute(Int32) attributes attached to them. This attribute describes the order of each field in the data file. In addition, the Sentiment field has a ColumnNameAttribute to designate it as the Label field.

SentimentPrediction is the class used for prediction after the model has been trained. It has a boolean (Prediction) with a PredictedLabel ColumnName attribute, plus Probability and Score values. The Label is used to create and train the model, and it's also used with the split out test dataset to evaluate the model. The PredictedLabel is used during prediction and evaluation. For evaluation, an input with training data, the predicted values, and the model are used.
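To make the column mapping concrete, here is a minimal sketch of what the LoadColumn indices mean, written without ML.NET. It assumes each line of the data file holds the comment text, a tab character, and then 0 or 1. The tab separator, the ParsedRow type, and the ParseLine helper are assumptions made for illustration and aren't part of the tutorial's code.

```csharp
using System;

// Hypothetical stand-in for SentimentData, used only in this sketch.
public class ParsedRow
{
    public string SentimentText;
    public bool Sentiment;
}

public static class ColumnMappingSketch
{
    // Splits one line into column 0 (text) and column 1 (label).
    // Assumes a tab separator and a label of "0" or "1".
    public static ParsedRow ParseLine(string line)
    {
        string[] columns = line.Split('\t');
        if (columns.Length != 2)
            throw new FormatException("Expected exactly two tab-separated columns.");

        return new ParsedRow
        {
            SentimentText = columns[0],           // corresponds to LoadColumn(0)
            Sentiment = columns[1].Trim() == "1"  // corresponds to LoadColumn(1), the Label
        };
    }

    public static void Main()
    {
        ParsedRow row = ParseLine("I love the wait staff here. They rock.\t1");
        Console.WriteLine($"{row.SentimentText} => {row.Sentiment}");
    }
}
```

In the tutorial itself, the loader does this mapping for you based on the attributes, so you don't write parsing code like this by hand.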

When building a model with ML.NET, you start by creating an MLContext. MLContext is conceptually comparable to DbContext in Entity Framework. It provides a context for your ML job that can be used for exception tracking and logging.

Initialize variables in Main

Create a variable called mlContext and initialize it with a new instance of MLContext. Replace the Console.WriteLine("Hello World!") line with the following code in the Main method:

MLContext mlContext = new MLContext();

Add the following as the next line of code in the Main method:

TrainCatalogBase.TrainTestData splitDataView = LoadData(mlContext);

The LoadData method executes the following tasks:

  • Loads the data.
  • Splits the loaded dataset into train and test datasets.
  • Returns the split train and test datasets.

Create the LoadData method, just after the Main method, using the following code:

public static TrainCatalogBase.TrainTestData LoadData(MLContext mlContext)
{

}

Load the data

Since your previously created SentimentData data model type matches the dataset schema, you can combine the initialization, mapping, and dataset loading into one line of code using the MLContext.Data.LoadFromTextFile wrapper for the LoadFromTextFile method. It returns an IDataView.

As the input and output of Transforms, a DataView is the fundamental data pipeline type, comparable to IEnumerable for LINQ.

In ML.NET, data is similar to a SQL view. It is lazily evaluated, schematized, and heterogeneous. The object is the first part of the pipeline, and loads the data. For this tutorial, it loads a dataset with comments and corresponding positive or negative sentiment. This is used to create the model, and train it.
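The comparison with IEnumerable and LINQ can be illustrated without ML.NET. The following minimal sketch shows lazy evaluation in plain LINQ: building a query reads no rows, and rows are only pulled when the query is enumerated. The ReadRows method and RowsRead counter are hypothetical names used for illustration, and the sketch makes no claim about how IDataView is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyEvaluationSketch
{
    // Counts how many rows were actually read from the source.
    public static int RowsRead;

    // Yields rows one at a time, like a data source that is read on demand.
    public static IEnumerable<string> ReadRows()
    {
        string[] rows = { "Great food", "Slow service", "Loved it" };
        foreach (string row in rows)
        {
            RowsRead++;
            yield return row;
        }
    }

    public static void Main()
    {
        // Building the query reads nothing yet.
        IEnumerable<int> lengths = ReadRows().Select(r => r.Length);
        Console.WriteLine($"Rows read after building the query: {RowsRead}");

        // Enumerating the query pulls the rows through the pipeline.
        List<int> materialized = lengths.ToList();
        Console.WriteLine($"Rows read after enumerating: {RowsRead}");
    }
}
```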

Add the following code as the first line of the LoadData method:

IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(_dataPath, hasHeader: false);

Split the dataset for model training and testing

Next, you need both a training dataset to train the model and a test dataset to evaluate the model. Use the MLContext.BinaryClassification.TrainTestSplit method, which wraps TrainTestSplit, to split the loaded dataset into train and test datasets and return them inside of a TrainCatalogBase.TrainTestData. You can specify the fraction of data for the test set with the testFraction parameter. The default is 10%, but you use 20% in this case to use more data for the evaluation.

To split the loaded data into the needed datasets, add the following code as the next line in the LoadData method:

TrainCatalogBase.TrainTestData splitDataView = mlContext.BinaryClassification.TrainTestSplit(dataView, testFraction: 0.2);

Return the splitDataView at the end of the LoadData method:

return splitDataView;
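The idea behind a train/test split can be sketched in plain C#. The following is a minimal illustration, not ML.NET's implementation: it shuffles the rows with a seeded random number generator and holds out the requested fraction for testing. The Split helper, its seed parameter, and the integer stand-in rows are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the rows with a fixed seed, then cuts off the test fraction.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> rows, double testFraction, int seed)
    {
        var random = new Random(seed);
        List<T> shuffled = rows.OrderBy(_ => random.Next()).ToList();
        int testCount = (int)Math.Round(shuffled.Count * testFraction);

        // The first testCount shuffled rows become the test set; the rest are for training.
        return (shuffled.Skip(testCount).ToList(), shuffled.Take(testCount).ToList());
    }

    public static void Main()
    {
        // 1000 integers stand in for the loaded rows.
        var split = Split(Enumerable.Range(0, 1000), testFraction: 0.2, seed: 1);
        Console.WriteLine($"Train: {split.Train.Count}, Test: {split.Test.Count}");
    }
}
```

Keeping the test rows out of training is what lets the later evaluation step measure how the model behaves on data it hasn't seen.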

Build and train the model

Add the following call to the BuildAndTrainModel method as the next line of code in the Main method:

ITransformer model = BuildAndTrainModel(mlContext, splitDataView.TrainSet);

The BuildAndTrainModel method executes the following tasks:

  • Extracts and transforms the data.
  • Trains the model.
  • Predicts sentiment based on test data.
  • Returns the model.

Create the BuildAndTrainModel method, just after the Main method, using the following code:

public static ITransformer BuildAndTrainModel(MLContext mlContext, IDataView splitTrainSet)
{

}

Notice that two parameters are passed into the BuildAndTrainModel method: an MLContext for the context (mlContext), and an IDataView for the training dataset (splitTrainSet).

Extract and transform the data

Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results.

ML.NET's transform pipelines compose a custom set of transforms that are applied to your data before training or testing. The transforms' primary purpose is data featurization. Machine learning algorithms understand featurized data, so the next step is to transform our textual data into a format that our ML algorithms recognize. That format is a numeric vector.

Next, call mlContext.Transforms.Text.FeaturizeText, which featurizes the text column (SentimentText) into a numeric vector called Features that is used by the machine learning algorithm. This is a wrapper call that returns an EstimatorChain<TLastTransformer> that will effectively be a pipeline. Name this pipeline, as you will then append the trainer to the EstimatorChain. Add this as the next line of code:

var pipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: DefaultColumnNames.Features, inputColumnName: nameof(SentimentData.SentimentText))

Warning

ML.NET Version 0.10 changed the order of the Transform parameters. This will not error out until you run the application and build the model. Use the parameter names for Transforms as illustrated in the previous code snippet.

This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model.
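To give a feel for what "text into a numeric vector" means, here is a deliberately simplified bag-of-words sketch in plain C#. It is not what FeaturizeText does internally; it only counts word occurrences against a small vocabulary. The Tokenize, BuildVocabulary, and Featurize helpers are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BagOfWordsSketch
{
    // Lowercases the text and splits it on spaces and basic punctuation.
    public static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);

    // Builds a vocabulary that maps each distinct word to a vector index.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> texts)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (string word in texts.SelectMany(Tokenize))
        {
            if (!vocabulary.ContainsKey(word))
            {
                int index = vocabulary.Count;
                vocabulary.Add(word, index);
            }
        }
        return vocabulary;
    }

    // Counts how often each vocabulary word occurs in the text.
    // Words that are not in the vocabulary are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (string word in Tokenize(text))
        {
            if (vocabulary.TryGetValue(word, out int index))
                vector[index] += 1f;
        }
        return vector;
    }

    public static void Main()
    {
        string[] comments =
        {
            "I love the wait staff here. They rock.",
            "This place has the worst soup."
        };
        Dictionary<string, int> vocabulary = BuildVocabulary(comments);
        float[] features = Featurize("the worst soup", vocabulary);
        Console.WriteLine(string.Join(" ", features));
    }
}
```

The trainer never sees the raw string, only a fixed-length vector like this one, which is why the featurization step has to come before the learner in the pipeline.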

Choose a learning algorithm

To add the trainer, call the mlContext.BinaryClassification.Trainers.FastTree wrapper method which returns a FastTreeBinaryClassificationTrainer object. This is a decision tree learner you'll use in this pipeline. The FastTreeBinaryClassificationTrainer is appended to the pipeline and accepts the featurized SentimentText (Features) and the Label input parameters to learn from the historic data.

Add the following code to the BuildAndTrainModel method:

.Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

Train the model

You train the model, TransformerChain<TLastTransformer>, based on the dataset that has been loaded and transformed. Once the estimator has been defined, you train your model using the Fit method while providing the already loaded training data. This returns a model to use for predictions. pipeline.Fit() trains the pipeline and returns a Transformer based on the DataView passed in. The experiment is not executed until the .Fit() method runs.

Add the following code to the BuildAndTrainModel method:

Console.WriteLine("=============== Create and Train the Model ===============");
var model = pipeline.Fit(splitTrainSet);
Console.WriteLine("=============== End of training ===============");
Console.WriteLine();
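The split between defining a pipeline and training it can be illustrated with a toy estimator and transformer pair. This is a minimal sketch of the pattern only: CenteringEstimator and CenteringTransformer are hypothetical classes invented for this example, not ML.NET types. The estimator computes nothing until Fit is called, and Fit returns an object that applies what was learned.

```csharp
using System;
using System.Linq;

// Hypothetical transformer: applies what was learned during Fit.
public class CenteringTransformer
{
    private readonly double _mean;

    public CenteringTransformer(double mean)
    {
        _mean = mean;
    }

    // Subtracts the learned mean from every value.
    public double[] Transform(double[] values) =>
        values.Select(v => v - _mean).ToArray();
}

// Hypothetical estimator: learns nothing until Fit is called.
public class CenteringEstimator
{
    // Learns the mean of the training values and returns a transformer that uses it.
    public CenteringTransformer Fit(double[] trainingValues) =>
        new CenteringTransformer(trainingValues.Average());
}

public static class FitSketch
{
    public static void Main()
    {
        var estimator = new CenteringEstimator();                             // nothing computed yet
        CenteringTransformer model = estimator.Fit(new[] { 1.0, 2.0, 3.0 });  // learns the mean, 2
        Console.WriteLine(string.Join(", ", model.Transform(new[] { 4.0, 5.0 })));
    }
}
```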

Return the trained model to use for evaluation

At this point, you have a model of type TransformerChain<TLastTransformer> that can be integrated into any of your existing or new .NET applications. Return the model at the end of the BuildAndTrainModel method.

return model;

Evaluate the model

Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. In the Evaluate method, the model created in BuildAndTrainModel is passed in to be evaluated. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:

public static void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
{

}

The Evaluate method executes the following tasks:

  • Loads the test dataset.
  • Creates the binary classification evaluator.
  • Evaluates the model and creates metrics.
  • Displays the metrics.

Add a call to the new method from the Main method, right under the BuildAndTrainModel method call, using the following code:

Evaluate(mlContext, model, splitDataView.TestSet);

Next, you'll use the machine learning model parameter (a transformer) and the splitTestSet parameter to input the features and return predictions. Add the following code to the Evaluate method as the next line:

Console.WriteLine("=============== Evaluating Model accuracy with Test data ===============");
IDataView predictions = model.Transform(splitTestSet);

The mlContext.BinaryClassification.Evaluate method computes the quality metrics for the model using the specified dataset. It returns a CalibratedBinaryClassificationMetrics object that contains the overall metrics computed by binary classification evaluators. To display these to determine the quality of the model, you need to get the metrics first. Add the following code as the next line in the Evaluate method:

CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");

Displaying the metrics for model validation

Use the following code to display the metrics, share the results, and then act on them:

Console.WriteLine();
Console.WriteLine("Model quality metrics evaluation");
Console.WriteLine("--------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.Auc:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
Console.WriteLine("=============== End of model evaluation ===============");
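Two of these metrics have simple standard definitions that can be computed by hand from a confusion matrix. The following minimal sketch uses made-up counts, not results from this tutorial: accuracy is the share of correct predictions, and the F1 score is the harmonic mean of precision and recall. The MetricsSketch class and its method names are hypothetical. AUC is left out because it needs the full ranking of scores rather than four counts.

```csharp
using System;

public static class MetricsSketch
{
    // Share of all predictions that were correct.
    public static double Accuracy(int tp, int tn, int fp, int fn) =>
        (double)(tp + tn) / (tp + tn + fp + fn);

    // Harmonic mean of precision and recall for the positive class.
    public static double F1Score(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp); // correct share of predicted positives
        double recall = (double)tp / (tp + fn);    // found share of actual positives
        return 2 * precision * recall / (precision + recall);
    }

    public static void Main()
    {
        // Made-up confusion matrix counts for 200 test comments.
        int tp = 80, tn = 75, fp = 25, fn = 20;

        Console.WriteLine($"Accuracy: {Accuracy(tp, tn, fp, fn):P2}");
        Console.WriteLine($"F1Score: {F1Score(tp, fp, fn):P2}");
    }
}
```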

To save your model to a .zip file before returning, add the following code to call the SaveModelAsFile method as the next line in Evaluate:

SaveModelAsFile(mlContext, model);

Save the model as a .zip file

Create the SaveModelAsFile method, just after the Evaluate method, using the following code:

private static void SaveModelAsFile(MLContext mlContext, ITransformer model)
{

}

The SaveModelAsFile method executes the following tasks:

  • Saves the model as a .zip file.

Next, add the code that saves the model so that it can be reused and consumed in other applications. The mlContext.Model.Save method takes the trained model and a Stream. To save the model as a zip file, you create a FileStream for the path in the _modelPath global field immediately before calling Save. Add the following code to the SaveModelAsFile method as the next line:

using (var fs = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
    mlContext.Model.Save(model, fs);

You could also display where the file was written by writing a console message with the _modelPath, using the following code:

Console.WriteLine("The model is saved to {0}", _modelPath);
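The stream handling used for saving, and later for loading, follows a general .NET pattern that can be tried without ML.NET. The following minimal sketch writes some stand-in bytes through a FileStream opened with the same FileMode and FileAccess options as the tutorial code, then reads them back through a read-only stream. The Save and Load helpers and the temporary file name are assumptions for illustration; a real model file is written by mlContext.Model.Save, not by this code.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRoundTripSketch
{
    // Writes the payload to the path, replacing any existing file.
    public static void Save(string path, byte[] payload)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Write))
        {
            fs.Write(payload, 0, payload.Length);
        } // The stream is flushed and closed when the using block ends.
    }

    // Reads the whole file back through a read-only stream.
    public static byte[] Load(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var buffer = new MemoryStream())
        {
            fs.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    public static void Main()
    {
        // Hypothetical temporary file that stands in for the model file.
        string path = Path.Combine(Path.GetTempPath(), "sketch-model.bin");

        Save(path, Encoding.UTF8.GetBytes("stand-in for model bytes"));
        Console.WriteLine(Encoding.UTF8.GetString(Load(path)));

        File.Delete(path);
    }
}
```

The using statement matters here: the file is only guaranteed to be complete on disk after the stream has been disposed.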

Predict the test data outcome with the saved model

Create the UseModelWithSingleItem method, just after the Evaluate method, using the following code:

private static void UseModelWithSingleItem(MLContext mlContext, ITransformer model)
{

}

The UseModelWithSingleItem method executes the following tasks:

  • Creates a single comment of test data.
  • Predicts sentiment based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Add a call to the new method from the Main method, right under the Evaluate method call, using the following code:

UseModelWithSingleItem(mlContext, model);

While the model is a transformer that operates on many rows of data, a very common production scenario is a need for predictions on individual examples. The PredictionEngine<TSrc,TDst> is a wrapper that is returned from the CreatePredictionEngine method. Add the following code to create the PredictionEngine as the first line in the UseModelWithSingleItem method:

PredictionEngine<SentimentData, SentimentPrediction> predictionFunction = model.CreatePredictionEngine<SentimentData, SentimentPrediction>(mlContext);

Add a comment to test the trained model's prediction in the UseModelWithSingleItem method by creating an instance of SentimentData:

SentimentData sampleStatement = new SentimentData
{
    SentimentText = "This was a very bad steak"
};

You can use that to predict the positive or negative sentiment of a single instance of the comment data. To get a prediction, use Predict on the data. Note that the input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions.

var resultprediction = predictionFunction.Predict(sampleStatement);

Use the model: prediction

Display SentimentText and corresponding sentiment prediction in order to share the results and act on them accordingly. This is called operationalization, using the returned data as part of the operational policies. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine();
Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");

Console.WriteLine();
Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Positive" : "Negative")} | Probability: {resultprediction.Probability} ");

Console.WriteLine("=============== End of Predictions ===============");
Console.WriteLine();

Deploy and Predict with a loaded model

Create the UseLoadedModelWithBatchItems method, just before the SaveModelAsFile method, using the following code:

public static void UseLoadedModelWithBatchItems(MLContext mlContext)
{

}

The UseLoadedModelWithBatchItems method executes the following tasks:

  • Creates batch test data.
  • Predicts sentiment based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Add a call to the new method from the Main method, right under the UseModelWithSingleItem method call, using the following code:

UseLoadedModelWithBatchItems(mlContext);

Add some comments to test the trained model's predictions in the UseLoadedModelWithBatchItems method:

IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "This was a horrible meal"
    },
    new SentimentData
    {
        SentimentText = "I love this spaghetti."
    }
};

Load the model

ITransformer loadedModel;
using (var stream = new FileStream(_modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    loadedModel = mlContext.Model.Load(stream);
}

Now that you have a model, you can use it to predict the positive or negative sentiment of the comment data using the Transform method. Note that the input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions. Add the following code to the UseLoadedModelWithBatchItems method for the predictions:

IDataView sentimentStreamingDataView = mlContext.Data.LoadFromEnumerable(sentiments);

IDataView predictions = loadedModel.Transform(sentimentStreamingDataView);

// Use model to predict whether comment data is Positive (1) or Negative (0).
IEnumerable<SentimentPrediction> predictedResults = mlContext.Data.CreateEnumerable<SentimentPrediction>(predictions, reuseRowObject: false);

Use the loaded model for prediction

Display SentimentText and corresponding sentiment prediction in order to share the results and act on them accordingly. This is called operationalization, using the returned data as part of the operational policies. Create a header for the results using the following Console.WriteLine() code:

Console.WriteLine();

Console.WriteLine("=============== Prediction Test of loaded model with a multiple samples ===============");

Before displaying the predicted results, combine the sentiment and prediction together to see the original comment with its predicted sentiment. The following code uses the Zip method to make that happen, so add that code next:

IEnumerable<(SentimentData sentiment, SentimentPrediction prediction)> sentimentsAndPredictions = sentiments.Zip(predictedResults, (sentiment, prediction) => (sentiment, prediction));

Now that you've combined each SentimentData item with its SentimentPrediction in a tuple, you can display the results using the Console.WriteLine() method:

foreach ((SentimentData sentiment, SentimentPrediction prediction) item in sentimentsAndPredictions)
{
    Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(Convert.ToBoolean(item.prediction.Prediction) ? "Positive" : "Negative")} | Probability: {item.prediction.Probability} ");

}
Console.WriteLine("=============== End of predictions ===============");

Because inferred tuple element names are a new feature in C# 7.1 and the default language version of the project is C# 7.0, you need to change the language version to C# 7.1 or higher. To do that, right-click on the project node in Solution Explorer and select Properties. Select the Build tab and select the Advanced button. In the dropdown, select C# 7.1 (or a higher version). Select the OK button.
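The pairing technique can be tried on its own with plain strings and booleans. The following minimal sketch uses Enumerable.Zip to combine two sequences element by element into named tuples. It names the tuple elements explicitly, so it doesn't rely on inferred element names. The ZipSketch class, the Pair helper, and the sample values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipSketch
{
    // Pairs each text with the prediction at the same position.
    // Zip stops at the end of the shorter sequence.
    public static List<(string Text, bool Positive)> Pair(
        IEnumerable<string> texts, IEnumerable<bool> predictions) =>
        texts.Zip(predictions, (text, positive) => (Text: text, Positive: positive)).ToList();

    public static void Main()
    {
        string[] texts = { "This was a horrible meal", "I love this spaghetti." };
        bool[] predictions = { false, true }; // made-up predictions

        foreach ((string Text, bool Positive) item in Pair(texts, predictions))
        {
            Console.WriteLine($"Sentiment: {item.Text} | Prediction: {(item.Positive ? "Positive" : "Negative")}");
        }
    }
}
```

Zip relies on both sequences being in the same order, which holds in the tutorial because the predictions are produced row by row from the same input sequence.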

Results

Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings, or processing messages. These have been removed from the following results for clarity.

Model quality metrics evaluation
--------------------------------
Accuracy: 79.14%
Auc: 86.27%
F1Score: 80.60%

=============== End of model evaluation ===============
The model is saved to C:\Tutorials\SentimentAnalysis\bin\Debug\netcoreapp2.1\Data\Model.zip

=============== Prediction Test of model with a single sample and test dataset ===============

Sentiment: This was a very bad steak | Prediction: Negative | Probability: 0.4641322
=============== End of Predictions ===============


=============== Prediction Test of loaded model with a multiple samples ===============

Sentiment: This was a horrible meal | Prediction: Negative | Probability: 0.1391833
Sentiment: I love this spaghetti. | Prediction: Positive | Probability: 0.9819039
=============== End of predictions ===============

=============== End of process ===============
Press any key to continue . . .

Congratulations! You've now successfully built a machine learning model for classifying and predicting message sentiment.

Building successful models is an iterative process. This model initially has lower quality because the tutorial uses small datasets to provide quick model training. If you aren't satisfied with the model quality, you can try to improve it by providing larger training datasets or by choosing different training algorithms with different hyper-parameters for each algorithm.

You can find the source code for this tutorial at the dotnet/samples repository.

Next steps

In this tutorial, you learned how to:

  • Understand the problem
  • Select the appropriate machine learning algorithm
  • Prepare your data
  • Transform the data
  • Train the model
  • Evaluate the model
  • Predict with the trained model
  • Deploy and Predict with a loaded model

Advance to the next tutorial to learn more