Tutorial: Use ML.NET in a sentiment analysis binary classification scenario

Note

This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, visit the ML.NET introduction.

This sample tutorial illustrates using ML.NET to create a sentiment classifier via a .NET Core console application using C# in Visual Studio 2017.

In this tutorial, you learn how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare your data
  • Create the learning pipeline
  • Load a classifier
  • Train the model
  • Evaluate the model with a different dataset
  • Predict a single instance of test data outcome with the model
  • Predict the test data outcomes with a loaded model

Sentiment analysis sample overview

The sample is a console app that uses ML.NET to train a model that classifies and predicts sentiment as either positive or negative. It also evaluates the model with a second dataset for quality analysis. The sentiment datasets are from the WikiDetox project.

Prerequisites

Machine learning workflow

This tutorial follows a machine learning workflow that enables the process to move in an orderly fashion.

The workflow phases are as follows:

  1. Understand the problem
  2. Prepare your data
    • Load the data
    • Extract features (Transform your data)
  3. Build and train
    • Train the model
    • Evaluate the model
  4. Run
    • Model consumption

Understand the problem

You first need to understand the problem, so you can break it down to parts that can support building and training the model. Breaking the problem down allows you to predict and evaluate the results.

The problem for this tutorial is to understand incoming website comment sentiment to take the appropriate action.

You can break down the problem to the sentiment text and sentiment value for the data you want to train the model with, and a predicted sentiment value that you can evaluate and then use operationally.

You then need to determine the sentiment, which helps you with the machine learning task selection.

Select the appropriate machine learning task

With this problem, you know the following facts:

Training data: website comments labeled as toxic (1) or not toxic (0). This label is the sentiment.

Goal: predict the sentiment of a new website comment, either toxic or not toxic, such as in the following examples:

  • Please refrain from adding nonsense to Wikipedia.
  • He is the best, and the article should say that.

The classification machine learning task is best suited for this scenario.

About the classification task

Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data. For example, you can use classification to:

  • Identify sentiment as positive or negative.
  • Classify email as spam, junk, or good.
  • Determine whether a patient's lab sample is cancerous.
  • Categorize customers by their propensity to respond to a sales campaign.

Classification tasks are frequently one of the following types:

  • Binary: either A or B.
  • Multiclass: multiple categories that can be predicted by using a single model.

Create a console application

  1. Open Visual Studio 2017. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "SentimentAnalysis" and then select the OK button.

  2. Create a directory named Data in your project to save your data set files:

    In Solution Explorer, right-click on your project and select Add > New Folder. Type "Data" and hit Enter.

  3. Install the Microsoft.ML NuGet Package:

    In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, select that package in the list, and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare your data

  1. Download the wikipedia-detox-250-line-data.tsv and wikipedia-detox-250-line-test.tsv data sets and save them to the Data folder you created previously. The first dataset trains the machine learning model, and the second is used to evaluate how accurate your model is.

  2. In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using statements to the top of the Program.cs file:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

You need to create three global fields to hold the paths to the recently downloaded files and to the saved model, and a global variable for the TextLoader:

  • _trainDataPath has the path to the dataset used to train the model.
  • _testDataPath has the path to the dataset used to evaluate the model.
  • _modelPath has the path where the trained model is saved.
  • _textLoader is the TextLoader used to load and transform the datasets.

Add the following code to the line right above the Main method to specify those paths and the _textLoader variable:

static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-data.tsv");
static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-test.tsv");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");
static TextLoader _textLoader;

You need to create some classes for your input data and predictions. Add a new class to your project:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to SentimentData.cs. Then, select the Add button.

    The SentimentData.cs file opens in the code editor. Add the following using statement to the top of SentimentData.cs:

using Microsoft.ML.Data;

Remove the existing class definition and add the following code, which defines two classes, SentimentData and SentimentPrediction, to the SentimentData.cs file:

public class SentimentData
{
    [Column(ordinal: "0", name: "Label")]
    public float Sentiment;
    [Column(ordinal: "1")]
    public string SentimentText;
}

public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    [ColumnName("Probability")]
    public float Probability { get; set; }

    [ColumnName("Score")]
    public float Score { get; set; }
}

SentimentData is the input dataset class. It has a float (Sentiment) that holds the sentiment value, either positive or negative, and a string for the comment (SentimentText). Both fields have Column attributes attached to them. This attribute describes the order of each field in the data file, and which field is the Label.

SentimentPrediction is the class used for prediction after the model has been trained. It has a Boolean (Prediction) with a PredictedLabel ColumnName attribute, along with float Probability and Score properties.

The Label is used to create and train the model, and it's also used with a second dataset to evaluate the model. The PredictedLabel is used during prediction and evaluation. For evaluation, an input with test data, the predicted values, and the model are used.

When building a model with ML.NET you start by creating an MLContext. This is comparable conceptually to using DbContext in Entity Framework. The environment provides a context for your ML job that can be used for exception tracking and logging.

Initialize variables in Main

Create a variable called mlContext and initialize it with a new instance of MLContext. Replace the Console.WriteLine("Hello World!") line with the following code in the Main method:

MLContext mlContext = new MLContext(seed: 0);

Next, to set up data loading, initialize the _textLoader global variable so that you can reuse it. Notice that you're using the CreateTextReader method. When you create a TextLoader with CreateTextReader, you pass in a TextLoader.Arguments object, which enables customization.

Specify the data schema by passing an array of TextLoader.Column objects to the loader, containing all the column names and their types. You defined the data schema previously when you created the SentimentData class. In this schema, the first column (Label) is a Boolean (the value to predict) and the second column (SentimentText) is the text feature used for predicting the sentiment. The CreateTextReader method returns a fully initialized TextLoader.

To initialize the _textLoader global variable in order to reuse it for the needed datasets, add the following code after the mlContext initialization:

_textLoader = mlContext.Data.CreateTextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
    {
        new TextLoader.Column("Label", DataKind.Bool, 0),
        new TextLoader.Column("SentimentText", DataKind.Text, 1)
    }
});
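To picture what this schema describes, the following is a minimal sketch in plain C# that reads tab-separated lines with a header row into a label and a comment. It is not how TextLoader works internally; the TsvSketch class, the Parse method, and the sample rows are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class TsvSketch
{
    // Column 0 holds the label ("1" or "0"), column 1 holds the comment text.
    public static List<(bool Label, string SentimentText)> Parse(IEnumerable<string> lines, bool hasHeader)
    {
        return lines
            .Skip(hasHeader ? 1 : 0)              // skip the header row when present
            .Select(line => line.Split('\t'))     // the separator is a tab
            .Select(cols => (Label: cols[0] == "1", SentimentText: cols[1]))
            .ToList();
    }

    public static void Main()
    {
        // Made-up rows, not taken from the tutorial's datasets.
        var lines = new[]
        {
            "Sentiment\tSentimentText",
            "1\tAn illustrative toxic comment",
            "0\tAn illustrative friendly comment"
        };

        foreach (var row in Parse(lines, hasHeader: true))
            Console.WriteLine($"Label={row.Label} | Text={row.SentimentText}");
    }
}
```

The column order in this sketch mirrors the ordinals 0 and 1 used in the TextLoader.Column definitions.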

Add the following as the next line of code in the Main method:

var model = Train(mlContext, _trainDataPath);

The Train method executes the following tasks:

  • Loads the data.
  • Extracts and transforms the data.
  • Trains the model.
  • Returns the model.

Create the Train method, just after the Main method, using the following code:

public static ITransformer Train(MLContext mlContext, string dataPath)
{

}

Notice that two parameters are passed into the Train method: an MLContext for the context (mlContext) and a string for the dataset path (dataPath). You're going to use this method more than once for training and testing.

Load the data

You'll load the data using the _textLoader global variable with the dataPath parameter. It returns an IDataView. As the input and output of Transforms, a DataView is the fundamental data pipeline type, comparable to IEnumerable for LINQ.

In ML.NET, data is similar to a SQL view. It is lazily evaluated, schematized, and heterogeneous. The object is the first part of the pipeline, and loads the data. For this tutorial, it loads a dataset with comments and the corresponding toxic or non-toxic sentiment. This is used to create the model, and train it.

Add the following code as the first line of the Train method:

IDataView dataView = _textLoader.Read(dataPath);
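The comparison to IEnumerable and lazy evaluation can be illustrated with plain LINQ. The sketch below is an analogy only and does not use IDataView; the LazySketch class and its counter are hypothetical. It shows that defining a query runs nothing until the sequence is enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of deferred execution with LINQ; not ML.NET code.
public static class LazySketch
{
    public static int TransformCalls = 0;

    public static IEnumerable<int> BuildPipeline(IEnumerable<string> comments)
    {
        // Defining the query does not run the lambda yet.
        return comments.Select(c => { TransformCalls++; return c.Length; });
    }

    public static void Main()
    {
        var pipeline = BuildPipeline(new[] { "good", "awful" });
        Console.WriteLine($"Calls after defining the pipeline: {TransformCalls}"); // 0

        var lengths = pipeline.ToList(); // enumeration forces evaluation
        Console.WriteLine($"Calls after enumerating: {TransformCalls}"); // 2
    }
}
```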

Extract and transform the data

Pre-processing and cleaning data are important tasks that occur before a dataset is used effectively for machine learning. Raw data is often noisy and unreliable, and may be missing values. Using data without these modeling tasks can produce misleading results.

ML.NET's transform pipelines compose a custom set of transforms that are applied to your data before training or testing. The transforms' primary purpose is data featurization. Machine learning algorithms understand featurized data, so the next step is to transform our textual data into a format that our ML algorithms recognize. That format is a numeric vector.

Next, call mlContext.Transforms.Text.FeaturizeText, which featurizes the text column (SentimentText) into a numeric vector called Features that is used by the machine learning algorithm. This is a wrapper call that returns an EstimatorChain<TLastTransformer> that will effectively be a pipeline. Name this pipeline, because you will then append the trainer to the EstimatorChain. Add this as the next line of code:

var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")

This is the preprocessing/featurization step. Using additional components available in ML.NET can enable better results with your model.
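To make "text into a numeric vector" concrete, here is a deliberately simplified word-count (bag-of-words) sketch in plain C#. It is an assumption-laden illustration, not a description of what FeaturizeText does internally; the BagOfWordsSketch class and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical bag-of-words featurizer for illustration; not the ML.NET implementation.
public static class BagOfWordsSketch
{
    // Lowercases the text and splits it on spaces and basic punctuation.
    public static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);

    // Builds a sorted vocabulary from the training comments.
    public static List<string> BuildVocabulary(IEnumerable<string> comments) =>
        comments.SelectMany(Tokenize).Distinct().OrderBy(w => w, StringComparer.Ordinal).ToList();

    // Maps a comment to a vector of word counts, one slot per vocabulary word.
    public static float[] Featurize(string comment, List<string> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var token in Tokenize(comment))
        {
            int index = vocabulary.IndexOf(token);
            if (index >= 0) vector[index] += 1f; // words outside the vocabulary are ignored
        }
        return vector;
    }

    public static void Main()
    {
        var vocabulary = BuildVocabulary(new[] { "He is the best", "This is a very rude movie" });
        // Vocabulary: a, best, he, is, movie, rude, the, this, very
        var features = Featurize("This movie is rude, very rude!", vocabulary);
        Console.WriteLine(string.Join(",", features)); // prints 0,0,0,1,1,2,0,1,1
    }
}
```

Each comment becomes a fixed-length vector regardless of its length, which is the property a trainer needs from the Features column.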

Choose a learning algorithm

To add the trainer, call the mlContext.BinaryClassification.Trainers.FastTree wrapper method, which returns a FastTreeBinaryClassificationTrainer object. This is a decision tree learner you'll use in this pipeline. The FastTreeBinaryClassificationTrainer is appended to the pipeline and accepts the featurized SentimentText (Features) and the Label input parameters to learn from the historic data.

Add the following code to the Train method:

.Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

Train the model

You train the model, a TransformerChain<TLastTransformer>, on the dataset that has been loaded. Once the estimator has been defined, you train your model by calling the Fit method with the already loaded training data. pipeline.Fit() trains the pipeline and returns a transformer based on the DataView passed in, which you then use for predictions. The experiment is not executed until Fit is called.

Add the following code to the Train method:

Console.WriteLine("=============== Create and Train the Model ===============");
var model = pipeline.Fit(dataView);
Console.WriteLine("=============== End of training ===============");
Console.WriteLine();
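The split between an untrained estimator and the trained transformer that Fit returns can be sketched with two toy classes. These types are hypothetical and are not ML.NET interfaces; the "learning" here is just a threshold placed midway between the class means, chosen only to keep the example short.

```csharp
using System;
using System.Linq;

// Hypothetical types for illustration only; they are not ML.NET interfaces.
public sealed class ThresholdModel
{
    public float Threshold { get; }

    public ThresholdModel(float threshold) { Threshold = threshold; }

    // The trained model maps inputs to predictions without learning anything new.
    public bool[] Transform(float[] scores) => scores.Select(s => s > Threshold).ToArray();
}

public sealed class ThresholdEstimator
{
    // Fit is where learning happens: nothing is computed until it is called.
    public ThresholdModel Fit(float[] scores, bool[] labels)
    {
        float positiveMean = scores.Where((s, i) => labels[i]).Average();
        float negativeMean = scores.Where((s, i) => !labels[i]).Average();
        return new ThresholdModel((positiveMean + negativeMean) / 2f);
    }
}

public static class FitSketch
{
    public static void Main()
    {
        var estimator = new ThresholdEstimator(); // defined, but not yet trained
        var model = estimator.Fit(
            new[] { 0.9f, 0.8f, 0.2f, 0.1f },
            new[] { true, true, false, false });

        var predictions = model.Transform(new[] { 0.7f, 0.3f });
        Console.WriteLine(string.Join(", ", predictions));
    }
}
```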

Save and Return the model trained to use for evaluation

At this point, you have a model of type TransformerChain<TLastTransformer> that can be integrated into any of your existing or new .NET applications. Return the model at the end of the Train method.

return model;

Evaluate the model

Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. In the Evaluate method, the model created in Train is passed in to be evaluated. Create the Evaluate method, just after Train, as in the following code:

public static void Evaluate(MLContext mlContext, ITransformer model)
{

}

The Evaluate method executes the following tasks:

  • Loads the test dataset.
  • Creates the binary evaluator.
  • Evaluates the model and creates metrics.
  • Displays the metrics.

Add a call to the new method from the Main method, right under the Train method call, using the following code:

Evaluate(mlContext, model);

You'll load the test dataset using the previously initialized _textLoader global variable with the _testDataPath global field. You can evaluate the model using this dataset as a quality check. Add the following code to the Evaluate method:

IDataView dataView = _textLoader.Read(_testDataPath);

Next, you'll use the machine learning model parameter (a transformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:

Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
var predictions = model.Transform(dataView);

The BinaryClassificationContext.Evaluate method computes the quality metrics for the model using the specified dataset. It returns a BinaryClassificationEvaluator.CalibratedResult object that contains the overall metrics computed by binary classification evaluators. To display these metrics and determine the quality of the model, you first need to get them. Add the following code as the next line in the Evaluate method:

var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");

Displaying the metrics for model validation

Use the following code to display the metrics, share the results, and then act on them:

Console.WriteLine();
Console.WriteLine("Model quality metrics evaluation");
Console.WriteLine("--------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.Auc:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
Console.WriteLine("=============== End of model evaluation ===============");
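For readers who want to see what sits behind two of these numbers, the sketch below computes accuracy and F1 score from confusion-matrix counts using their standard definitions. The MetricsSketch class and the counts are hypothetical and are not taken from the tutorial's test run; AUC is left out because it requires the ranked scores rather than four counts.

```csharp
using System;

// Hypothetical helper for illustration; ML.NET computes these metrics for you.
public static class MetricsSketch
{
    // Accuracy: the share of all predictions that were correct.
    public static double Accuracy(int tp, int tn, int fp, int fn) =>
        (double)(tp + tn) / (tp + tn + fp + fn);

    // F1 score: the harmonic mean of precision and recall.
    public static double F1(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp);
        double recall = (double)tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void Main()
    {
        // Made-up counts: true positives, true negatives, false positives, false negatives.
        int tp = 9, tn = 8, fp = 1, fn = 0;
        Console.WriteLine($"Accuracy: {Accuracy(tp, tn, fp, fn):P2}");
        Console.WriteLine($"F1Score: {F1(tp, fp, fn):P2}");
    }
}
```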

To save your model to a .zip file before returning, add the following code to call the SaveModelAsFile method as the next line in Evaluate:

SaveModelAsFile(mlContext, model);

Save the model as a .zip file

Create the SaveModelAsFile method, just after the Evaluate method, using the following code:

private static void SaveModelAsFile(MLContext mlContext, ITransformer model)
{

}

The SaveModelAsFile method executes the following tasks:

  • Saves the model as a .zip file.

Next, save the model so that it can be reused and consumed in other applications. The mlContext.Model.Save method takes the trained model and a Stream. To save the model as a .zip file, create a FileStream from the _modelPath global field immediately before calling Save. Add the following code to the SaveModelAsFile method:

using (var fs = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
    mlContext.Model.Save(model, fs);

You could also display where the file was written by writing a console message with the _modelPath, using the following code:

Console.WriteLine("The model is saved to {0}", _modelPath);

Predict the test data outcome with the model and a single comment

Create the Predict method, just after the Evaluate method, using the following code:

private static void Predict(MLContext mlContext, ITransformer model)
{

}

The Predict method executes the following tasks:

  • Creates a single comment of test data.
  • Predicts sentiment based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Add a call to the new method from the Main method, right under the Evaluate method call, using the following code:

Predict(mlContext, model);

While the model is a transformer that operates on many rows of data, a very common production scenario is the need for predictions on individual examples. The PredictionEngine<TSrc,TDst> is a wrapper that is returned from the CreatePredictionEngine method. Add the following code to create the PredictionEngine as the first line in the Predict method:

var predictionFunction = model.CreatePredictionEngine<SentimentData, SentimentPrediction>(mlContext);

Add a comment to test the trained model's prediction in the Predict method by creating an instance of SentimentData:

SentimentData sampleStatement = new SentimentData
{
    SentimentText = "This is a very rude movie"
};

You can use that to predict the Toxic or Non Toxic sentiment of a single instance of the comment data. To get a prediction, use Predict on the data. Note that the input data is a string and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions.

var resultprediction = predictionFunction.Predict(sampleStatement);

Model operationalization: prediction

Display SentimentText and corresponding sentiment prediction in order to share the results and act on them accordingly. This is called operationalization, using the returned data as part of the operational policies. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine();
Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");

Console.WriteLine();
Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
Console.WriteLine("=============== End of Predictions ===============");
Console.WriteLine();

Predict the test data outcomes with the saved model

Create the PredictWithModelLoadedFromFile method, just before the SaveModelAsFile method, using the following code:

public static void PredictWithModelLoadedFromFile(MLContext mlContext)
{

}

The PredictWithModelLoadedFromFile method executes the following tasks:

  • Creates batch test data.
  • Predicts sentiment based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Add a call to the new method from the Main method, right under the Predict method call, using the following code:

PredictWithModelLoadedFromFile(mlContext);

Add some comments to test the trained model's predictions in the PredictWithModelLoadedFromFile method:

IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "This is a very rude movie"
    },
    new SentimentData
    {
        SentimentText = "He is the best, and the article should say that."
    }
};

Load the model

ITransformer loadedModel;
using (var stream = new FileStream(_modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    loadedModel = mlContext.Model.Load(stream);
}
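Saving and loading use the same stream pattern with different FileMode and FileAccess values. The sketch below shows that pattern on arbitrary bytes in plain C#. It does not touch ML.NET, and the StreamSketch class and the temporary file name are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical round trip with FileStream; stands in for saving and loading a model file.
public static class StreamSketch
{
    public static void Save(string path, byte[] payload)
    {
        // FileMode.Create overwrites an existing file or creates a new one.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Write))
            fs.Write(payload, 0, payload.Length);
    }

    public static byte[] Load(string path)
    {
        // FileMode.Open throws if the file does not exist.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var ms = new MemoryStream())
        {
            fs.CopyTo(ms);
            return ms.ToArray();
        }
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "stream-sketch.bin");
        byte[] original = Encoding.UTF8.GetBytes("placeholder model bytes");

        Save(path, original);
        byte[] restored = Load(path);
        Console.WriteLine($"Round trip intact: {original.SequenceEqual(restored)}");

        File.Delete(path);
    }
}
```

The using statements dispose each stream as soon as the block ends, which releases the file handle before the next step opens the same path.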

Now that you have a loaded model, you can use it to predict the Toxic or Non Toxic sentiment of the comment data with the Transform(IDataView) method. Note that the input data is a string and the model includes the featurization, so your pipeline stays in sync during training and prediction. You didn’t have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions. Add the following code to the PredictWithModelLoadedFromFile method for the predictions:

// Wrap the in-memory comments in an IDataView and score them with the loaded model
var sentimentStreamingDataView = mlContext.CreateStreamingDataView(sentiments);
var predictions = loadedModel.Transform(sentimentStreamingDataView);

// Use the model to predict whether comment data is toxic (1) or nice (0).
var predictedResults = predictions.AsEnumerable<SentimentPrediction>(mlContext, reuseRowObject: false);

Model operationalization: prediction

Display SentimentText and corresponding sentiment prediction in order to share the results and act on them accordingly. This is called operationalization, using the returned data as part of the operational policies. Create a header for the results using the following Console.WriteLine() code:

Console.WriteLine();

Console.WriteLine("=============== Prediction Test of loaded model with multiple samples ===============");

Before displaying the predicted results, combine the sentiment and prediction together to see the original comment with its predicted sentiment. The following code uses the Zip method to make that happen, so add that code next:

var sentimentsAndPredictions = sentiments.Zip(predictedResults, (sentiment, prediction) => (sentiment, prediction));

Now that you've combined each comment and its prediction into a tuple, you can display the results using the Console.WriteLine() method:

foreach (var item in sentimentsAndPredictions)
{
    Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(Convert.ToBoolean(item.prediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {item.prediction.Probability} ");
}
Console.WriteLine("=============== End of predictions ===============");

Because inferred tuple element names are a new feature in C# 7.1 and the default language version of the project is C# 7.0, you need to change the language version to C# 7.1 or higher. To do that, right-click on the project node in Solution Explorer and select Properties. Select the Build tab and select the Advanced button. In the dropdown, select C# 7.1 (or a higher version). Select the OK button.
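The difference between explicit and inferred tuple element names can be seen in a small standalone sketch. The TupleSketch class and its sample values are hypothetical. The Pair method uses explicit names, which compile on C# 7.0, while the tuple in Main relies on inferred names and therefore needs C# 7.1 or higher.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of tuple element names; not part of the tutorial project.
public static class TupleSketch
{
    public static List<(string comment, bool isToxic)> Pair(IEnumerable<string> comments, IEnumerable<bool> flags)
    {
        // Explicit element names: valid from C# 7.0 onward.
        // Zip stops at the end of the shorter sequence.
        return comments.Zip(flags, (c, f) => (comment: c, isToxic: f)).ToList();
    }

    public static void Main()
    {
        var comment = "He is the best, and the article should say that.";
        var isToxic = false;

        // Inferred element names: the names come from the variables (C# 7.1 or higher).
        var inferred = (comment, isToxic);
        Console.WriteLine($"{inferred.comment} | {inferred.isToxic}");

        foreach (var item in Pair(new[] { "first", "second" }, new[] { true, false }))
            Console.WriteLine($"{item.comment} | {item.isToxic}");
    }
}
```

If changing the language version is not an option, naming the elements explicitly in the Zip lambda, as Pair does, avoids the dependency on C# 7.1.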

Results

Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings, or processing messages. These have been removed from the following results for clarity.

Model quality metrics evaluation
--------------------------------
Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%
=============== End of model evaluation ===============

=============== Prediction Test of model with a single sample and test dataset ===============

Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.5297049
=============== End of Predictions ===============

=============== New iteration of Model ===============
=============== Create and Train the Model ===============
=============== End of training ===============


The model is saved to: C:\Tutorial\SentimentAnalysis\bin\Debug\netcoreapp2.0\Data\Model.zip

=============== Prediction Test of loaded model with multiple samples ===============

Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.4585565
Sentiment: He is the best, and the article should say that. | Prediction: Not Toxic | Probability: 0.9924279

Congratulations! You've now successfully built a machine learning model for classifying and predicting message sentiment. You can find the source code for this tutorial at the dotnet/samples repository.

Next steps

In this tutorial, you learned how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare your data
  • Create the learning pipeline
  • Load a classifier
  • Train the model
  • Evaluate the model with a different dataset
  • Predict the test data outcomes with the model

Advance to the next tutorial to learn more.