Tutorial: Predict prices using a regression learner with ML.NET

This tutorial illustrates how to use ML.NET to build a regression model for predicting prices, specifically, New York City taxi fares.

Note

This topic refers to ML.NET, which is currently in Preview, and material may be subject to change. For more information, see the ML.NET introduction.

This tutorial and related sample are currently using ML.NET version 0.11. For more information, see the release notes at the dotnet/machinelearning GitHub repo.

In this tutorial, you learn how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare and understand the data
  • Create a learning pipeline
  • Load and transform the data
  • Choose a learning algorithm
  • Train the model
  • Evaluate the model
  • Use the model for predictions

Prerequisites

  • Visual Studio 2017 with the .NET Core cross-platform development workload installed.

Understand the problem

This problem is about predicting the fare of a taxi trip in New York City. At first glance, it may seem to depend simply on the distance traveled. However, taxi vendors in New York charge varying amounts for other factors such as additional passengers or paying with a credit card instead of cash.

Select the appropriate machine learning task

You want to predict the price, which is a real value, based on the other factors in the data set. To do that, you choose a regression machine learning task.

Create a console application

  1. Open Visual Studio 2017. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "TaxiFarePrediction" and then select the OK button.

  2. Create a directory named Data in your project to store the data set and model files:

    In Solution Explorer, right-click the project and select Add > New Folder. Type "Data" and hit Enter.

  3. Install the Microsoft.ML NuGet Package:

    In Solution Explorer, right-click the project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, select that package in the list, and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare and understand the data

  1. Download the taxi-fare-train.csv and the taxi-fare-test.csv data sets and save them to the Data folder you created in the previous step. We use these data sets to train the machine learning model and then evaluate how accurate the model is. These data sets are originally from the NYC TLC Taxi Trip data set.

  2. In Solution Explorer, right-click each of the *.csv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

  3. Open the taxi-fare-train.csv data set and look at column headers in the first row. Take a look at each of the columns. Understand the data and decide which columns are features and which one is the label.

The label is the identifier of the column you want to predict. The identified features are used to predict the label.

The provided data set contains the following columns:

  • vendor_id: The ID of the taxi vendor is a feature.
  • rate_code: The rate type of the taxi trip is a feature.
  • passenger_count: The number of passengers on the trip is a feature.
  • trip_time_in_secs: The amount of time the trip took. You want to predict the fare of the trip before the trip is completed. At that moment you don't know how long the trip would take. Thus, the trip time is not a feature and you'll exclude this column from the model.
  • trip_distance: The distance of the trip is a feature.
  • payment_type: The payment method (cash or credit card) is a feature.
  • fare_amount: The total taxi fare paid is the label.

Create data classes

Create classes for the input data and the predictions:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to TaxiTrip.cs. Then, select the Add button.

  3. Add the following using directive to the new file:

    using Microsoft.ML.Data;
    

Remove the existing class definition and add the following code, which has two classes TaxiTrip and TaxiTripFarePrediction, to the TaxiTrip.cs file:

public class TaxiTrip
{
    [LoadColumn(0)]
    public string VendorId;

    [LoadColumn(1)]
    public string RateCode;

    [LoadColumn(2)]
    public float PassengerCount;

    [LoadColumn(3)]
    public float TripTime;

    [LoadColumn(4)]
    public float TripDistance;

    [LoadColumn(5)]
    public string PaymentType;

    [LoadColumn(6)]
    public float FareAmount;
}

public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

TaxiTrip is the input data class and has definitions for each of the data set columns. Use the LoadColumnAttribute attribute to specify the indices of the source columns in the data set.

The TaxiTripFarePrediction class represents predicted results. It has a single float field, FareAmount, with a Score ColumnNameAttribute attribute applied. For a regression task, the Score column contains the predicted label values.

Note

Use the float type to represent floating-point values in the input and prediction data classes.

Define data and model paths

Add the following additional using directives to the top of the Program.cs file:

using System;
using System.IO;
using Microsoft.Data.DataView;
using Microsoft.ML;
using Microsoft.ML.Data;

You need to create three fields to hold the paths to the files with data sets and the file to save the model:

  • _trainDataPath contains the path to the file with the data set used to train the model.
  • _testDataPath contains the path to the file with the data set used to evaluate the model.
  • _modelPath contains the path to the file where the trained model is stored.

Add the following code right above the Main method to specify those paths:

static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "taxi-fare-train.csv");
static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "taxi-fare-test.csv");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");

When building a model with ML.NET, you start by creating an MLContext. It is conceptually comparable to DbContext in Entity Framework. The MLContext provides an environment for your machine learning job that can be used for exception tracking and logging.

Initialize variables in Main

Create a variable called mlContext and initialize it with a new instance of MLContext. Replace the Console.WriteLine("Hello World!") line with the following code in the Main method:

MLContext mlContext = new MLContext(seed: 0);

Add the following as the next line of code in the Main method to call the Train method:

var model = Train(mlContext, _trainDataPath);

The Train method executes the following tasks:

  • Loads the data.
  • Extracts and transforms the data.
  • Trains the model.
  • Saves the model as .zip file.
  • Returns the model.

The Train method trains the model. Create that method just below Main, using the following code:

public static ITransformer Train(MLContext mlContext, string dataPath)
{

}

We are passing two parameters into the Train method: an MLContext for the context (mlContext) and a string for the path to the training dataset (dataPath). Because the path is a parameter, the same method can train on a different dataset.

Load and transform data

Load the data using the MLContext.Data.LoadFromTextFile wrapper for the LoadFromTextFile method. It returns an IDataView.

As the input and output of Transforms, a DataView is the fundamental data pipeline type, comparable to IEnumerable for LINQ.

In ML.NET, data is similar to a SQL view. It is lazily evaluated, schematized, and heterogeneous. The IDataView object is the first part of the pipeline and loads the data. For this tutorial, it loads a dataset with taxi trip pricing information, which is used to create and train the model.
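The comparison with IEnumerable can be made concrete with plain LINQ. The following is a minimal sketch and not ML.NET code: the LazyDemo class, its RowsRead counter, and the ReadRows method are hypothetical names introduced only for illustration. It shows that declaring a query does no work, and that rows are produced only when something enumerates the result, which is the same deferred behavior described for the data view.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Counts how many rows the source has actually produced so far.
    public static int RowsRead;

    // Hypothetical row source: yields values one at a time, on demand.
    public static IEnumerable<float> ReadRows()
    {
        foreach (var distance in new[] { 1.2f, 3.75f, 0.8f })
        {
            RowsRead++;
            yield return distance;
        }
    }

    public static void Main()
    {
        RowsRead = 0;

        // Declaring the query reads nothing yet.
        IEnumerable<float> doubled = ReadRows().Select(d => d * 2);
        Console.WriteLine(RowsRead);   // 0

        // Enumerating the query pulls every row through the pipeline.
        float[] result = doubled.ToArray();
        Console.WriteLine(RowsRead);   // 3
    }
}
```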

Add the following code as the first line of the Train method:

IDataView dataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(dataPath, hasHeader: true, separatorChar: ',');

In the next steps we refer to the columns by the names defined in the TaxiTrip class.

When the model is trained and evaluated, by default, the values in the Label column are considered as correct values to be predicted. As we want to predict the taxi trip fare, copy the FareAmount column into the Label column. To do that, use the CopyColumnsEstimator transformation class, and add the following code:

var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "FareAmount")

The algorithm that trains the model requires numeric features, so you have to transform the values of the categorical columns (VendorId, RateCode, and PaymentType) into numbers (VendorIdEncoded, RateCodeEncoded, and PaymentTypeEncoded). To do that, use the Microsoft.ML.Transforms.OneHotEncodingTransformer transformation class, which assigns different numeric key values to the different values in each of the columns, and add the following code:

.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "VendorIdEncoded", inputColumnName: "VendorId"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "RateCodeEncoded", inputColumnName: "RateCode"))
.Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "PaymentTypeEncoded", inputColumnName: "PaymentType"))
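To make the idea behind one-hot encoding concrete, here is a minimal sketch in plain C#. It is not how ML.NET implements the transform: the OneHotSketch class and its BuildKeys and Encode methods are hypothetical, and the category values other than "CRD" are made up for illustration. The sketch assigns each distinct category a numeric key and then represents a value as a vector with a single 1 at the position of its key.

```csharp
using System;
using System.Collections.Generic;

public static class OneHotSketch
{
    // Assigns each distinct category a key (0, 1, 2, ...) in order of first appearance.
    public static Dictionary<string, int> BuildKeys(IEnumerable<string> values)
    {
        var keys = new Dictionary<string, int>();
        foreach (var value in values)
        {
            if (!keys.ContainsKey(value))
                keys.Add(value, keys.Count);
        }
        return keys;
    }

    // Turns one category into a vector with a single 1 at the position of its key.
    public static float[] Encode(Dictionary<string, int> keys, string value)
    {
        var vector = new float[keys.Count];
        int index;
        if (keys.TryGetValue(value, out index))
            vector[index] = 1f;   // categories not seen in BuildKeys stay all zeros
        return vector;
    }

    public static void Main()
    {
        // Illustrative payment type values, not taken from the data set.
        var keys = BuildKeys(new[] { "CRD", "CSH", "CRD", "UNK" });
        Console.WriteLine(string.Join(", ", Encode(keys, "CSH")));   // 0, 1, 0
    }
}
```

The encoded vector has one slot per distinct category, so a column with three distinct values contributes three numeric features.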

The last step in data preparation combines all of the feature columns into the Features column using the mlContext.Transforms.Concatenate transformation class. By default, a learning algorithm processes only features from the Features column. Add the following code:

.Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "RateCodeEncoded", "PassengerCount", "TripDistance", "PaymentTypeEncoded"))
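Conceptually, concatenation places the values of several columns side by side in one flat vector. The following sketch illustrates that idea only; the ConcatSketch class and its Concatenate helper are hypothetical and unrelated to the ML.NET implementation, and the sample values are invented.

```csharp
using System;
using System.Linq;

public static class ConcatSketch
{
    // Joins several per-column vectors into one flat feature vector, in the order given.
    public static float[] Concatenate(params float[][] columns)
    {
        return columns.SelectMany(column => column).ToArray();
    }

    public static void Main()
    {
        float[] vendorIdEncoded = { 0f, 1f };   // an encoded categorical column (illustrative)
        float[] passengerCount = { 1f };         // a scalar column treated as a length-1 vector
        float[] tripDistance = { 3.75f };

        float[] features = Concatenate(vendorIdEncoded, passengerCount, tripDistance);
        Console.WriteLine(features.Length);   // 4
    }
}
```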

Choose a learning algorithm

After adding the data to the pipeline and transforming it into the correct input format, we select a learning algorithm (learner). The learner trains the model. We chose a regression task for this problem, so we use a FastTreeRegressionTrainer learner, which is one of the regression learners provided by ML.NET.

The FastTreeRegressionTrainer training algorithm utilizes gradient boosting. Gradient boosting is a machine learning technique for regression problems. It builds each regression tree in a step-wise fashion. It uses a pre-defined loss function to measure the error in each step and correct for it in the next. The result is a prediction model that is actually an ensemble of weaker prediction models. For more information about gradient boosting, see Boosted Decision Tree Regression.
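The step-wise idea can be sketched in a few dozen lines of plain C#. This is a deliberately simplified illustration and not the FastTree implementation: it uses one-split trees ("stumps") on a single feature, squared loss, and a fixed learning rate, and every name in it (BoostingSketch, Stump, FitStump, Train, the rounds and learningRate parameters) is hypothetical. Each round fits a new stump to the errors left by the ensemble so far and adds it to the ensemble.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BoostingSketch
{
    // A one-split tree: predicts Left when x <= Threshold, otherwise Right.
    public class Stump
    {
        public float Threshold;
        public float Left;
        public float Right;
        public float Predict(float x) { return x <= Threshold ? Left : Right; }
    }

    // Finds the stump with the lowest squared error against the given targets.
    public static Stump FitStump(float[] x, float[] target)
    {
        Stump best = null;
        double bestError = double.MaxValue;
        foreach (float threshold in x)
        {
            // The left side always contains at least the point equal to the threshold.
            var left = target.Where((t, i) => x[i] <= threshold).ToArray();
            var right = target.Where((t, i) => x[i] > threshold).ToArray();
            float leftMean = left.Average();
            float rightMean = right.Length > 0 ? right.Average() : leftMean;
            double error = left.Sum(t => (double)(t - leftMean) * (t - leftMean))
                         + right.Sum(t => (double)(t - rightMean) * (t - rightMean));
            if (error < bestError)
            {
                bestError = error;
                best = new Stump { Threshold = threshold, Left = leftMean, Right = rightMean };
            }
        }
        return best;
    }

    // Builds the ensemble: each new stump is fit to the residuals of the ensemble so far.
    public static Func<float, float> Train(float[] x, float[] y, int rounds, float learningRate)
    {
        float baseline = y.Average();
        var stumps = new List<Stump>();
        var prediction = Enumerable.Repeat(baseline, y.Length).ToArray();
        for (int round = 0; round < rounds; round++)
        {
            // For squared loss, the error to correct is the residual: label minus prediction.
            var residuals = y.Select((label, i) => label - prediction[i]).ToArray();
            var stump = FitStump(x, residuals);
            stumps.Add(stump);
            for (int i = 0; i < x.Length; i++)
                prediction[i] += learningRate * stump.Predict(x[i]);
        }
        return input => baseline + stumps.Sum(s => learningRate * s.Predict(input));
    }

    public static void Main()
    {
        // Invented data: short trips cost 5, long trips cost 20.
        float[] distance = { 1f, 2f, 3f, 4f, 5f, 6f };
        float[] fare = { 5f, 5f, 5f, 20f, 20f, 20f };
        var model = Train(distance, fare, rounds: 20, learningRate: 0.5f);
        Console.WriteLine(model(2f));   // approximately 5
        Console.WriteLine(model(5f));   // approximately 20
    }
}
```

No single stump in the ensemble is a good predictor on its own here, since each one only corrects part of the remaining error, but their sum converges on the labels. That is the sense in which the final model is an ensemble of weaker models.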

Add the following code into the Train method to add the FastTreeRegressionTrainer to the data processing code added in the previous step:

.Append(mlContext.Regression.Trainers.FastTree());

Train the model

The final step is to train the model. Once the estimator pipeline has been defined, train it by calling the Fit method with the training data that has already been loaded. pipeline.Fit() trains the pipeline and returns a transformer (of type TransformerChain) based on the IDataView passed in; this is the model you use for predictions. Nothing in the pipeline is executed until Fit is called.

var model = pipeline.Fit(dataView);

Save the model

At this point, you have a model of type TransformerChain that can be integrated into any of your existing or new .NET applications. To save the model to a .zip file, add the following code at the end of the Train method:

SaveModelAsFile(mlContext, model);
return model;

Save the model as a .zip file

Create the SaveModelAsFile method, just after the Train method, using the following code:

private static void SaveModelAsFile(MLContext mlContext, ITransformer model)
{

}

The SaveModelAsFile method executes the following tasks:

  • Saves the model as a .zip file.

We need a method that saves the model so that it can be reused and consumed in other applications. The ITransformer has a SaveTo(IHostEnvironment, Stream) method that takes an environment and a Stream; the code below calls mlContext.Model.Save, passing the model and a Stream. Since we want to save the model as a zip file at the path held in the _modelPath field, we create the FileStream immediately before calling Save. Add the following code to the SaveModelAsFile method:

using (var fileStream = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
    mlContext.Model.Save(model, fileStream);

We could also display where the file was written by writing a console message with the _modelPath, using the following code:

Console.WriteLine("The model is saved to {0}", _modelPath);

Evaluate the model

Evaluation is the process of checking how well the model predicts label values. It's important that the model makes good predictions on data that was not used to train the model. One way to do this is to split the data into training and test data sets, as it's done in this tutorial. Now that you've trained the model on the training data, you can see how well it performs on the test data.
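This tutorial ships with the data already split into two files, so no splitting code is needed here. For readers who want to see what such a split involves, the following is a minimal sketch in plain C#. The SplitSketch class, its Split method, and the testFraction and seed parameters are hypothetical and are not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the rows with a fixed seed, then holds out the given fraction for testing.
    public static void Split<T>(IList<T> rows, double testFraction, int seed,
                                out List<T> train, out List<T> test)
    {
        var random = new Random(seed);
        var shuffled = rows.OrderBy(row => random.Next()).ToList();
        int testCount = (int)Math.Round(rows.Count * testFraction);
        test = shuffled.Take(testCount).ToList();    // never seen during training
        train = shuffled.Skip(testCount).ToList();
    }

    public static void Main()
    {
        List<int> train, test;
        Split(Enumerable.Range(0, 10).ToList(), 0.2, 0, out train, out test);
        Console.WriteLine(train.Count + " training rows, " + test.Count + " test rows");   // 8 training rows, 2 test rows
    }
}
```

Shuffling before splitting avoids a test set that reflects only the order in which the rows were recorded, and the fixed seed keeps the split reproducible between runs.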

The Evaluate method evaluates the model. To create that method, add the following code below the Train method:

private static void Evaluate(MLContext mlContext, ITransformer model)
{

}

The Evaluate method executes the following tasks:

  • Loads the test dataset.
  • Creates the regression evaluator.
  • Evaluates the model and creates metrics.
  • Displays the metrics.

Add a call to the new method from the Main method, right under the Train method call, using the following code:

Evaluate(mlContext, model);

Load the test dataset using the MLContext.Data.LoadFromTextFile wrapper. You can evaluate the model using this dataset as a quality check. Add the following code to the Evaluate method:

IDataView dataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(_testDataPath, hasHeader: true, separatorChar: ',');

Next, use the machine learning model parameter (a transformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:

var predictions = model.Transform(dataView);

The RegressionContext.Evaluate method computes the quality metrics for the model using the specified dataset. It returns a RegressionMetrics object that contains the overall metrics computed by regression evaluators. To display these metrics and judge the quality of the model, you need to get them first. Add the following code as the next line in the Evaluate method:

var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");

Add the following code to print a header for the evaluation metrics:

Console.WriteLine();
Console.WriteLine($"*************************************************");
Console.WriteLine($"*       Model quality metrics evaluation         ");
Console.WriteLine($"*------------------------------------------------");

RSquared is one of the evaluation metrics for regression models. It is at most 1 and usually falls between 0 and 1. The closer its value is to 1, the better the model is. Add the following code into the Evaluate method to display the RSquared value:

Console.WriteLine($"*       R2 Score:      {metrics.RSquared:0.##}");

RMS (root mean squared error) is another evaluation metric for regression models. The lower it is, the better the model is. Add the following code into the Evaluate method to display the RMS value:

Console.WriteLine($"*       RMS loss:      {metrics.Rms:#.##}");
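To show what the two metrics measure, the following sketch computes them by hand from arrays of labels and scores, using the standard textbook definitions. It is an illustration only: the MetricsSketch class and its methods are hypothetical, the sample values are invented, and the ML.NET evaluator may differ in details such as numeric types.

```csharp
using System;
using System.Linq;

public static class MetricsSketch
{
    // Root mean squared error: square root of the average squared difference.
    public static double Rms(double[] labels, double[] scores)
    {
        return Math.Sqrt(labels.Zip(scores, (y, s) => (y - s) * (y - s)).Average());
    }

    // R squared: 1 minus the ratio of the residual sum of squares
    // to the total sum of squares around the mean label.
    public static double RSquared(double[] labels, double[] scores)
    {
        double mean = labels.Average();
        double residual = labels.Zip(scores, (y, s) => (y - s) * (y - s)).Sum();
        double total = labels.Sum(y => (y - mean) * (y - mean));
        return 1 - residual / total;
    }

    public static void Main()
    {
        double[] actualFares = { 10, 20, 30 };      // invented labels
        double[] predictedFares = { 12, 18, 30 };   // invented scores
        Console.WriteLine(Rms(actualFares, predictedFares));        // about 1.63
        Console.WriteLine(RSquared(actualFares, predictedFares));   // about 0.96
    }
}
```

RMS is expressed in the same unit as the label (here, the fare), while RSquared is unitless, which is why the two are usually reported together.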

Use the model for predictions

Predict the test data outcome with the model and a single trip

Create the TestSinglePrediction method, just after the Evaluate method, using the following code:

private static void TestSinglePrediction(MLContext mlContext)
{

}

The TestSinglePrediction method executes the following tasks:

  • Creates a single trip of test data.
  • Predicts fare amount based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Add a call to the new method from the Main method, right under the Evaluate method call, using the following code:

TestSinglePrediction(mlContext);

Since we want to load the model from the zip file we saved, we'll create the FileStream immediately before calling the Load method. Add the following code to the TestSinglePrediction method as the next line:

ITransformer loadedModel;
using (var stream = new FileStream(_modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    loadedModel = mlContext.Model.Load(stream);
}

While the model is a transformer that operates on many rows of data, a very common production scenario is the need for predictions on individual examples. The PredictionEngine<TSrc,TDst> is a wrapper that is returned from the CreatePredictionEngine method. Add the following code to create the PredictionEngine as the next line in the TestSinglePrediction method:

var predictionFunction = loadedModel.CreatePredictionEngine<TaxiTrip, TaxiTripFarePrediction>(mlContext);

This tutorial uses one test trip within this class. Later you can add other scenarios to experiment with the model. Add a trip to test the trained model's prediction of cost in the TestSinglePrediction method by creating an instance of TaxiTrip:

var taxiTripSample = new TaxiTrip()
{
    VendorId = "VTS",
    RateCode = "1",
    PassengerCount = 1,
    TripTime = 1140,
    TripDistance = 3.75f,
    PaymentType = "CRD",
    FareAmount = 0 // To predict. Actual/Observed = 15.5
};

We can use that to predict the fare based on a single instance of the taxi trip data. To get a prediction, use Predict on the data. Note that the input is a single TaxiTrip instance holding raw column values, and the model includes the featurization. Your pipeline is in sync during training and prediction. You didn't have to write preprocessing/featurization code specifically for predictions, and the same API takes care of both batch and one-time predictions.

var prediction = predictionFunction.Predict(taxiTripSample);

To display the predicted fare of the specified trip, add the following code into the TestSinglePrediction method:

Console.WriteLine($"**********************************************************************");
Console.WriteLine($"Predicted fare: {prediction.FareAmount:0.####}, actual fare: 15.5");
Console.WriteLine($"**********************************************************************");

Run the program to see the predicted taxi fare for your test case.

Congratulations! You've now successfully built a machine learning model for predicting taxi trip fares, evaluated its accuracy, and used it to make predictions. You can find the source code for this tutorial at the dotnet/samples GitHub repository.

Next steps

In this tutorial, you learned how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare and understand the data
  • Create a learning pipeline
  • Load and transform the data
  • Choose a learning algorithm
  • Train the model
  • Evaluate the model
  • Use the model for predictions

Advance to the next tutorial to learn more.