Tutorial: Predict prices using regression with Model Builder

Learn how to use ML.NET Model Builder to build a regression model that predicts prices. The .NET console app you develop in this tutorial predicts taxi fares based on historical New York City taxi fare data.

The Model Builder price prediction template can be used for any scenario that requires a numerical prediction. Example scenarios include house price prediction, demand prediction, and sales forecasting.

In this tutorial, you learn how to:

  • Prepare and understand the data
  • Choose a scenario
  • Load the data
  • Train the model
  • Evaluate the model
  • Use the model for predictions

Note

Model Builder is currently in Preview.

Prerequisites

For a list of prerequisites and installation instructions, visit the Model Builder installation guide.

Create a console application

  1. Create a C# .NET Core Console Application called "TaxiFarePrediction". Make sure Place solution and project in the same directory is unchecked (VS 2019), or Create directory for solution is checked (VS 2017).

Prepare and understand the data

  1. Create a directory named Data in your project to store the data set files.

  2. The data set used to train and evaluate the machine learning model is originally from the NYC TLC Taxi Trip data set.

    1. To download the data set, navigate to the taxi-fare-train.csv download link.

    2. When the page loads, right-click anywhere on the page and select Save as.

    3. Use the Save As dialog to save the file in the Data folder you created in the previous step.

  3. In Solution Explorer, right-click the taxi-fare-train.csv file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Each row in the taxi-fare-train.csv data set contains the details of a single taxi trip.

  1. Open the taxi-fare-train.csv data set.

    The provided data set contains the following columns:

    • vendor_id: The ID of the taxi vendor is a feature.
    • rate_code: The rate type of the taxi trip is a feature.
    • passenger_count: The number of passengers on the trip is a feature.
    • trip_time_in_secs: The amount of time the trip took. You want to predict the fare before the trip is completed, and at that moment you don't know how long the trip will take. The trip time is therefore not a feature, and you exclude this column from the model.
    • trip_distance: The distance of the trip is a feature.
    • payment_type: The payment method (cash or credit card) is a feature.
    • fare_amount: The total taxi fare paid is the label.

The label is the column you want to predict. In a regression task, the goal is to predict a numerical value. In this price prediction scenario, the cost of a taxi ride is being predicted, so fare_amount is the label. The features are the inputs you give the model to predict the label. In this case, all the remaining columns except trip_time_in_secs are used as features to predict the fare amount.
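
The split between features and label described above can be sketched in plain C#. This is only an illustration, not code that Model Builder generates: the TaxiColumns class and its members are hypothetical names, and it assumes the columns appear in the file in the same order as they are listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper: separates the feature columns from the label column.
public static class TaxiColumns
{
    // Assumption: the file's column order matches the list in this tutorial.
    public static readonly string[] Header =
    {
        "vendor_id", "rate_code", "passenger_count", "trip_time_in_secs",
        "trip_distance", "payment_type", "fare_amount"
    };

    public const string Label = "fare_amount";
    public const string Excluded = "trip_time_in_secs";

    // Features are every column except the label and the excluded column.
    public static string[] FeatureNames() =>
        Header.Where(c => c != Label && c != Excluded).ToArray();

    // Splits one comma-separated data row into its feature values and its label.
    public static (Dictionary<string, string> Features, float Label) Split(string csvRow)
    {
        string[] cells = csvRow.Split(',');
        if (cells.Length != Header.Length)
            throw new FormatException(
                $"Expected {Header.Length} columns, got {cells.Length}.");

        var features = new Dictionary<string, string>();
        float label = 0f;
        for (int i = 0; i < Header.Length; i++)
        {
            if (Header[i] == Label)
                label = float.Parse(cells[i], CultureInfo.InvariantCulture);
            else if (Header[i] != Excluded)
                features[Header[i]] = cells[i];
        }
        return (features, label);
    }
}
```

For an illustrative row such as "CMT,1,1,1271,3.8,CRD,17.5", TaxiColumns.Split returns five feature values (the trip time is dropped) and the fare as the label.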

Choose a scenario

To train your model, you need to select from the list of available machine learning scenarios provided by Model Builder. In this case, the scenario is Price Prediction.

  1. In Solution Explorer, right-click the TaxiFarePrediction project, and select Add > Machine Learning.
  2. In the scenario step of the Model Builder tool, select the Price Prediction scenario.

Load the data

Model Builder accepts data from two sources: a SQL Server database or a local file in csv or tsv format.

  1. In the data step of the Model Builder tool, select File from the data source dropdown.
  2. Select the button next to the Select a file text box and use File Explorer to browse to and select taxi-fare-train.csv in the Data directory.
  3. Choose fare_amount in the Column to Predict (Label) dropdown.
  4. Expand the Input Columns (Features) dropdown and uncheck the trip_time_in_secs column to exclude it as a feature during training. Navigate to the train step of the Model Builder tool.

Train the model

The machine learning task used to train the price prediction model in this tutorial is regression. During the model training process, Model Builder trains separate models using different regression algorithms and settings to find the best performing model for your data set.

The time required for the model to train is proportional to the amount of data. Model Builder automatically selects a default value for Time to train (seconds) based on the size of your data source.

  1. Leave the default value for Time to train (seconds) as is, unless you prefer to train for a longer time.
  2. Select Start Training.

Throughout the training process, progress data is displayed in the Progress section of the train step.

  • Status displays the completion status of the training process.
  • Best accuracy displays the accuracy of the best performing model found by Model Builder so far. Higher accuracy means the model predicted more correctly on test data.
  • Best algorithm displays the name of the best performing algorithm found by Model Builder so far.
  • Last algorithm displays the name of the algorithm most recently used by Model Builder to train the model.

Once training is complete, navigate to the evaluate step.

Evaluate the model

The result of the training step is the one model that had the best performance. In the evaluate step of the Model Builder tool, the output section contains the algorithm used by the best performing model in the Best Model entry, along with its metric in Best Model Quality (RSquared). A summary table also lists the top five models and their metrics.

If you're not satisfied with your accuracy metrics, some easy ways to try to improve model accuracy are to increase the amount of time to train the model or to use more data. Otherwise, navigate to the code step.
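
To give a sense of what the RSquared metric measures, here is a minimal sketch of the standard coefficient of determination, R² = 1 − SS_res / SS_tot. The RegressionMetricsSketch class is a hypothetical name used for illustration, and the assumption that Model Builder reports exactly this quantity is not confirmed by this tutorial.

```csharp
using System;
using System.Linq;

// Hypothetical helper: computes the textbook R-squared for a set of predictions.
public static class RegressionMetricsSketch
{
    public static double RSquared(double[] actual, double[] predicted)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Inputs must be non-empty and of equal length.");

        double mean = actual.Average();

        // Sum of squared errors between actual and predicted values.
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();

        // Total variation of the actual values around their mean.
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        if (ssTot == 0)
            throw new ArgumentException("Actual values must not all be identical.");

        return 1.0 - ssRes / ssTot;
    }
}
```

A value of 1 means the predictions match the actual fares exactly, while a value of 0 means the model does no better than always predicting the mean fare.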

Add the code to make predictions

Two projects are created as a result of the training process.

  • TaxiFarePredictionML.ConsoleApp: A .NET Core console application that contains the model training and sample consumption code.
  • TaxiFarePredictionML.Model: A .NET Standard class library containing the data models that define the schema of the model's input and output data, the saved version of the best performing model from training, and a helper class called ConsumeModel to make predictions.

  1. In the code step of the Model Builder tool, select Add Projects to add the auto-generated projects to the solution.

  2. Open the Program.cs file in the TaxiFarePrediction project.

  3. Add the following using directives so that Program.cs can use the types in the TaxiFarePredictionML.Model project:

    using System;
    using TaxiFarePredictionML.Model;
    
  4. To make a prediction on new data using the model, create a new instance of the ModelInput class inside the Main method of your application. Notice that the fare amount is not part of the input, because it is the value the model predicts.

    // Create sample data
    ModelInput input = new ModelInput()
    {
        Vendor_id = "CMT",
        Rate_code = 1,
        Passenger_count = 1,
        Trip_distance = 3.8f,
        Payment_type = "CRD"
    };
    
  5. Use the Predict method from the ConsumeModel class. The Predict method loads the trained model, creates a PredictionEngine for the model, and uses it to make predictions on new data.

    // Make prediction
    ModelOutput prediction = ConsumeModel.Predict(input);
    
    // Print Prediction
    Console.WriteLine($"Predicted Fare: {prediction.Score}");
    Console.ReadKey();
    
  6. Run the application.

    The output generated by the program should look similar to the snippet below:

    Predicted Fare: 14.96086
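
For readers curious about what the Predict helper does in step 5, the following is a rough sketch of the load, create-engine, predict sequence using the ML.NET API. It is not the generated ConsumeModel code: the ConsumeModelSketch class name is hypothetical, the model file path is an assumption, and the sketch depends on the Microsoft.ML package and on the ModelInput and ModelOutput classes from the generated TaxiFarePredictionML.Model project.

```csharp
using System;
using Microsoft.ML;
using TaxiFarePredictionML.Model;

// Hypothetical stand-in for the generated ConsumeModel helper.
public static class ConsumeModelSketch
{
    // Assumption: the saved model's file name and location; the real path
    // depends on what Model Builder generated for your project.
    private const string ModelPath = "MLModel.zip";

    public static ModelOutput Predict(ModelInput input)
    {
        var mlContext = new MLContext();

        // Load the trained model (and its input schema) from disk.
        ITransformer model = mlContext.Model.Load(ModelPath, out _);

        // Create an engine that scores one ModelInput at a time.
        var engine = mlContext.Model
            .CreatePredictionEngine<ModelInput, ModelOutput>(model);

        return engine.Predict(input);
    }
}
```

Loading the model on every call keeps the sketch short. An application that makes many predictions would likely load the model once and reuse it, keeping in mind that a PredictionEngine instance is not thread-safe.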
    

If you need to reference the generated projects at a later time from another solution, you can find them in the C:\Users\%USERNAME%\AppData\Local\Temp\MLVSTools directory.

Next Steps

In this tutorial, you learned how to:

  • Prepare and understand the data
  • Choose a scenario
  • Load the data
  • Train the model
  • Evaluate the model
  • Use the model for predictions

Additional Resources

To learn more about topics mentioned in this tutorial, visit the following resources: