Tutorial: Categorize support issues using multiclass classification with ML.NET

This sample tutorial shows how to use ML.NET to build a GitHub issue classifier: you train a model that predicts the Area label for a GitHub issue, using a .NET Core console application written in C# in Visual Studio.

In this tutorial, you learn how to:

  • Prepare your data
  • Transform the data
  • Train the model
  • Evaluate the model
  • Predict with the trained model
  • Deploy and predict with a loaded model

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

Create a console application

Create a project

  1. Open Visual Studio 2017. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "GitHubIssueClassification" and then select the OK button.

  2. Create a directory named Data in your project to save your data set files:

    In Solution Explorer, right-click your project and select Add > New Folder. Type "Data" and press Enter.

  3. Create a directory named Models in your project to save your model:

    In Solution Explorer, right-click your project and select Add > New Folder. Type "Models" and press Enter.

  4. Install the Microsoft.ML NuGet package:

    Note

    This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.

    In Solution Explorer, right-click your project and select Manage NuGet Packages. Choose "nuget.org" as the package source, select the Browse tab, search for Microsoft.ML, and select the Install button. Select the OK button on the Preview Changes dialog, and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare your data

  1. Download the issues_train.tsv and issues_test.tsv data sets and save them to the Data folder you created earlier. The first data set trains the machine learning model, and the second is used to evaluate how accurate the model is.

  2. In Solution Explorer, right-click each of the *.tsv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

Create classes and define paths

Add the following additional using statements to the top of the Program.cs file:

using System;
using System.IO;
using System.Linq;
using Microsoft.ML;

Create global fields to hold the paths to the recently downloaded files and the saved model, and global variables for the MLContext, IDataView, PredictionEngine, and trained model:

  • _appPath has the directory of the running application; the other paths are built relative to it.
  • _trainDataPath has the path to the dataset used to train the model.
  • _testDataPath has the path to the dataset used to evaluate the model.
  • _modelPath has the path where the trained model is saved.
  • _mlContext is the MLContext that provides processing context.
  • _trainingDataView is the IDataView used to process the training dataset.
  • _predEngine is the PredictionEngine<TSrc,TDst> used for single predictions.
  • _trainedModel is the ITransformer that holds the trained model.

Add the following code to the line directly above the Main method to specify those paths and the other variables:

private static string _appPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string _trainDataPath => Path.Combine(_appPath, "..", "..", "..", "Data", "issues_train.tsv");
private static string _testDataPath => Path.Combine(_appPath, "..", "..", "..", "Data", "issues_test.tsv");
private static string _modelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "model.zip");

private static MLContext _mlContext;
private static PredictionEngine<GitHubIssue, IssuePrediction> _predEngine;
private static ITransformer _trainedModel;
static IDataView _trainingDataView;
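
To see where the relative segments in these paths lead, the following sketch resolves the same ".." pattern against a hypothetical build output folder. The folder name (bin/Debug/netcoreapp2.1) and the PathDemo class are assumptions for illustration only; the actual output folder depends on your build configuration and target framework.

```csharp
using System;
using System.IO;

public static class PathDemo
{
    // Applies the tutorial's "..", "..", ".." pattern to an application folder
    // and normalizes the result so the parent-directory segments are resolved.
    public static string ResolveDataPath(string appPath, string fileName) =>
        Path.GetFullPath(Path.Combine(appPath, "..", "..", "..", "Data", fileName));

    public static void Main()
    {
        // Hypothetical output folder: <temp>/GitHubIssueClassification/bin/Debug/netcoreapp2.1
        string appPath = Path.Combine(
            Path.GetTempPath(), "GitHubIssueClassification", "bin", "Debug", "netcoreapp2.1");

        // Three ".." segments climb from the output folder back to the project folder,
        // so the result points at the Data folder created in the project.
        Console.WriteLine(ResolveDataPath(appPath, "issues_train.tsv"));
    }
}
```

If your output folder sits at a different depth below the project folder, the number of ".." segments would need to change accordingly.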

Create some classes for your input data and predictions. Add a new class to your project:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to GitHubIssueData.cs. Then, select the Add button.

    The GitHubIssueData.cs file opens in the code editor. Add the following using statement to the top of GitHubIssueData.cs:

using Microsoft.ML.Data;

Remove the existing class definition and add the following code, which has two classes, GitHubIssue and IssuePrediction, to the GitHubIssueData.cs file:

public class GitHubIssue
{
    [LoadColumn(0)]
    public string ID { get; set; }
    [LoadColumn(1)]
    public string Area { get; set; }
    [LoadColumn(2)]
    public string Title { get; set; }
    [LoadColumn(3)]
    public string Description { get; set; }
}

public class IssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}

The label is the column you want to predict. The Features are the inputs you give the model to predict the label.

Use the LoadColumnAttribute to specify the indices of the source columns in the data set.

GitHubIssue is the input dataset class and has the following String properties:

  • The first column, ID (GitHub issue ID)
  • The second column, Area (the label the model learns to predict)
  • The third column, Title (GitHub issue title), is the first feature used for predicting the Area
  • The fourth column, Description, is the second feature used for predicting the Area
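
The column indices can be illustrated without ML.NET by splitting one tab-separated row by hand. The sample row and the TsvColumnDemo helper below are hypothetical; they only show which zero-based position each LoadColumn index refers to, and are not how ML.NET parses the file internally.

```csharp
using System;

public static class TsvColumnDemo
{
    // Splits one tab-separated line into its columns; index 0 is the first column.
    public static string[] SplitRow(string line) => line.Split('\t');

    public static void Main()
    {
        // Hypothetical row in the ID / Area / Title / Description layout.
        string row = "1\tarea-System.Net\tSample title\tSample description";
        string[] columns = SplitRow(row);

        Console.WriteLine($"LoadColumn(0) -> ID:          {columns[0]}");
        Console.WriteLine($"LoadColumn(1) -> Area:        {columns[1]}");
        Console.WriteLine($"LoadColumn(2) -> Title:       {columns[2]}");
        Console.WriteLine($"LoadColumn(3) -> Description: {columns[3]}");
    }
}
```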

IssuePrediction is the class used for prediction after the model has been trained. It has a single string field (Area) with a ColumnName attribute set to PredictedLabel. The PredictedLabel is used during prediction and evaluation. For evaluation, the input data, the predicted values, and the model are used.

All ML.NET operations start in the MLContext class. Initializing _mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

Initialize variables in Main

Initialize the _mlContext global variable with a new instance of MLContext with a fixed random seed (seed: 0) for repeatable/deterministic results across multiple trainings. In the Main method, replace the Console.WriteLine("Hello World!") line with the following code:

_mlContext = new MLContext(seed: 0);

Load the data

ML.NET uses the IDataView interface as a flexible, efficient way of describing numeric or text tabular data. Data can be loaded into an IDataView from a text file or in real time (for example, from a SQL database or log files).

To initialize and load the _trainingDataView global variable in order to use it for the pipeline, add the following code after the _mlContext initialization:

_trainingDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_trainDataPath,hasHeader: true);

LoadFromTextFile() defines the data schema and reads in the file. It takes the data path variable and returns an IDataView.

Add the following as the next line of code in the Main method:

var pipeline = ProcessData();

The ProcessData method executes the following tasks:

  • Extracts and transforms the data.
  • Returns the processing pipeline.

Create the ProcessData method, just after the Main method, using the following code:

public static IEstimator<ITransformer> ProcessData()
{

}

Extract features and transform the data

Because you want to predict the Area GitHub label for a GitHubIssue, use the MapValueToKey() method to transform the Area column into a numeric key type Label column (a format accepted by classification algorithms) and add it as a new dataset column:

var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")

Next, call _mlContext.Transforms.Text.FeaturizeText, which transforms each text column (Title and Description) into a numeric vector, named TitleFeaturized and DescriptionFeaturized respectively. Append the featurization for both columns to the pipeline with the following code:

.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
.Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))

The last step in data preparation combines all of the feature columns into the Features column using the Concatenate() method. By default, a learning algorithm processes only features from the Features column. Append this transformation to the pipeline with the following code:

.Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))

Next, append a cache checkpoint with AppendCacheCheckpoint to cache the DataView, so that iterating over the data multiple times can use the cache and might perform better, as in the following code:

.AppendCacheCheckpoint(_mlContext);

Warning

Use AppendCacheCheckpoint for small/medium datasets to lower training time. Do NOT use it (remove .AppendCacheCheckpoint()) when handling very large datasets.

Return the pipeline at the end of the ProcessData method.

return pipeline;
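
Assembled from the fragments above, the complete method would look roughly like the following sketch. To keep it compilable on its own, it takes the MLContext as a parameter instead of reading the _mlContext field; the ProcessDataSketch class name and that parameter are assumptions of this sketch, not part of the tutorial code. It requires the Microsoft.ML NuGet package.

```csharp
using Microsoft.ML;

public static class ProcessDataSketch
{
    // Same steps as the tutorial's ProcessData method, with the MLContext passed in.
    public static IEstimator<ITransformer> ProcessData(MLContext mlContext)
    {
        // 1. Map the text label (Area) to a numeric key column named Label.
        var pipeline = mlContext.Transforms.Conversion
                .MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
            // 2. Turn each text column into a numeric feature vector.
            .Append(mlContext.Transforms.Text.FeaturizeText(
                inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
            .Append(mlContext.Transforms.Text.FeaturizeText(
                inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
            // 3. Combine both vectors into the single Features column the trainer reads.
            .Append(mlContext.Transforms.Concatenate(
                "Features", "TitleFeaturized", "DescriptionFeaturized"))
            // 4. Cache the transformed data; drop this step for very large datasets.
            .AppendCacheCheckpoint(mlContext);

        return pipeline;
    }
}
```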

This step handles preprocessing/featurization. Using additional components available in ML.NET can enable better results with your model.

Build and train the model

Add the following call to the BuildAndTrainModel method as the next line of code in the Main method:

var trainingPipeline = BuildAndTrainModel(_trainingDataView, pipeline);

The BuildAndTrainModel method executes the following tasks:

  • Creates the training algorithm class.
  • Trains the model.
  • Predicts area based on training data.
  • Returns the training pipeline; the trained model is kept in the _trainedModel field.

Create the BuildAndTrainModel method, just after the Main method, using the following code:

public static IEstimator<ITransformer> BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
{

}

About the classification task

Classification is a machine learning task that uses data to determine the category, type, or class of an item or row of data and is frequently one of the following types:

  • Binary: either A or B.
  • Multiclass: multiple categories that can be predicted by using a single model.

For this type of problem, use a multiclass classification learning algorithm, since your issue category prediction can be one of multiple categories (multiclass) rather than just two (binary).

Append the machine learning algorithm to the data transformation definitions by adding the following as the first line of code in BuildAndTrainModel():

var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
        .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

SdcaMaximumEntropy is your multiclass classification training algorithm. It is appended to the pipeline and accepts the featurized Title and Description (Features) and the Label input parameters to learn from the historic data.

Train the model

Fit the model to the training data (trainingDataView) and store the trained model by adding the following as the next line of code in the BuildAndTrainModel() method:

_trainedModel = trainingPipeline.Fit(trainingDataView);

The Fit() method trains your model by transforming the dataset and applying the training.

The PredictionEngine is a convenience API that allows you to pass in a single instance of data and then perform a prediction on it. Add this as the next line in the BuildAndTrainModel() method:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(_trainedModel);

Predict with the trained model

Still in the BuildAndTrainModel method, add a GitHub issue to test the trained model's prediction by creating an instance of GitHubIssue:

GitHubIssue issue = new GitHubIssue() {
    Title = "WebSockets communication is slow in my machine",
    Description = "The WebSockets communication used under the covers by SignalR looks like is going slow in my development machine.."
};

Use the Predict() function to make a prediction on a single row of data:

var prediction = _predEngine.Predict(issue);

Using the model: prediction results

Display the GitHubIssue and the corresponding Area label prediction in order to share the results and act on them accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area} ===============");

Return the training pipeline

Return the training pipeline at the end of the BuildAndTrainModel method. The trained model itself stays available for evaluation through the _trainedModel field.

return trainingPipeline;
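
The same build, fit, and predict sequence can be tried without the TSV files. The following is a minimal sketch under stated assumptions: IssueRow and AreaPrediction are stand-ins for the tutorial's GitHubIssue and IssuePrediction classes, the six in-memory rows and their "area-A"/"area-B" labels are made up, and a data set this small says nothing about real model quality. It requires the Microsoft.ML NuGet package.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Stand-in for the tutorial's GitHubIssue class (hypothetical, no LoadColumn needed in memory).
public class IssueRow
{
    public string Area { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
}

// Stand-in for the tutorial's IssuePrediction class.
public class AreaPrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}

public static class TrainSketch
{
    public static AreaPrediction TrainAndPredict(IssueRow example)
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up training rows; the tutorial loads issues_train.tsv instead.
        var rows = new[]
        {
            new IssueRow { Area = "area-A", Title = "socket timeout", Description = "network request hangs" },
            new IssueRow { Area = "area-A", Title = "http client slow", Description = "network call is slow" },
            new IssueRow { Area = "area-A", Title = "connection reset", Description = "socket closes early" },
            new IssueRow { Area = "area-B", Title = "query fails", Description = "database query throws" },
            new IssueRow { Area = "area-B", Title = "sql connection error", Description = "database cannot open" },
            new IssueRow { Area = "area-B", Title = "table mapping bug", Description = "database column missing" },
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(rows);

        // Featurization followed by the trainer and the key-to-text mapping, as in the tutorial.
        var trainingPipeline = mlContext.Transforms.Conversion
                .MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
            .Append(mlContext.Transforms.Text.FeaturizeText(
                inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
            .Append(mlContext.Transforms.Text.FeaturizeText(
                inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
            .Append(mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        // Fit trains the model; the prediction engine then scores one row at a time.
        ITransformer trainedModel = trainingPipeline.Fit(trainingData);
        var engine = mlContext.Model.CreatePredictionEngine<IssueRow, AreaPrediction>(trainedModel);
        return engine.Predict(example);
    }

    public static void Main()
    {
        var prediction = TrainAndPredict(
            new IssueRow { Title = "socket hangs", Description = "network request never returns" });
        Console.WriteLine($"Predicted area: {prediction.Area}");
    }
}
```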

Evaluate the model

Now that you've created and trained the model, you need to evaluate it with a different dataset for quality assurance and validation. The Evaluate method evaluates the model trained in BuildAndTrainModel, which it reads from the _trainedModel field; its parameter is the training data schema, which is used later to save the model. Create the Evaluate method, just after BuildAndTrainModel, as in the following code:

public static void Evaluate(DataViewSchema trainingDataViewSchema)
{

}

The Evaluate method executes the following tasks:

  • Loads the test dataset.
  • Creates the multiclass evaluator.
  • Evaluates the model and creates metrics.
  • Displays the metrics.

Add a call to the new method from the Main method, right under the BuildAndTrainModel method call, using the following code:

Evaluate(_trainingDataView.Schema);

As you did previously with the training dataset, load the test dataset by adding the following code to the Evaluate method:

var testDataView = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_testDataPath,hasHeader: true);

The Evaluate() method of the MulticlassClassification catalog computes the quality metrics for the model using the specified dataset. It returns a MulticlassClassificationMetrics object that contains the overall metrics computed by multiclass classification evaluators. To display the metrics and judge the quality of the model, you need to get them first. Notice the use of the Transform() method of the _trainedModel global variable (an ITransformer) to input the features and return predictions. Add the following code to the Evaluate method as the next line:

var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

The following metrics are evaluated for multiclass classification:

  • Micro Accuracy - Every sample-class pair contributes equally to the accuracy metric. You want Micro Accuracy to be as close to one as possible.

  • Macro Accuracy - Every class contributes equally to the accuracy metric. Minority classes are given equal weight as the larger classes. You want Macro Accuracy to be as close to one as possible.

  • Log-loss - Penalizes predictions that assign a low probability to the true class. You want Log-loss to be as close to zero as possible.

  • Log-loss reduction - Ranges from -inf to 1.00, where 1.00 indicates perfect predictions and 0 indicates mean predictions, that is, no better than always predicting the prior class distribution. You want Log-loss reduction to be as close to one as possible.
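
The four metrics can be made concrete with a small hand computation. The sketch below uses the commonly cited definitions (natural logarithm for log-loss, and log-loss reduction measured relative to a prior baseline), which are assumed here to match what ML.NET reports; the MetricsSketch class and the toy labels are hypothetical.

```csharp
using System;
using System.Linq;

public static class MetricsSketch
{
    // Micro accuracy: fraction of all samples whose predicted class equals the true class.
    public static double MicroAccuracy(string[] actual, string[] predicted) =>
        actual.Zip(predicted, (a, p) => a == p ? 1.0 : 0.0).Average();

    // Macro accuracy: accuracy computed separately per true class, then averaged over classes.
    public static double MacroAccuracy(string[] actual, string[] predicted) =>
        actual.Zip(predicted, (a, p) => (Actual: a, Correct: a == p ? 1.0 : 0.0))
              .GroupBy(x => x.Actual)
              .Select(g => g.Average(x => x.Correct))
              .Average();

    // Log-loss: average negative natural log of the probability assigned to the true class.
    public static double LogLoss(double[] probabilityOfTrueClass) =>
        -probabilityOfTrueClass.Average(p => Math.Log(p));

    // Log-loss reduction: relative improvement of the model over a baseline (prior) log-loss.
    public static double LogLossReduction(double modelLogLoss, double priorLogLoss) =>
        (priorLogLoss - modelLogLoss) / priorLogLoss;

    public static void Main()
    {
        // Toy data: class A has four samples, class B has two.
        string[] actual    = { "A", "A", "A", "A", "B", "B" };
        string[] predicted = { "A", "A", "A", "B", "B", "A" };

        Console.WriteLine(MicroAccuracy(actual, predicted)); // 4 of 6 samples are correct
        Console.WriteLine(MacroAccuracy(actual, predicted)); // mean of 3/4 (class A) and 1/2 (class B)
    }
}
```

In this toy case the smaller class B is predicted less accurately, which pulls the macro accuracy below the micro accuracy; the results later in this tutorial show the same ordering.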

Displaying the metrics for model validation

Use the following code to display the metrics, share the results, and then act on them:

Console.WriteLine($"*************************************************************************************************************");
Console.WriteLine($"*       Metrics for Multi-class Classification model - Test Data     ");
Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
Console.WriteLine($"*       MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
Console.WriteLine($"*       MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
Console.WriteLine($"*       LogLoss:          {testMetrics.LogLoss:#.###}");
Console.WriteLine($"*       LogLossReduction: {testMetrics.LogLossReduction:#.###}");
Console.WriteLine($"*************************************************************************************************************");

Save the model to a file

Once you are satisfied with your model, save it to a file to make predictions at a later time or in another application. Add the following code to the end of the Evaluate method:

SaveModelAsFile(_mlContext, trainingDataViewSchema, _trainedModel);

Create the SaveModelAsFile method below your Evaluate method.

private static void SaveModelAsFile(MLContext mlContext,DataViewSchema trainingDataViewSchema, ITransformer model)
{

}

Add the following code to your SaveModelAsFile method. This code uses the Save method to serialize and store the trained model as a zip file.

mlContext.Model.Save(model, trainingDataViewSchema, _modelPath);
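
The evaluate-then-save sequence can be condensed into one helper, sketched below. The EvaluateSketch class, its parameter list, and the step that creates the target folder are assumptions of this sketch rather than part of the tutorial code; the folder step is included because saving is expected to fail when the Models folder does not exist yet. It requires the Microsoft.ML NuGet package.

```csharp
using System;
using System.IO;
using Microsoft.ML;

public static class EvaluateSketch
{
    // Scores the test data with the trained model, prints one metric, and saves the model.
    public static double EvaluateAndSave(
        MLContext mlContext,
        ITransformer model,
        IDataView testData,
        DataViewSchema trainingSchema,
        string modelPath)
    {
        // Transform() adds the prediction columns that the evaluator compares with Label.
        var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(testData));
        Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy:0.###}");

        // Assumption: make sure the target folder exists before writing the zip file.
        string folder = Path.GetDirectoryName(Path.GetFullPath(modelPath));
        Directory.CreateDirectory(folder);

        // The schema of the training data is stored with the model so it can be reloaded later.
        mlContext.Model.Save(model, trainingSchema, modelPath);
        return metrics.MicroAccuracy;
    }
}
```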

Deploy and predict with a model

Add a call to the new method from the Main method, right under the Evaluate method call, using the following code:

PredictIssue();

Create the PredictIssue method, just after the Evaluate method (and just before the SaveModelAsFile method), using the following code:

private static void PredictIssue()
{

}

The PredictIssue method executes the following tasks:

  • Loads the saved model.
  • Creates a single issue of test data.
  • Predicts Area based on test data.
  • Combines test data and predictions for reporting.
  • Displays the predicted results.

Load the saved model into your application by adding the following code to the PredictIssue method:

ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);

In the PredictIssue method, add a GitHub issue to test the loaded model's prediction by creating an instance of GitHubIssue:

GitHubIssue singleIssue = new GitHubIssue() { Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };

As you did previously, create a PredictionEngine instance with the following code:

_predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssue, IssuePrediction>(loadedModel);

The PredictionEngine is a convenience API that allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. See the guide on how to use PredictionEnginePool in an ASP.NET Core Web API.

Note

The PredictionEnginePool service extension is currently in preview.
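
As a rough illustration of the pooled alternative, the following sketch registers a pool for the saved model with a dependency injection container and predicts through it. It assumes the Microsoft.Extensions.ML and Microsoft.Extensions.DependencyInjection packages are installed; the PredictionPoolSketch class, its method names, and the "IssueModel" name are hypothetical, and because the extension is in preview its API may differ in your version.

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.ML;

public static class PredictionPoolSketch
{
    // Hypothetical name under which the model file is registered in the pool.
    public const string ModelName = "IssueModel";

    // Registers a pool of prediction engines backed by the saved model file.
    public static IServiceCollection AddIssueModel<TData, TPrediction>(
        this IServiceCollection services, string modelPath)
        where TData : class
        where TPrediction : class, new()
    {
        services.AddPredictionEnginePool<TData, TPrediction>()
            // Arguments: model name, file path, and whether to reload when the file changes.
            .FromFile(ModelName, modelPath, false);
        return services;
    }

    // Predicts with an engine borrowed from the pool instead of a shared PredictionEngine.
    public static TPrediction PredictWithPool<TData, TPrediction>(
        PredictionEnginePool<TData, TPrediction> pool, TData example)
        where TData : class
        where TPrediction : class, new()
    {
        return pool.Predict(ModelName, example);
    }
}
```

In an ASP.NET Core application, the registration would typically be called on the service collection at startup, with the tutorial's GitHubIssue and IssuePrediction classes as type arguments, and the pool would then be injected where predictions are needed.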

Use the PredictionEngine to predict the Area GitHub label by adding the following code to the PredictIssue method:

var prediction = _predEngine.Predict(singleIssue);

Using the loaded model for prediction

Display the Area in order to categorize the issue and act on it accordingly. Create a display for the results using the following Console.WriteLine() code:

Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area} ===============");
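
The load-and-predict steps above can also be wrapped in a small generic helper, sketched here. The LoadedModelSketch class and the PredictWithSavedModel method are hypothetical names; the calls inside are the same Load, CreatePredictionEngine, and Predict calls used in PredictIssue. It requires the Microsoft.ML NuGet package.

```csharp
using Microsoft.ML;

public static class LoadedModelSketch
{
    // Loads a saved model from disk and predicts for a single example.
    public static TDst PredictWithSavedModel<TSrc, TDst>(
        MLContext mlContext, string modelPath, TSrc example)
        where TSrc : class
        where TDst : class, new()
    {
        // Load returns the model and, through the out parameter, the schema it was saved with.
        ITransformer loadedModel = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);

        // The prediction engine is created from the loaded model, not the in-memory one.
        var engine = mlContext.Model.CreatePredictionEngine<TSrc, TDst>(loadedModel);
        return engine.Predict(example);
    }
}
```

With the tutorial's types, a call might look like LoadedModelSketch.PredictWithSavedModel<GitHubIssue, IssuePrediction>(_mlContext, _modelPath, singleIssue).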

Results

Your results should be similar to the following. As the pipeline processes, it displays messages. You may see warnings or processing messages. These messages have been removed from the following results for clarity.

=============== Single Prediction just-trained-model - Result: area-System.Net ===============
*************************************************************************************************************
*       Metrics for Multi-class Classification model - Test Data
*------------------------------------------------------------------------------------------------------------
*       MicroAccuracy:    0.738
*       MacroAccuracy:    0.668
*       LogLoss:          .919
*       LogLossReduction: .643
*************************************************************************************************************
=============== Single Prediction - Result: area-System.Data ===============

Congratulations! You've now successfully built a machine learning model for classifying and predicting an Area label for a GitHub issue. You can find the source code for this tutorial at the dotnet/samples repository.

Next steps

In this tutorial, you learned how to:

  • Prepare your data
  • Transform the data
  • Train the model
  • Evaluate the model
  • Predict with the trained model
  • Deploy and predict with a loaded model

Advance to the next tutorial to learn more