Tutorial: Build a movie recommender using matrix factorization with ML.NET

This tutorial shows you how to build a movie recommender with ML.NET in a .NET Core console application. The steps use C# and Visual Studio 2019.

In this tutorial, you learn how to:

  • Select a machine learning algorithm
  • Prepare and load your data
  • Build and train a model
  • Evaluate a model
  • Deploy and consume a model

You can find the source code for this tutorial at the dotnet/samples repository.

Machine learning workflow

You will use the following steps to accomplish your task, as well as any other ML.NET task:

  1. Load your data
  2. Build and train your model
  3. Evaluate your model
  4. Use your model

Prerequisites

  • Visual Studio 2019 or later, or Visual Studio 2017 version 15.6 or later, with the ".NET Core cross-platform development" workload installed.

Select the appropriate machine learning task

There are several ways to approach a recommendation problem, such as recommending a list of movies or a list of related products. In this tutorial, you predict the rating (1-5) that a user will give to a particular movie and recommend that movie if the predicted rating is higher than a defined threshold. The higher the predicted rating, the more likely the user is to like the movie.

Create a console application

Create a project

  1. Open Visual Studio. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "MovieRecommender" and then select the OK button.

  2. Create a directory named Data in your project to store the dataset files:

    In Solution Explorer, right-click the project and select Add > New Folder. Type "Data" and press Enter.

  3. Install the Microsoft.ML and Microsoft.ML.Recommender NuGet packages:

    Note

    This sample uses the latest stable version of the NuGet packages mentioned unless otherwise stated.

    In Solution Explorer, right-click the project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, select the package in the list, and select the Install button. Select the OK button on the Preview Changes dialog, and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed. Repeat these steps for Microsoft.ML.Recommender.

  4. Add the following using statements at the top of your Program.cs file:

    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Trainers;
    

Download your data

  1. Download the two datasets and save them to the Data folder you previously created:

    • Right-click recommendation-ratings-train.csv and select "Save Link (or Target) As..."

    • Right-click recommendation-ratings-test.csv and select "Save Link (or Target) As..."

      Make sure you either save the *.csv files to the Data folder or, if you save them elsewhere, move them to the Data folder afterward.

  2. In Solution Explorer, right-click each of the *.csv files and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.


Load your data

The first step in the ML.NET process is to prepare and load your model training and testing data.

The recommendation ratings data is split into Train and Test datasets. The Train data is used to fit your model. The Test data is used to make predictions with your trained model and evaluate model performance. It's common to have an 80/20 split between Train and Test data.

Below is a preview of the data from your *.csv files:

[Screenshot: preview of the CSV dataset]

In the *.csv files, there are four columns:

  • userId
  • movieId
  • rating
  • timestamp

In machine learning, the columns that are used to make a prediction are called Features, and the column with the returned prediction is called the Label.

You want to predict movie ratings, so the rating column is the Label. The other three columns, userId, movieId, and timestamp, are all Features used to predict the Label.

Features     Label
userId       rating
movieId
timestamp

It's up to you to decide which Features are used to predict the Label. You can also use methods like permutation feature importance to help select the best Features.

In this case, you should eliminate the timestamp column as a Feature because the timestamp does not really affect how a user rates a given movie, so it would not contribute to a more accurate prediction:

Features     Label
userId       rating
movieId

Next, you must define your data structure for the input class.

Add a new class to your project:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to MovieRatingData.cs. Then, select the Add button.

The MovieRatingData.cs file opens in the code editor. Add the following using statement to the top of MovieRatingData.cs:

using Microsoft.ML.Data;

Create a class called MovieRating by removing the existing class definition and adding the following code in MovieRatingData.cs:

public class MovieRating
{
    [LoadColumn(0)]
    public float userId;
    [LoadColumn(1)]
    public float movieId;
    [LoadColumn(2)]
    public float Label;
}

MovieRating specifies an input data class. The LoadColumn attribute specifies which columns (by column index) in the dataset should be loaded. The userId and movieId columns are your Features (the inputs you give the model to predict the Label), and the rating column, loaded into the Label field, is the value you will predict (the output of the model).

Create another class, MovieRatingPrediction, to represent predicted results by adding the following code after the MovieRating class in MovieRatingData.cs:

public class MovieRatingPrediction
{
    public float Label;
    public float Score;
}

In Program.cs, replace the Console.WriteLine("Hello World!") statement with the following code inside Main():

MLContext mlContext = new MLContext();

The MLContext class is the starting point for all ML.NET operations. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

After Main(), create a method called LoadData():

public static (IDataView training, IDataView test) LoadData(MLContext mlContext)
{

}

Note

This method will give you an error until you add a return statement in the following steps.

Initialize your data path variables, load the data from the *.csv files, and return the Train and Test data as IDataView objects by adding the following as the next lines of code in LoadData():

var trainingDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "recommendation-ratings-train.csv");
var testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "recommendation-ratings-test.csv");

IDataView trainingDataView = mlContext.Data.LoadFromTextFile<MovieRating>(trainingDataPath, hasHeader: true, separatorChar: ',');
IDataView testDataView = mlContext.Data.LoadFromTextFile<MovieRating>(testDataPath, hasHeader: true, separatorChar: ',');

return (trainingDataView, testDataView);

Data in ML.NET is represented as an IDataView. IDataView is a flexible, efficient way of describing tabular data (numeric and text). Data can be loaded into an IDataView object from a text file or from other sources (for example, a SQL database or log files).

The LoadFromTextFile() method defines the data schema and reads in the file. It takes a data path and returns an IDataView. In this case, you provide the paths for your Train and Test files and indicate both that the text file has a header (so the column names are handled properly) and that the data separator is a comma (the default separator is a tab).

Add the following code in the Main() method to call your LoadData() method and return the Train and Test data:

(IDataView trainingDataView, IDataView testDataView) = LoadData(mlContext);

Build and train your model

There are three major concepts in ML.NET: Data, Transformers, and Estimators.

Machine learning training algorithms require data in a certain format. Transformers are used to transform tabular data to a compatible format.

[Diagram: Transformer dataflow]

You create Transformers in ML.NET by creating Estimators. Estimators take in data and return Transformers.

[Diagram: Estimator dataflow]

The recommendation training algorithm you will use for training your model is an example of an Estimator.

Build an Estimator with the following steps:

Create the BuildAndTrainModel() method, just after the LoadData() method, using the following code:

public static ITransformer BuildAndTrainModel(MLContext mlContext, IDataView trainingDataView)
{

}

Note

This method will give you an error until you add a return statement in the following steps.

Define the data transformations by adding the following code to BuildAndTrainModel():

IEstimator<ITransformer> estimator = mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "userIdEncoded", inputColumnName: "userId")
    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: "movieIdEncoded", inputColumnName: "movieId"));

Since userId and movieId identify users and movies rather than represent real-valued quantities, you use the MapValueToKey() method to transform each userId and each movieId into a numeric key type Feature column (a format accepted by recommendation algorithms) and add them as new dataset columns:

userId   movieId   Label   userIdEncoded   movieIdEncoded
1        1         4       userKey1        movieKey1
1        3         4       userKey1        movieKey2
1        6         4       userKey1        movieKey3
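
To make the key mapping concrete, here is a minimal plain C# sketch of the idea: each distinct raw ID receives the next consecutive key in order of first appearance. The names KeyMappingSketch and MapToKeys are made up for illustration, and the encoding shown is an assumption about the general technique, not ML.NET's actual implementation of MapValueToKey().

```csharp
using System;
using System.Collections.Generic;

public static class KeyMappingSketch
{
    // Assigns 1-based consecutive keys to distinct values in order of first
    // appearance. Illustrative assumption only; this is not ML.NET code.
    public static uint[] MapToKeys(IEnumerable<float> rawIds)
    {
        var keyOf = new Dictionary<float, uint>();
        var keys = new List<uint>();
        foreach (float id in rawIds)
        {
            if (!keyOf.TryGetValue(id, out uint key))
            {
                key = (uint)(keyOf.Count + 1); // next unused key
                keyOf[id] = key;
            }
            keys.Add(key);
        }
        return keys.ToArray();
    }
}
```

For the movieId values in the table above, KeyMappingSketch.MapToKeys(new float[] { 1, 3, 6 }) yields the keys 1, 2, and 3, which play the role of movieKey1, movieKey2, and movieKey3.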

Choose the machine learning algorithm and append it to the data transformation definitions by adding the following as the next lines of code in BuildAndTrainModel():

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "userIdEncoded",
    MatrixRowIndexColumnName = "movieIdEncoded",
    LabelColumnName = "Label",
    NumberOfIterations = 20,
    ApproximationRank = 100
};

var trainerEstimator = estimator.Append(mlContext.Recommendation().Trainers.MatrixFactorization(options));

The MatrixFactorizationTrainer is your recommendation training algorithm. Matrix Factorization is a common approach to recommendation when you have data on how users have rated products in the past, which is the case for the datasets in this tutorial. There are other recommendation algorithms for when you have different data available (see the Other recommendation algorithms section below to learn more).

In this case, the Matrix Factorization algorithm uses a method called "collaborative filtering", which assumes that if User 1 has the same opinion as User 2 on a certain issue, then User 1 is more likely to feel the same way as User 2 about a different issue.

For instance, if User 1 and User 2 rate movies similarly, then User 2 is more likely to enjoy a movie that User 1 has watched and rated highly:

          Incredibles 2 (2018)       The Avengers (2012)        Guardians of the Galaxy (2014)
User 1    Watched and liked movie    Watched and liked movie    Watched and liked movie
User 2    Watched and liked movie    Watched and liked movie    Has not watched -- RECOMMEND movie
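
The intuition above is commonly formalized as follows. This is the textbook regularized matrix factorization objective, given here as a sketch: the symbols $p_u$, $q_i$, $k$, $\lambda$, and $\mathcal{K}$ are introduced for illustration, and the exact loss and regularization used by ML.NET's trainer may differ.

```latex
\hat{r}_{ui} = p_u^{\top} q_i, \qquad p_u,\ q_i \in \mathbb{R}^{k}

\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
```

Here $r_{ui}$ is the rating user $u$ gave movie $i$, $\mathcal{K}$ is the set of observed (user, movie) pairs, and $k$ is the number of latent factors, which plausibly corresponds to the ApproximationRank option. A movie that User 2 has not watched still gets a predicted rating $\hat{r}_{ui}$, because both vectors are learned from the ratings that were observed.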

The Matrix Factorization trainer has several Options, which you can read more about in the Algorithm hyperparameters section below.

Fit the model to the Train data and return the trained model by adding the following as the next lines of code in the BuildAndTrainModel() method:

Console.WriteLine("=============== Training the model ===============");
ITransformer model = trainerEstimator.Fit(trainingDataView);

return model;

The Fit() method trains your model with the provided training dataset. Technically, it executes the Estimator definitions by transforming the data and applying the training, and it returns the trained model, which is a Transformer.

Add the following as the next line of code in the Main() method to call your BuildAndTrainModel() method and return the trained model:

ITransformer model = BuildAndTrainModel(mlContext, trainingDataView);

Evaluate your model

Once you have trained your model, use your test data to evaluate how your model is performing.

Create the EvaluateModel() method, just after the BuildAndTrainModel() method, using the following code:

public static void EvaluateModel(MLContext mlContext, IDataView testDataView, ITransformer model)
{

}

Transform the Test data by adding the following code to EvaluateModel():

Console.WriteLine("=============== Evaluating the model ===============");
var prediction = model.Transform(testDataView);

The Transform() method makes predictions for multiple input rows of a test dataset.

Evaluate the model by adding the following as the next line of code in the EvaluateModel() method:

var metrics = mlContext.Regression.Evaluate(prediction, labelColumnName: "Label", scoreColumnName: "Score");

Once you have the prediction set, the Evaluate() method assesses the model by comparing the predicted values with the actual Labels in the test dataset, and it returns metrics on how the model is performing.

Print your evaluation metrics to the console by adding the following as the next lines of code in the EvaluateModel() method:

Console.WriteLine("Root Mean Squared Error : " + metrics.RootMeanSquaredError.ToString());
Console.WriteLine("RSquared: " + metrics.RSquared.ToString());

Add the following as the next line of code in the Main() method to call your EvaluateModel() method:

EvaluateModel(mlContext, testDataView, model);

The output so far should look similar to the following text:

=============== Training the model ===============
iter      tr_rmse          obj
   0       1.5403   3.1262e+05
   1       0.9221   1.6030e+05
   2       0.8687   1.5046e+05
   3       0.8416   1.4584e+05
   4       0.8142   1.4209e+05
   5       0.7849   1.3907e+05
   6       0.7544   1.3594e+05
   7       0.7266   1.3361e+05
   8       0.6987   1.3110e+05
   9       0.6751   1.2948e+05
  10       0.6530   1.2766e+05
  11       0.6350   1.2644e+05
  12       0.6197   1.2541e+05
  13       0.6067   1.2470e+05
  14       0.5953   1.2382e+05
  15       0.5871   1.2342e+05
  16       0.5781   1.2279e+05
  17       0.5713   1.2240e+05
  18       0.5660   1.2230e+05
  19       0.5592   1.2179e+05
=============== Evaluating the model ===============
Root Mean Squared Error : 0.994051469730769
RSquared: 0.412556298844873

This output shows 20 training iterations (0 through 19), matching NumberOfIterations = 20. In each iteration, the training error (tr_rmse) decreases, moving closer to 0.

The root mean squared error (RMS or RMSE) measures the differences between the values the model predicts and the values observed in the test dataset. Technically, it's the square root of the average of the squared errors. The lower it is, the better the model.

R Squared indicates how well the data fits the model. It typically ranges from 0 to 1. A value of 0 means that the data is random or otherwise can't be fit to the model. A value of 1 means that the model exactly matches the data. You want your R Squared score to be as close to 1 as possible.
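
To see how these two metrics relate to the raw predictions, here is a small plain C# sketch that computes them by hand from arrays of actual and predicted ratings. It uses the standard textbook definitions; MetricsSketch and its method names are hypothetical, and the details of how ML.NET's Evaluate() computes its metrics are not shown here and may differ.

```csharp
using System;
using System.Linq;

public static class MetricsSketch
{
    // Square root of the mean squared difference between actual and predicted values.
    public static double Rmse(double[] actual, double[] predicted)
    {
        double sumSq = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        return Math.Sqrt(sumSq / actual.Length);
    }

    // 1 - (residual sum of squares / total sum of squares around the mean).
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }
}
```

For example, with actual ratings { 4, 3, 5, 2 } and predictions { 3.5, 3, 4, 2.5 }, the squared errors sum to 1.5, so the RMSE is the square root of 1.5 / 4 (about 0.612) and R Squared is 1 - 1.5 / 5 = 0.7.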

Building successful models is an iterative process. This model has lower initial quality because the tutorial uses small datasets to provide quick model training. If you aren't satisfied with the model quality, you can try to improve it by providing larger training datasets or by choosing different training algorithms with different hyperparameters for each algorithm. For more information, check out the Improve your model section below.

Use your model

Now you can use your trained model to make predictions on new data.

Create the UseModelForSinglePrediction() method, just after the EvaluateModel() method, using the following code:

public static void UseModelForSinglePrediction(MLContext mlContext, ITransformer model)
{

}

Use the PredictionEngine to predict the rating by adding the following code to UseModelForSinglePrediction():

Console.WriteLine("=============== Making a prediction ===============");
var predictionEngine = mlContext.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(model);

The PredictionEngine is a convenience API that allows you to perform a prediction on a single instance of data. PredictionEngine is not thread-safe. It's acceptable to use in single-threaded or prototype environments. For improved performance and thread safety in production environments, use the PredictionEnginePool service, which creates an ObjectPool of PredictionEngine objects for use throughout your application. For details, see the guide on how to use PredictionEnginePool in an ASP.NET Core Web API.

Note

The PredictionEnginePool service extension is currently in preview.

Create an instance of MovieRating called testInput and pass it to the prediction engine by adding the following as the next lines of code in the UseModelForSinglePrediction() method:

var testInput = new MovieRating { userId = 6, movieId = 10 };

var movieRatingPrediction = predictionEngine.Predict(testInput);

The Predict() function makes a prediction on a single row of data.

You can then use the Score, or the predicted rating, to determine whether you want to recommend the movie with movieId 10 to user 6. The higher the Score, the higher the likelihood of a user liking a particular movie. In this case, let's say that you recommend movies with a predicted rating greater than 3.5.

To print the results, add the following as the next lines of code in the UseModelForSinglePrediction() method:

if (Math.Round(movieRatingPrediction.Score, 1) > 3.5)
{
    Console.WriteLine("Movie " + testInput.movieId + " is recommended for user " + testInput.userId);
}
else
{
    Console.WriteLine("Movie " + testInput.movieId + " is not recommended for user " + testInput.userId);
}

Add the following as the next line of code in the Main() method to call your UseModelForSinglePrediction() method:

UseModelForSinglePrediction(mlContext, model);

The output of this method should look similar to the following text:

=============== Making a prediction ===============
Movie 10 is recommended for user 6
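
The same threshold rule extends to several candidate movies at once. The sketch below assumes you have already collected predicted scores for one user (for example, by calling Predict() once per movie); RecommendationFilter, Recommend, and the scoreByMovieId parameter are hypothetical names that are not part of the tutorial's project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecommendationFilter
{
    // Returns the movie IDs whose rounded predicted score exceeds the threshold,
    // ordered from highest to lowest predicted score.
    public static List<float> Recommend(IDictionary<float, float> scoreByMovieId, double threshold = 3.5)
    {
        return scoreByMovieId
            .Where(kv => Math.Round(kv.Value, 1) > threshold)
            .OrderByDescending(kv => kv.Value)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

The rounding mirrors the check in UseModelForSinglePrediction(), so a movie is listed here only if the single-prediction code would also have recommended it.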

Save your model

To use your model to make predictions in end-user applications, you must first save the model.

Create the SaveModel() method, just after the UseModelForSinglePrediction() method, using the following code:

public static void SaveModel(MLContext mlContext, DataViewSchema trainingDataViewSchema, ITransformer model)
{

}

Save your trained model by adding the following code in the SaveModel() method:

var modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "MovieRecommenderModel.zip");

Console.WriteLine("=============== Saving the model to a file ===============");
mlContext.Model.Save(model, trainingDataViewSchema, modelPath);

This method saves your trained model to a .zip file (in the "Data" folder), which can then be used in other .NET applications to make predictions.

Add the following as the next line of code in the Main() method to call your SaveModel() method:

SaveModel(mlContext, trainingDataView.Schema, model);

Use your saved model

Once you have saved your trained model, you can consume the model in different environments. See Save and load trained models to learn how to operationalize a trained machine learning model in apps.
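
As a rough illustration of consuming the saved file, the following sketch loads the .zip produced by SaveModel() and scores one user/movie pair. SavedModelSketch and PredictRating are hypothetical names, and the two data classes repeat the shapes of the tutorial's classes (without the LoadColumn attributes, which only matter when reading files) so that the sketch compiles on its own; omit them if MovieRating and MovieRatingPrediction are already in scope.

```csharp
using System;
using System.IO;
using Microsoft.ML;

// Same field names and types as the tutorial's input and output classes.
public class MovieRating
{
    public float userId;
    public float movieId;
    public float Label;
}

public class MovieRatingPrediction
{
    public float Label;
    public float Score;
}

public static class SavedModelSketch
{
    // Loads a saved model from disk and returns the predicted rating (Score)
    // for a single user/movie pair.
    public static float PredictRating(string modelPath, float userId, float movieId)
    {
        if (!File.Exists(modelPath))
            throw new FileNotFoundException("Model file not found.", modelPath);

        var mlContext = new MLContext();

        // Load() also returns the input schema that was saved alongside the model.
        ITransformer loadedModel = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);

        var engine = mlContext.Model.CreatePredictionEngine<MovieRating, MovieRatingPrediction>(loadedModel);
        return engine.Predict(new MovieRating { userId = userId, movieId = movieId }).Score;
    }
}
```

With the path used in SaveModel(), a call might look like SavedModelSketch.PredictRating(Path.Combine(Environment.CurrentDirectory, "Data", "MovieRecommenderModel.zip"), 6, 10).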

Results

After following the steps above, run your console app (Ctrl + F5). Your results from the single prediction above should be similar to the following. You may see warnings or processing messages, but these messages have been removed from the following results for clarity.

=============== Training the model ===============
iter      tr_rmse          obj
   0       1.5382   3.1213e+05
   1       0.9223   1.6051e+05
   2       0.8691   1.5050e+05
   3       0.8413   1.4576e+05
   4       0.8145   1.4208e+05
   5       0.7848   1.3895e+05
   6       0.7552   1.3613e+05
   7       0.7259   1.3357e+05
   8       0.6987   1.3121e+05
   9       0.6747   1.2949e+05
  10       0.6533   1.2766e+05
  11       0.6353   1.2636e+05
  12       0.6209   1.2561e+05
  13       0.6072   1.2462e+05
  14       0.5965   1.2394e+05
  15       0.5868   1.2352e+05
  16       0.5782   1.2279e+05
  17       0.5713   1.2227e+05
  18       0.5637   1.2190e+05
  19       0.5604   1.2178e+05
=============== Evaluating the model ===============
Root Mean Squared Error : 0.977175077487166
RSquared: 0.43233349213192
=============== Making a prediction ===============
Movie 10 is recommended for user 6
=============== Saving the model to a file ===============

Congratulations! You've now successfully built a machine learning model for recommending movies. You can find the source code for this tutorial at the dotnet/samples repository.

Improve your model

There are several ways that you can improve the performance of your model so that you can get more accurate predictions.

Data

Adding more training data that has enough samples for each user and movie ID can help improve the quality of the recommendation model.

Cross validation is a technique for evaluating models that randomly splits the data into subsets (instead of extracting test data from the dataset once, as you did in this tutorial) and uses some of the subsets as training data and others as test data. Because every row is used for both training and testing across the rounds, it generally gives a more reliable estimate of model quality than a single train-test split.
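
The splitting step can be sketched in plain C# as follows. This is a hand-rolled illustration of k-fold splitting over row indices, with hypothetical names (KFoldSketch, Split); it is not how ML.NET implements cross validation internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KFoldSketch
{
    // Shuffles row indices 0..rowCount-1 and deals them into k folds.
    // Each fold serves once as the test set; the remaining folds form the training set.
    public static List<(int[] Train, int[] Test)> Split(int rowCount, int k, int seed = 0)
    {
        var rng = new Random(seed);
        int[] shuffled = Enumerable.Range(0, rowCount).OrderBy(_ => rng.Next()).ToArray();

        var splits = new List<(int[] Train, int[] Test)>();
        for (int fold = 0; fold < k; fold++)
        {
            int[] test = shuffled.Where((_, pos) => pos % k == fold).ToArray();
            int[] train = shuffled.Where((_, pos) => pos % k != fold).ToArray();
            splits.Add((train, test));
        }
        return splits;
    }
}
```

You would then train on each Train set, evaluate on the matching Test set, and average the metrics. In practice you would not split by hand: ML.NET provides mlContext.Regression.CrossValidate(), which fits an estimator once per fold and returns the metrics for each fold.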

Features

In this tutorial, you only use two Features (userId and movieId), together with the rating column as the Label, all of which are provided by the dataset.

While this is a good start, in reality you might want to add other attributes or Features (for example, age, gender, or geo-location) if they are included in the dataset. Adding more relevant Features can help improve the performance of your recommendation model.

If you are unsure about which Features might be the most relevant for your machine learning task, you can also make use of Feature Contribution Calculation (FCC) and permutation feature importance, which ML.NET provides to discover the most influential Features.

Algorithm hyperparameters

While ML.NET provides good default training algorithms, you can further fine-tune performance by changing the algorithm's hyperparameters.

For Matrix Factorization, you can experiment with hyperparameters such as NumberOfIterations and ApproximationRank to see if that gives you better results.

For instance, in this tutorial the algorithm options are:

var options = new MatrixFactorizationTrainer.Options
{
    MatrixColumnIndexColumnName = "userIdEncoded",
    MatrixRowIndexColumnName = "movieIdEncoded",
    LabelColumnName = "Label",
    NumberOfIterations = 20,
    ApproximationRank = 100
};

Other recommendation algorithms

The matrix factorization algorithm with collaborative filtering is only one approach for performing movie recommendations. In many cases, you may not have ratings data available and may only have each user's movie history. In other cases, you may have more than just the user's rating data.

Algorithm: One Class Matrix Factorization
Scenario: Use this when you only have userId and movieId. This style of recommendation is based upon the co-purchase scenario, or products frequently bought together, which means it recommends to customers a set of products based upon their own purchase order history.

Algorithm: Field Aware Factorization Machines
Scenario: Use this to make recommendations when you have more Features beyond userId, productId, and rating (such as product description or product price). This method also uses a collaborative filtering approach.

New user scenario

One common problem in collaborative filtering is the cold start problem, which is when you have a new user with no previous data to draw inferences from. This problem is often solved by asking new users to create a profile and, for instance, rate movies they have seen in the past. While this method puts some burden on the user, it provides some starting data for new users with no rating history.

Resources

The data used in this tutorial is derived from the MovieLens dataset.

Next steps

In this tutorial, you learned how to:

  • Select a machine learning algorithm
  • Prepare and load your data
  • Build and train a model
  • Evaluate a model
  • Deploy and consume a model
