Train and evaluate a model

Learn how to build machine learning models, collect metrics, and measure performance with ML.NET. Although this sample trains a regression model, the concepts apply to most of the other algorithms.

Split data for training and testing

The goal of a machine learning model is to identify patterns within training data. These patterns are then used to make predictions on new data.

The data can be modeled by a class like HousingData.

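// LoadColumn, VectorType, and ColumnName come from Microsoft.ML.Data
public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}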

The following data is loaded into an IDataView.

HousingData[] housingData = new HousingData[]
{
    new HousingData
    {
        Size = 600f,
        HistoricalPrices = new float[] { 100000f, 125000f, 122000f },
        CurrentPrice = 170000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 200000f, 250000f, 230000f },
        CurrentPrice = 225000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 126000f, 130000f, 200000f },
        CurrentPrice = 195000f
    },
    new HousingData
    {
        Size = 850f,
        HistoricalPrices = new float[] { 150000f, 175000f, 210000f },
        CurrentPrice = 205000f
    },
    new HousingData
    {
        Size = 900f,
        HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
        CurrentPrice = 210000f
    },
    new HousingData
    {
        Size = 550f,
        HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
        CurrentPrice = 180000f
    }
};

// Load the in-memory array into an IDataView
MLContext mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromEnumerable(housingData);

Use the TrainTestSplit method to split the data into train and test sets. The result is a TrainTestData object that contains two IDataView members, one for the train set and the other for the test set. The split percentage is determined by the testFraction parameter. The snippet below holds out 20 percent of the original data for the test set.

DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;
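
To see what the split produces in practice, the following is a minimal sketch that counts the rows in each set. It assumes the Microsoft.ML NuGet package is referenced; the Row class, the SplitCounts helper, the 100 synthetic rows, and the fixed MLContext seed are illustrative choices that are not part of the original sample. Because rows are assigned to the sets at random, the counts are close to the requested fraction rather than guaranteed to match it exactly.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical single-column row type used only for this sketch.
public class Row
{
    public float Value { get; set; }
}

public static class SplitSketch
{
    // Hypothetical helper: builds rowCount synthetic rows, splits them,
    // and returns the number of rows that landed in each set.
    public static (int Train, int Test) SplitCounts(int rowCount, double testFraction)
    {
        // A fixed seed (an assumption of this sketch) makes the split repeatable.
        var mlContext = new MLContext(seed: 0);

        Row[] rows = Enumerable.Range(0, rowCount)
            .Select(i => new Row { Value = i })
            .ToArray();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        var split = mlContext.Data.TrainTestSplit(data, testFraction: testFraction);

        // Materialize each set to count its rows.
        int train = mlContext.Data
            .CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false)
            .Count();
        int test = mlContext.Data
            .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false)
            .Count();

        return (train, test);
    }

    public static void Main()
    {
        var (train, test) = SplitCounts(100, 0.2);
        Console.WriteLine($"Train rows: {train}, test rows: {test}");
    }
}
```

With only six rows, as in the housing sample, a 20 percent hold-out can leave the test set with very few rows, which is worth keeping in mind when reading the evaluation metrics later on.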

Prepare the data

The data needs to be pre-processed before training a machine learning model. More information on data preparation can be found in the data prep how-to article and on the transforms page.

ML.NET algorithms have constraints on input column types. Additionally, default values are used for input and output column names when no values are specified.

Working with expected column types

The machine learning algorithms in ML.NET expect a float vector of known size as input. Apply the VectorType attribute to your data model when all of the data is already in numerical format and is intended to be processed together (for example, image pixels).

If the data is not all numerical and you want to apply different data transformations to each column individually, use the Concatenate method after all of the columns have been processed to combine the individual columns into a single feature vector that is output to a new column.

The following snippet combines the Size and HistoricalPrices columns into a single feature vector that is output to a new column called Features. Because the columns differ in scale, NormalizeMinMax is applied to the Features column to normalize the data.

// Define Data Prep Estimator
// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"));

// Create data prep transformer
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(trainData);

// Apply transforms to training data
IDataView transformedTrainingData = dataPrepTransformer.Transform(trainData);
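
To make the shape of the Features column concrete, here is a plain C# sketch (it does not use ML.NET) of what concatenating Size and HistoricalPrices yields for the first housing row. The BuildFeatures helper is hypothetical and only mirrors the column order passed to Concatenate("Features", "Size", "HistoricalPrices"); normalization is deliberately left out.

```csharp
using System;
using System.Linq;

public static class ConcatenateSketch
{
    // Hypothetical helper: joins a scalar and a vector in the order given,
    // mirroring the input column order passed to Concatenate.
    public static float[] BuildFeatures(float size, float[] historicalPrices) =>
        new[] { size }.Concat(historicalPrices).ToArray();

    public static void Main()
    {
        // First row of the housing sample: one Size value plus three prices.
        float[] features = BuildFeatures(600f, new[] { 100000f, 125000f, 122000f });

        Console.WriteLine(features.Length); // prints 4
        Console.WriteLine(string.Join(", ", features));
    }
}
```

The first slot holds a value in the hundreds while the remaining slots hold values in the hundreds of thousands, which is the difference in scale that motivates normalizing the Features column.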

Working with default column names

ML.NET algorithms use default column names when none are specified. All trainers have a parameter called featureColumnName for the inputs of the algorithm and, when applicable, a parameter called labelColumnName for the expected value. By default, those values are Features and Label respectively.

Because the Concatenate method is used during pre-processing to create a new column called Features, there is no need to specify the feature column name in the parameters of the algorithm: the column already exists in the pre-processed IDataView. The label column is CurrentPrice, but because the ColumnName attribute is used in the data model, ML.NET renames the CurrentPrice column to Label, which removes the need to provide the labelColumnName parameter to the machine learning algorithm estimator.

If you don't want to use the default column names, pass in the names of the feature and label columns as parameters when defining the machine learning algorithm estimator, as demonstrated by the following snippet:

var UserDefinedColumnSdcaEstimator = mlContext.Regression.Trainers.Sdca(labelColumnName: "MyLabelColumnName", featureColumnName: "MyFeatureColumnName");

Caching data

By default, data is lazily loaded or streamed when it is processed, which means that trainers may load the data from disk and iterate over it multiple times during training. Caching is therefore recommended for datasets that fit into memory, to reduce the number of times data is loaded from disk. Caching is done as part of an EstimatorChain by using AppendCacheCheckpoint.

It's recommended to use AppendCacheCheckpoint before any trainers in the pipeline.

In the following EstimatorChain, adding AppendCacheCheckpoint before the StochasticDualCoordinateAscent trainer caches the results of the previous estimators for later use by the trainer.

// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
// 3. Cache prepared data
// 4. Use Sdca trainer to train the model
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .AppendCacheCheckpoint(mlContext)
        .Append(mlContext.Regression.Trainers.Sdca());

Train the machine learning model

Once the data is pre-processed, use the Fit method to train the machine learning model with the StochasticDualCoordinateAscent regression algorithm.

// Define StochasticDualCoordinateAscent regression algorithm estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// Build machine learning model
var trainedModel = sdcaEstimator.Fit(transformedTrainingData);

Extract model parameters

After the model has been trained, extract the learned ModelParameters for inspection or retraining. The LinearRegressionModelParameters provide the bias and the learned coefficients, or weights, of the trained model.

var trainedModelParameters = trainedModel.Model as LinearRegressionModelParameters;
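
As a rough illustration of how a bias and a set of weights turn into a prediction, the sketch below computes the score of a linear regression model by hand: the bias plus the weighted sum of the features. The Score helper and the numbers are made up for this example and are not learned from the housing data. In ML.NET, the corresponding values should be available through the Bias and Weights properties of the extracted parameters.

```csharp
using System;
using System.Collections.Generic;

public static class LinearScoreSketch
{
    // Hypothetical helper: score = bias + sum(weights[i] * features[i]).
    public static float Score(
        float bias,
        IReadOnlyList<float> weights,
        IReadOnlyList<float> features)
    {
        if (weights.Count != features.Count)
            throw new ArgumentException("weights and features must have the same length");

        float score = bias;
        for (int i = 0; i < weights.Count; i++)
            score += weights[i] * features[i];
        return score;
    }

    public static void Main()
    {
        // Made-up parameters and inputs, chosen so the arithmetic is easy to follow.
        float[] weights = { 2f, 0.5f };
        float[] features = { 3f, 4f };

        // 1 + (2 * 3) + (0.5 * 4) = 9
        Console.WriteLine(Score(1f, weights, features)); // prints 9
    }
}
```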


Other models have parameters that are specific to their tasks. For example, the K-Means algorithm puts data into clusters based on centroids, and KMeansModelParameters contains a property that stores these learned centroids. To learn more, visit the Microsoft.ML.Trainers API documentation and look for classes that contain ModelParameters in their name.

Evaluate model quality

To help choose the best-performing model, it is essential to evaluate its performance on test data. Use the Evaluate method to measure various metrics for the trained model.


The Evaluate method produces different metrics depending on which machine learning task was performed. For more details, visit the Microsoft.ML.Data API documentation and look for classes that contain Metrics in their name.

// Measure trained model performance
// Apply data prep transformer to test data
IDataView transformedTestData = dataPrepTransformer.Transform(testData);

// Use trained model to make inferences on test data
IDataView testDataPredictions = trainedModel.Transform(transformedTestData);

// Extract model metrics and get RSquared
RegressionMetrics trainedModelMetrics = mlContext.Regression.Evaluate(testDataPredictions);
double rSquared = trainedModelMetrics.RSquared;

In the previous code sample:

  1. The test data set is pre-processed using the data preparation transforms previously defined.
  2. The trained machine learning model is used to make predictions on the test data.
  3. In the Evaluate method, the values in the CurrentPrice column of the test data set (exposed as Label) are compared against the Score column of the newly output predictions to calculate the metrics for the regression model. One of these metrics, R-Squared, is stored in the rSquared variable.


In this small example, R-Squared falls outside the 0-1 range because of the limited size of the data; a negative value means the model fits the test data worse than always predicting the mean label would. In a real-world scenario, you should expect to see a value between 0 and 1.
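
To show how R-Squared can leave the 0-1 range, the following plain C# sketch computes it from the textbook definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The RSquared helper and the sample values are hypothetical; this is not taken from ML.NET's implementation, which may differ in detail.

```csharp
using System;
using System.Linq;

public static class RSquaredSketch
{
    // Textbook coefficient of determination: 1 - SS_res / SS_tot.
    // Not meaningful when all actual values are equal (SS_tot is 0).
    public static double RSquared(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("arrays must be non-empty and of equal length");

        double mean = actual.Average();
        // Squared error of the predictions.
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        // Squared error of always predicting the mean.
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - ssRes / ssTot;
    }

    public static void Main()
    {
        double[] actual = { 1, 2, 3 };

        // Perfect predictions.
        Console.WriteLine(RSquared(actual, new double[] { 1, 2, 3 })); // prints 1
        // Always predicting the mean.
        Console.WriteLine(RSquared(actual, new double[] { 2, 2, 2 })); // prints 0
        // Predictions worse than the mean.
        Console.WriteLine(RSquared(actual, new double[] { 3, 2, 1 })); // prints -3
    }
}
```

With only one or two rows in a held-out test set, a few poor predictions are enough to push the value below zero.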