Train and evaluate a model

Learn how to build machine learning models, collect metrics, and measure performance with ML.NET. Although this sample trains a regression model, the concepts apply to most other algorithms.

Split data for training and testing

The goal of a machine learning model is to identify patterns within training data. These patterns are then used to make predictions on new data.

Given the following data model:

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

Load the data into an IDataView:

HousingData[] housingData = new HousingData[]
{
    new HousingData
    {
        Size = 600f,
        HistoricalPrices = new float[] { 100000f, 125000f, 122000f },
        CurrentPrice = 170000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 200000f, 250000f, 230000f },
        CurrentPrice = 225000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[] { 126000f, 130000f, 200000f },
        CurrentPrice = 195000f
    },
    new HousingData
    {
        Size = 850f,
        HistoricalPrices = new float[] { 150000f, 175000f, 210000f },
        CurrentPrice = 205000f
    },
    new HousingData
    {
        Size = 900f,
        HistoricalPrices = new float[] { 155000f, 190000f, 220000f },
        CurrentPrice = 210000f
    },
    new HousingData
    {
        Size = 550f,
        HistoricalPrices = new float[] { 99000f, 98000f, 130000f },
        CurrentPrice = 180000f
    }
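};

// Create the MLContext and load the in-memory array into an IDataView
MLContext mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromEnumerable(housingData);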

Use the TrainTestSplit method to split the data into train and test sets. The result is a TrainTestData object that contains two IDataView members, one for the train set and the other for the test set. The split percentage is determined by the testFraction parameter. The snippet below holds out 20 percent of the original data for the test set.

DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;
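
To make the hold-out idea concrete, the following is a minimal sketch in plain C# that shuffles rows and sets aside a fraction for testing. It is not how TrainTestSplit is implemented; the HoldoutSplitSketch class, its Split method, and the seed parameter are hypothetical names introduced only for illustration. ML.NET assigns rows to each set randomly, so on a very small data set the actual sizes may not match the requested fraction exactly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class HoldoutSplitSketch
{
    // Shuffles a copy of the rows and holds out roughly testFraction of them for testing.
    public static (List<T> Train, List<T> Test) Split<T>(
        IReadOnlyList<T> rows, double testFraction, int seed)
    {
        if (testFraction < 0 || testFraction > 1)
        {
            throw new ArgumentOutOfRangeException(nameof(testFraction));
        }

        var shuffled = rows.ToList();
        var rng = new Random(seed);

        // Fisher-Yates shuffle so the held-out rows are chosen at random.
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        // Round to the nearest whole number of test rows.
        int testCount = (int)Math.Round(rows.Count * testFraction);

        var test = shuffled.Take(testCount).ToList();
        var train = shuffled.Skip(testCount).ToList();
        return (train, test);
    }
}
```

With six rows and a test fraction of 0.2, this sketch places one row in the test set and five in the train set.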

Prepare the data

The data needs to be pre-processed before training a machine learning model. More information on data preparation can be found in the data prep how-to article as well as on the transforms page.

ML.NET algorithms have constraints on input column types. Additionally, default values are used for input and output column names when no values are specified.

Working with expected column types

The machine learning algorithms in ML.NET expect a float vector of known size as input. Apply the VectorType attribute to your data model when all of the data is already in numerical format and is intended to be processed together (for example, image pixels).

If the data is not all numerical and you want to apply different data transformations to each column individually, use the Concatenate method after all of the columns have been processed to combine the individual columns into a single feature vector that is output to a new column.

The following snippet combines the Size and HistoricalPrices columns into a single feature vector that is output to a new column called Features. Because the columns differ in scale, NormalizeMinMax is applied to the Features column to normalize the data.

// Define Data Prep Estimator
// 1. Concatenate Size and HistoricalPrices into a single feature vector output to a new column called Features
// 2. Normalize Features vector
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", "Size", "HistoricalPrices")
        .Append(mlContext.Transforms.NormalizeMinMax("Features"));

// Create data prep transformer
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(trainData);

// Apply transforms to training data
IDataView transformedTrainingData = dataPrepTransformer.Transform(trainData);
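
As a rough illustration of why normalization helps when columns differ in scale, the sketch below applies textbook min-max rescaling to a single column in plain C#. The MinMaxSketch class is a hypothetical helper, not ML.NET code, and the exact values produced by NormalizeMinMax may differ depending on its options (for example, its fixZero parameter).

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class MinMaxSketch
{
    // Rescales values to the range [0, 1] using the observed minimum and maximum.
    public static double[] Normalize(double[] values)
    {
        double min = values.Min();
        double max = values.Max();
        double range = max - min;

        // A constant column has no spread; map every value to 0 to avoid dividing by zero.
        if (range == 0)
        {
            return values.Select(_ => 0.0).ToArray();
        }

        return values.Select(v => (v - min) / range).ToArray();
    }
}
```

Applied to the six Size values in the sample data (600, 1000, 1000, 850, 900, 550), the smallest value (550) maps to 0 and the largest (1000) maps to 1, so Size ends up on a scale comparable to any other column rescaled the same way.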

Working with default column names

ML.NET algorithms use default column names when none are specified. All trainers have a parameter called featureColumnName for the inputs of the algorithm and, when applicable, a parameter called labelColumnName for the expected value. By default, those values are Features and Label respectively.

Because the Concatenate method is used during pre-processing to create a new column called Features, there is no need to specify the feature column name in the parameters of the algorithm: the column already exists in the pre-processed IDataView. The label column is CurrentPrice, but since the ColumnName attribute is used in the data model, ML.NET renames the CurrentPrice column to Label, which removes the need to provide the labelColumnName parameter to the machine learning algorithm estimator.

If you don't want to use the default column names, pass in the names of the feature and label columns as parameters when defining the machine learning algorithm estimator, as demonstrated by the following snippet:

var UserDefinedColumnSdcaEstimator = mlContext.Regression.Trainers.Sdca(labelColumnName: "MyLabelColumnName", featureColumnName: "MyFeatureColumnName");

Train the machine learning model

Once the data is pre-processed, use the Fit method to train the machine learning model with the StochasticDualCoordinateAscent regression algorithm.

// Define StochasticDualCoordinateAscent regression algorithm estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// Build machine learning model
var trainedModel = sdcaEstimator.Fit(transformedTrainingData);

Extract model parameters

After the model has been trained, extract the learned ModelParameters for inspection or retraining. The LinearRegressionModelParameters provide the bias and the learned coefficients (weights) of the trained model.

var trainedModelParameters = trainedModel.Model as LinearRegressionModelParameters;
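
To show how a bias and a set of weights turn a feature vector into a prediction, here is a minimal sketch of a linear model's scoring step in plain C#. The LinearModelSketch class and the numbers in the usage note are hypothetical and are not values learned from the sample data.

```csharp
using System;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class LinearModelSketch
{
    // Computes bias + sum(weights[i] * features[i]), the score of a linear regression model.
    public static double Predict(double bias, double[] weights, double[] features)
    {
        if (weights.Length != features.Length)
        {
            throw new ArgumentException("weights and features must have the same length.");
        }

        double score = bias;
        for (int i = 0; i < weights.Length; i++)
        {
            score += weights[i] * features[i];
        }

        return score;
    }
}
```

For example, with a bias of 10, weights of 2 and 0.5, and features of 3 and 4, the score is 10 + 2 * 3 + 0.5 * 4 = 18.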

Note

Other models have parameters that are specific to their tasks. For example, the K-Means algorithm puts data into clusters based on centroids, and KMeansModelParameters contains a property that stores these learned centroids. To learn more, visit the Microsoft.ML.Trainers API documentation and look for classes that contain ModelParameters in their name.

Evaluate model quality

To help choose the best-performing model, it is essential to evaluate its performance on test data. Use the Evaluate method to measure various metrics for the trained model.

Note

The Evaluate method produces different metrics depending on which machine learning task was performed. For more details, visit the Microsoft.ML.Data API documentation and look for classes that contain Metrics in their name.

// Measure trained model performance
// Apply data prep transformer to test data
IDataView transformedTestData = dataPrepTransformer.Transform(testData);

// Use trained model to make inferences on test data
IDataView testDataPredictions = trainedModel.Transform(transformedTestData);

// Extract model metrics and get RSquared
RegressionMetrics trainedModelMetrics = mlContext.Regression.Evaluate(testDataPredictions);
double rSquared = trainedModelMetrics.RSquared;

In the previous code sample:

  1. The test data set is pre-processed using the data preparation transforms previously defined.
  2. The trained machine learning model is used to make predictions on the test data.
  3. In the Evaluate method, the values in the CurrentPrice column of the test data set (named Label in the IDataView) are compared against the Score column of the newly output predictions to calculate the metrics for the regression model. One of these metrics, R-Squared, is stored in the rSquared variable.

Note

In this small example, the R-Squared value is not in the range of 0 to 1 because of the limited size of the data. In a real-world scenario, you should expect to see a value between 0 and 1.
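
To see how R-Squared can fall outside the range of 0 to 1, the sketch below computes it in plain C# using the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The RSquaredSketch class is a hypothetical helper, and it is an assumption that this definition matches what Evaluate reports.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class RSquaredSketch
{
    // R-Squared = 1 - (sum of squared residuals / total sum of squares around the mean).
    // If every actual value is identical, the total sum of squares is 0 and the result is not meaningful.
    public static double Compute(double[] actual, double[] predicted)
    {
        if (actual.Length != predicted.Length)
        {
            throw new ArgumentException("actual and predicted must have the same length.");
        }

        double mean = actual.Average();

        // Squared error of the model's predictions.
        double ssRes = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();

        // Squared error of always predicting the mean of the actual values.
        double ssTot = actual.Sum(a => (a - mean) * (a - mean));

        return 1.0 - ssRes / ssTot;
    }
}
```

With actual values 1, 2, 3, perfect predictions give 1, always predicting the mean (2, 2, 2) gives 0, and the reversed predictions 3, 2, 1 give -3. A negative value therefore means the model performs worse than simply predicting the mean of the labels.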