What is ML.NET and how does it work?

ML.NET gives you the ability to add machine learning to .NET applications. With this capability, you can make automatic predictions using the data available to your application. This article explains the basics of machine learning in ML.NET.

Examples of the types of predictions that you can make with ML.NET include:

  • Classification/Categorization: Automatically divide customer feedback into positive and negative categories
  • Regression/Predict continuous values: Predict the price of houses based on size and location
  • Anomaly Detection: Detect fraudulent banking transactions
  • Recommendations: Suggest products that online shoppers may want to buy, based on their previous purchases

Hello ML.NET World

The code in the following snippet demonstrates the simplest ML.NET application. This example constructs a linear regression model to predict house prices using house size and price data. In your real-life applications, your data and model will be much more complex.

   using System;
   using Microsoft.ML;
   using Microsoft.ML.Data;
   
   class Program
   {
       public class HouseData
       {
           public float Size { get; set; }
           public float Price { get; set; }
       }
   
       public class Prediction
       {
           [ColumnName("Score")]
           public float Price { get; set; }
       }
   
       static void Main(string[] args)
       {
           MLContext mlContext = new MLContext();
   
           // 1. Import or create training data
           HouseData[] houseData = {
               new HouseData() { Size = 1.1F, Price = 1.2F },
               new HouseData() { Size = 1.9F, Price = 2.3F },
               new HouseData() { Size = 2.8F, Price = 3.0F },
               new HouseData() { Size = 3.4F, Price = 3.7F } };
           IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

           // 2. Specify data preparation and model training pipeline
           var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
               .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
   
           // 3. Train model
           var model = pipeline.Fit(trainingData);
   
           // 4. Make a prediction
           var size = new HouseData() { Size = 2.5F };
           var price = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(model).Predict(size);

           Console.WriteLine($"Predicted price for size: {size.Size*1000} sq ft= {price.Price*100:C}k");

           // Predicted price for size: 2500 sq ft= $261.98k
       }
   } 

Code workflow

The following diagram represents the application code structure, as well as the iterative process of model development:

  • Collect and load training data into an IDataView object
  • Specify a pipeline of operations to extract features and apply a machine learning algorithm
  • Train a model by calling Fit() on the pipeline
  • Evaluate the model and iterate to improve
  • Save the model into binary format, for use in an application
  • Load the model back into an ITransformer object
  • Make predictions by calling CreatePredictionEngine() and then Predict() on the resulting engine

ML.NET application development flow, including components for data generation, pipeline development, model training, model evaluation, and model usage

Let's dig a little deeper into those concepts.

Machine learning model

An ML.NET model is an object that contains transformations to perform on your input data to arrive at the predicted output.

Basic

The most basic model is two-dimensional linear regression, where one continuous quantity is modeled as a linear function of another, as in the house price example above.

Linear regression model with bias and weight parameters

The model is simply: $Price = b + Size * w$. The parameters $b$ (the bias) and $w$ (the weight) are estimated by fitting a line on a set of (size, price) pairs. The data used to find the parameters of the model is called training data. The inputs of a machine learning model are called features. In this example, $Size$ is the only feature. The ground-truth values used to train a machine learning model are called labels. Here, the $Price$ values in the training data set are the labels.
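As a worked illustration of what fitting a line means, the equations below apply ordinary least squares to the four (size, price) pairs from the Hello ML.NET World sample, writing $x$ for $Size$ and $y$ for $Price$. Ordinary least squares is an assumption made here for illustration only. The sample itself trains with the Sdca trainer, an iterative method, so the parameters it finds (and its prediction for a size of 2.5) can differ from this closed-form fit.

$$
\bar{x} = \frac{1.1 + 1.9 + 2.8 + 3.4}{4} = 2.3, \qquad
\bar{y} = \frac{1.2 + 2.3 + 3.0 + 3.7}{4} = 2.55
$$

$$
w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{3.21}{3.06} \approx 1.049, \qquad
b = \bar{y} - w\,\bar{x} \approx 0.137
$$

$$
Price(2.5) = b + 2.5 \cdot w \approx 0.137 + 2.623 = 2.76
$$

With these values, every prediction of the model is just one multiplication and one addition applied to the single feature.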

More complex

A more complex model classifies financial transactions into categories using the transaction text description.

Each transaction description is broken down into a set of features by removing redundant words and characters, and counting word and character combinations. The feature set is used to train a linear model based on the set of categories in the training data. The more similar a new description is to the ones in the training set, the more likely it will be assigned to the same category.

Text classification model
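A minimal sketch of what such a text classification pipeline could look like is shown below. It is not the pipeline behind the model described above: the Transaction and CategoryPrediction classes, the sample rows, and the choice of FeaturizeText with the SdcaMaximumEntropy trainer are all assumptions made for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    // Hypothetical input type: a transaction description and its category (the label).
    public class Transaction
    {
        public string Description { get; set; }
        public string Category { get; set; }
    }

    // Hypothetical output type: the predicted category, mapped back to text.
    public class CategoryPrediction
    {
        [ColumnName("PredictedLabel")]
        public string Category { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up training rows, for illustration only.
        Transaction[] transactions = {
            new Transaction { Description = "Coffee shop downtown", Category = "Food" },
            new Transaction { Description = "Grocery store purchase", Category = "Food" },
            new Transaction { Description = "Monthly bus pass", Category = "Transport" },
            new Transaction { Description = "Taxi to airport", Category = "Transport" } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(transactions);

        // Convert the text label to a key, turn the description into a numeric
        // feature vector, train a linear multiclass model, and map the predicted
        // key back to its original text value.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Category")
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "Description"))
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        var model = pipeline.Fit(trainingData);

        var engine = mlContext.Model.CreatePredictionEngine<Transaction, CategoryPrediction>(model);
        var prediction = engine.Predict(new Transaction { Description = "Taxi from airport" });
        Console.WriteLine($"Predicted category: {prediction.Category}");
    }
}
```

Four rows are far too few to train a useful classifier; the point of the sketch is only the shape of the pipeline, where text is converted to features before a linear trainer is applied.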

Both the house price model and the text classification model are linear models. Depending on the nature of your data and the problem you are solving, you can also use decision tree models, generalized additive models, and others. You can find out more about the models in Tasks.

Data preparation

In most cases, the data that you have available isn't suitable to be used directly to train a machine learning model. The raw data needs to be prepared, or pre-processed, before it can be used to find the parameters of your model. Your data may need to be converted from string values to a numerical representation. You might have redundant information in your input data. You may need to reduce or expand the dimensions of your input data. Your data might need to be normalized or scaled.
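The sketch below illustrates two of these preparation steps: converting a string column to a numerical representation and normalizing a numeric column. The Listing class, its Location column, and the sample rows are hypothetical and only serve the example.

```csharp
using System;
using Microsoft.ML;

class Program
{
    // Hypothetical raw record: Location is a string, so it cannot be used
    // as a numeric feature directly.
    public class Listing
    {
        public float Size { get; set; }
        public string Location { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        // Made-up rows, for illustration only.
        Listing[] listings = {
            new Listing { Size = 1.1F, Location = "North", Price = 1.2F },
            new Listing { Size = 1.9F, Location = "South", Price = 2.3F },
            new Listing { Size = 2.8F, Location = "North", Price = 3.0F },
            new Listing { Size = 3.4F, Location = "East", Price = 3.7F } };
        IDataView rawData = mlContext.Data.LoadFromEnumerable(listings);

        // String to numeric: one-hot encode Location.
        // Scaling: rescale Size based on the minimum and maximum seen in the data.
        // Finally, combine both into the single Features vector column.
        var dataPrep = mlContext.Transforms.Categorical.OneHotEncoding("LocationEncoded", "Location")
            .Append(mlContext.Transforms.NormalizeMinMax("Size"))
            .Append(mlContext.Transforms.Concatenate("Features", "Size", "LocationEncoded"));

        IDataView prepared = dataPrep.Fit(rawData).Transform(rawData);

        // List the columns that exist after preparation.
        foreach (var column in prepared.Schema)
            Console.WriteLine($"{column.Name}: {column.Type}");
    }
}
```

A trainer can then be appended to dataPrep in the same way the house price sample appends Sdca after Concatenate.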

The ML.NET tutorials teach you about different data processing pipelines for text, image, numerical, and time-series data used for specific machine learning tasks.

How to prepare your data shows you how to apply data preparation more generally.

You can find an appendix of all of the available transformations in the resources section.

Model evaluation

Once you have trained your model, how do you know how well it will make future predictions? With ML.NET, you can evaluate your model against some new test data.

Each type of machine learning task has metrics used to evaluate the accuracy and precision of the model against the test data set.

For our house price example, we used the Regression task. To evaluate the model, add the following code to the original sample.

        HouseData[] testHouseData =
        {
            new HouseData() { Size = 1.1F, Price = 0.98F },
            new HouseData() { Size = 1.9F, Price = 2.1F },
            new HouseData() { Size = 2.8F, Price = 2.9F },
            new HouseData() { Size = 3.4F, Price = 3.6F }
        };

        var testHouseDataView = mlContext.Data.LoadFromEnumerable(testHouseData);
        var testPriceDataView = model.Transform(testHouseDataView);
                
        var metrics = mlContext.Regression.Evaluate(testPriceDataView, labelColumnName: "Price");

        Console.WriteLine($"R^2: {metrics.RSquared:0.##}");
        Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.##}");

        // R^2: 0.96
        // RMS error: 0.19

The evaluation metrics tell you that the error is low-ish, and that correlation between the predicted output and the test output is high. That was easy! In real examples, it takes more tuning to achieve good model metrics.
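When no separate test set exists, one common approach is to hold back part of the available data for evaluation. The sketch below does this with TrainTestSplit. The 100 rows are synthetic, generated from an assumed linear relationship plus noise, and the 20% test fraction is an arbitrary choice for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 1);

        // Synthetic data, for illustration only: price is roughly linear in size, plus noise.
        var random = new Random(1);
        HouseData[] houses = Enumerable.Range(0, 100)
            .Select(i =>
            {
                float size = 1.0F + 0.03F * i;
                float noise = (float)(random.NextDouble() - 0.5) * 0.2F;
                return new HouseData { Size = size, Price = 0.1F + 1.05F * size + noise };
            })
            .ToArray();
        IDataView allData = mlContext.Data.LoadFromEnumerable(houses);

        // Hold back 20% of the rows; the model never sees them during training.
        var split = mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(split.TrainSet);

        // Evaluate on the held-back rows only.
        var metrics = mlContext.Regression.Evaluate(model.Transform(split.TestSet), labelColumnName: "Price");
        Console.WriteLine($"R^2: {metrics.RSquared:0.##}");
        Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.##}");
        Console.WriteLine($"Mean absolute error: {metrics.MeanAbsoluteError:0.##}");
    }
}
```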

ML.NET architecture

In this section, we go through the architectural patterns of ML.NET. If you are an experienced .NET developer, some of these patterns will be familiar to you, and some will be less familiar. Hold tight, while we dive in!

An ML.NET application starts with an MLContext object. This singleton object contains catalogs. A catalog is a factory for data loading and saving, transforms, trainers, and model operation components. Each catalog object has methods to create the different types of components:

  • Data loading and saving: DataOperationsCatalog
  • Data preparation: TransformsCatalog
  • Training algorithms:
      • Binary classification: BinaryClassificationCatalog
      • Multiclass classification: MulticlassClassificationCatalog
      • Anomaly detection: AnomalyDetectionCatalog
      • Clustering: ClusteringCatalog
      • Forecasting: ForecastingCatalog
      • Ranking: RankingCatalog
      • Regression: RegressionCatalog
      • Recommendation: RecommendationCatalog (add the Microsoft.ML.Recommender NuGet package)
      • TimeSeries: TimeSeriesCatalog (add the Microsoft.ML.TimeSeries NuGet package)
  • Model usage: ModelOperationsCatalog

You can navigate to the creation methods in each of the above categories. Using Visual Studio, the catalogs show up via IntelliSense.

IntelliSense for regression trainers
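As a small sketch of how the catalogs hang off MLContext, the snippet below reads a few of them into typed variables and uses two of them as factories. The seed argument and the variable names are choices made for this example, not part of the original sample.

```csharp
using System;
using Microsoft.ML;

class Program
{
    static void Main()
    {
        // The seed is optional; a fixed value is used here so that
        // randomized components behave repeatably between runs.
        var mlContext = new MLContext(seed: 0);

        // Each property on MLContext exposes one catalog.
        DataOperationsCatalog data = mlContext.Data;
        TransformsCatalog transforms = mlContext.Transforms;
        RegressionCatalog regression = mlContext.Regression;
        ModelOperationsCatalog modelOperations = mlContext.Model;

        Console.WriteLine(data.GetType().Name);
        Console.WriteLine(modelOperations.GetType().Name);

        // Catalog methods act as factories: they create components
        // but do not touch any data yet.
        var concatenate = transforms.Concatenate("Features", "Size");
        var trainer = regression.Trainers.Sdca(labelColumnName: "Price");

        Console.WriteLine(concatenate.GetType().Name);
        Console.WriteLine(trainer.GetType().Name);
    }
}
```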

Build the pipeline

Inside each catalog is a set of extension methods. Let's look at how extension methods are used to create a training pipeline.

    var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
        .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

In the snippet, Concatenate and Sdca are both methods in the catalog. They each create an IEstimator object that is appended to the pipeline.

At this point, the objects are created only. No execution has happened.

Train the model

Once the objects in the pipeline have been created, data can be used to train the model.

    var model = pipeline.Fit(trainingData);

Calling Fit() uses the input training data to estimate the parameters of the model. This is known as training the model. Remember, the linear regression model above had two model parameters: bias and weight. After the Fit() call, the values of the parameters are known. Most models will have many more parameters than this.
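One way to look at the learned bias and weight after Fit() is sketched below. It assumes the same two-step pipeline as the house price sample, where the last transformer in the chain is the trained linear regression predictor. No expected values are shown because the numbers depend on the trainer's iterative optimization.

```csharp
using System;
using Microsoft.ML;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F },
            new HouseData() { Size = 2.8F, Price = 3.0F },
            new HouseData() { Size = 3.4F, Price = 3.7F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

        var model = pipeline.Fit(trainingData);

        // LastTransformer is the trained regression predictor at the end of the chain;
        // its Model property holds the linear model parameters.
        var parameters = model.LastTransformer.Model;
        Console.WriteLine($"Bias (b): {parameters.Bias}");
        Console.WriteLine($"Weight (w): {parameters.Weights[0]}");
    }
}
```

There is one weight per element of the Features vector, so this model has exactly one entry in Weights.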

You can learn more about model training in How to train your model.

The resulting model object implements the ITransformer interface. That is, the model transforms input data into predictions.

   IDataView predictions = model.Transform(inputData);

Use the model

You can transform input data into predictions in bulk, or one input at a time. In the house price example, we did both: in bulk for the purpose of evaluating the model, and one at a time to make a new prediction. Let's look at making single predictions.

    var size = new HouseData() { Size = 2.5F };
    var predEngine = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(model);
    var price = predEngine.Predict(size);

The CreatePredictionEngine() method takes an input class and an output class. The field names and/or code attributes determine the names of the data columns used during model training and prediction. You can read about How to make a single prediction in the How-to section.
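For comparison with single predictions, the following sketch makes predictions in bulk with Transform() and then reads the results back as Prediction objects with CreateEnumerable. The newHouses array is made up for the example; the classes and pipeline are the ones from the house price sample.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F },
            new HouseData() { Size = 2.8F, Price = 3.0F },
            new HouseData() { Size = 3.4F, Price = 3.7F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(trainingData);

        // Made-up inputs to score in one pass; only Size is set.
        HouseData[] newHouses = {
            new HouseData() { Size = 1.5F },
            new HouseData() { Size = 2.5F },
            new HouseData() { Size = 3.0F } };
        IDataView newData = mlContext.Data.LoadFromEnumerable(newHouses);

        // Bulk prediction: transform the whole data view at once.
        IDataView predictions = model.Transform(newData);

        // Read the Score column back into Prediction objects.
        var results = mlContext.Data.CreateEnumerable<Prediction>(predictions, reuseRowObject: false);
        foreach (var result in results)
            Console.WriteLine($"Predicted price: {result.Price}");
    }
}
```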

Data models and schema

At the core of an ML.NET machine learning pipeline are DataView objects.

Each transformation in the pipeline has an input schema (data names, types, and sizes that the transform expects to see on its input); and an output schema (data names, types, and sizes that the transform produces after the transformation).

If the output schema from one transform in the pipeline doesn't match the input schema of the next transform, ML.NET will throw an exception.
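The sketch below provokes such a schema mismatch on purpose by training the Sdca trainer without the Concatenate step, so that no Features column exists in the input. The exact exception type and message are not specified in this article, so the code catches a general Exception and prints whatever ML.NET reports.

```csharp
using System;
using Microsoft.ML;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        // Deliberately skip Concatenate: the data view has Size and Price columns,
        // but the trainer looks for a vector column named Features.
        var trainerOnly = mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100);

        try
        {
            trainerOnly.Fit(trainingData);
            Console.WriteLine("Training succeeded (not expected in this sketch).");
        }
        catch (Exception ex)
        {
            // Report the type and message of the schema error.
            Console.WriteLine($"{ex.GetType().Name}: {ex.Message}");
        }
    }
}
```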

A data view object has columns and rows. Each column has a name and a type and a length. For example: the input columns in the house price example are Size and Price. They are both of type Single (float in C#), and they are scalar quantities rather than vector ones.

ML.NET data view example with house price prediction data
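To see the column names and types for yourself, you could enumerate the Schema property of a data view, as in this sketch. It applies only the Concatenate step from the house price sample, so that both scalar columns and one vector column appear.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        // Add the Features vector column next to the original scalar columns.
        IDataView transformed = mlContext.Transforms
            .Concatenate("Features", new[] { "Size" })
            .Fit(trainingData)
            .Transform(trainingData);

        foreach (var column in transformed.Schema)
        {
            // Vector columns carry a size; everything else is treated as scalar here.
            string shape = column.Type is VectorDataViewType vector
                ? $"vector of size {vector.Size}"
                : "scalar";
            Console.WriteLine($"{column.Name}: {column.Type} ({shape})");
        }
    }
}
```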

All ML.NET algorithms look for an input column that is a vector. By default, this vector column is called Features. This is why we concatenated the Size column into a new column called Features in our house price example.

   var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })

All algorithms also create new columns after they have performed a prediction. The fixed names of these new columns depend on the type of machine learning algorithm. For the regression task, one of the new columns is called Score. This is why we attributed our price data with this name.

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

You can find out more about output columns of different machine learning tasks in the Machine Learning Tasks guide.

An important property of DataView objects is that they are evaluated lazily. Data views are only loaded and operated on during model training and evaluation, and data prediction. While you are writing and testing your ML.NET application, you can use the Visual Studio debugger to take a peek at any data view object by calling the Preview method.

    var debug = testPriceDataView.Preview();

You can watch the debug variable in the debugger and examine its contents. Do not use the Preview method in production code, as it significantly degrades performance.

Model Deployment

In real-life applications, your model training and evaluation code will be separate from your prediction code. In fact, these two activities are often performed by separate teams. Your model development team can save the model for use in the prediction application.

   mlContext.Model.Save(model, trainingData.Schema,"model.zip");
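The prediction application can then load the saved file back into an ITransformer and create a prediction engine from it. The sketch below shows both halves in one program for brevity; the file name model.zip matches the line above, and the second MLContext stands in for a separate prediction application, which is an assumption of this example.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

    static void Main()
    {
        // Training side: train and save the model.
        var mlContext = new MLContext();
        HouseData[] houseData = {
            new HouseData() { Size = 1.1F, Price = 1.2F },
            new HouseData() { Size = 1.9F, Price = 2.3F },
            new HouseData() { Size = 2.8F, Price = 3.0F },
            new HouseData() { Size = 3.4F, Price = 3.7F } };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(trainingData);
        mlContext.Model.Save(model, trainingData.Schema, "model.zip");

        // Prediction side: a fresh context loads the file; no training code is needed.
        var predictionContext = new MLContext();
        ITransformer loadedModel = predictionContext.Model.Load("model.zip", out DataViewSchema inputSchema);

        var engine = predictionContext.Model.CreatePredictionEngine<HouseData, Prediction>(loadedModel);
        var price = engine.Predict(new HouseData() { Size = 2.5F });
        Console.WriteLine($"Predicted price: {price.Price}");
    }
}
```

The inputSchema returned by Load describes the columns the model expects, which can help the prediction application confirm that its input class matches.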

Where to now?

You can learn how to build applications using different machine learning tasks with more realistic data sets in the tutorials.

Or you can learn about specific topics in more depth in the How To Guides.

And if you're super keen, you can dive straight into the API Reference documentation!