What is ML.NET and how does it work?

ML.NET gives you the ability to add machine learning to .NET applications, in either online or offline scenarios. With this capability, you can make automatic predictions using the data available to your application. Machine learning applications make use of patterns in the data to make predictions rather than needing to be explicitly programmed.

Central to ML.NET is a machine learning model. The model specifies the steps needed to transform your input data into a prediction. With ML.NET, you can train a custom model by specifying an algorithm, or you can import pre-trained TensorFlow and ONNX models.

Once you have a model, you can add it to your application to make predictions.

ML.NET runs on Windows, Linux, and macOS using .NET Core, or on Windows using .NET Framework. 64-bit is supported on all platforms. 32-bit is supported on Windows, except for TensorFlow, LightGBM, and ONNX-related functionality.

Examples of the types of predictions that you can make with ML.NET:

  • Classification/Categorization: Automatically divide customer feedback into positive and negative categories
  • Regression/Predict continuous values: Predict the price of houses based on size and location
  • Anomaly detection: Detect fraudulent banking transactions
  • Recommendations: Suggest products that online shoppers may want to buy, based on their previous purchases
  • Time series/sequential data: Forecast the weather or product sales
  • Image classification: Categorize pathologies in medical images

Hello ML.NET World

The following snippet demonstrates the simplest ML.NET application. This example constructs a linear regression model to predict house prices using house size and price data.

   using System;
   using Microsoft.ML;
   using Microsoft.ML.Data;

   class Program
   {
       public class HouseData
       {
           public float Size { get; set; }
           public float Price { get; set; }
       }

       public class Prediction
       {
           [ColumnName("Score")]
           public float Price { get; set; }
       }

       static void Main(string[] args)
       {
           MLContext mlContext = new MLContext();

           // 1. Import or create training data
           HouseData[] houseData = {
               new HouseData() { Size = 1.1F, Price = 1.2F },
               new HouseData() { Size = 1.9F, Price = 2.3F },
               new HouseData() { Size = 2.8F, Price = 3.0F },
               new HouseData() { Size = 3.4F, Price = 3.7F } };
           IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

           // 2. Specify data preparation and model training pipeline
           var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
               .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

           // 3. Train model
           var model = pipeline.Fit(trainingData);

           // 4. Make a prediction
           var size = new HouseData() { Size = 2.5F };
           var price = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(model).Predict(size);

           Console.WriteLine($"Predicted price for size: {size.Size*1000} sq ft= {price.Price*100:C}k");

           // Predicted price for size: 2500 sq ft= $261.98k
       }
   }

Code workflow

The following diagram represents the application code structure, as well as the iterative process of model development:

  • Collect and load training data into an IDataView object
  • Specify a pipeline of operations to extract features and apply a machine learning algorithm
  • Train a model by calling Fit() on the pipeline
  • Evaluate the model and iterate to improve
  • Save the model into binary format, for use in an application
  • Load the model back into an ITransformer object
  • Make predictions by calling CreatePredictionEngine.Predict()

Figure: ML.NET application development flow, including components for data generation, pipeline development, model training, model evaluation, and model usage

Let's dig a little deeper into those concepts.

Machine learning model

An ML.NET model is an object that contains transformations to perform on your input data to arrive at the predicted output.

Basic

The most basic model is two-dimensional linear regression, where one continuous quantity is proportional to another, as in the house price example above.

Figure: Linear regression model with bias and weight parameters

The model is simply: $Price = b + Size * w$. The parameters $b$ and $w$ are estimated by fitting a line on a set of (size, price) pairs. The data used to find the parameters of the model is called training data. The inputs of a machine learning model are called features. In this example, $Size$ is the only feature. The ground-truth values used to train a machine learning model are called labels. Here, the $Price$ values in the training data set are the labels.
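To make "fitting a line" concrete, here is a minimal sketch in plain C# (no ML.NET) that estimates $b$ and $w$ with the closed-form ordinary least-squares formulas on the same four training pairs. The class and variable names are illustrative. The Sdca trainer used in the sample is an iterative, regularized algorithm, so the parameters it finds may differ somewhat from these values.

```csharp
using System;
using System.Linq;

class LeastSquaresSketch
{
    static void Main()
    {
        // Training data copied from the house price example.
        float[] sizes = { 1.1F, 1.9F, 2.8F, 3.4F };   // feature
        float[] prices = { 1.2F, 2.3F, 3.0F, 3.7F };  // label

        double meanSize = sizes.Average();
        double meanPrice = prices.Average();

        // w = covariance(Size, Price) / variance(Size)
        double covariance = sizes.Zip(prices, (s, p) => (s - meanSize) * (p - meanPrice)).Sum();
        double variance = sizes.Sum(s => (s - meanSize) * (s - meanSize));
        double w = covariance / variance;

        // b is chosen so the line passes through (meanSize, meanPrice).
        double b = meanPrice - w * meanSize;

        Console.WriteLine($"w = {w:0.###}, b = {b:0.###}");
        Console.WriteLine($"Predicted price for size 2.5: {b + 2.5 * w:0.###}");
    }
}
```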

More complex

A more complex model classifies financial transactions into categories using the transaction text description.

Each transaction description is broken down into a set of features by removing redundant words and characters, and counting word and character combinations. The feature set is used to train a linear model based on the set of categories in the training data. The more similar a new description is to the ones in the training set, the more likely it will be assigned to the same category.

Figure: Text classification model
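A text classification pipeline of this kind might be sketched as follows. The Transaction and CategoryPrediction classes, the column names, and the sample rows are all made up for illustration, and SdcaMaximumEntropy is just one of the linear multiclass trainers that could be used here. With so few rows the prediction is not meaningful; the point is the shape of the pipeline.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class TextClassificationSketch
{
    // Hypothetical input: a free-text description and its category (the label).
    public class Transaction
    {
        public string Description { get; set; }
        public string Category { get; set; }
    }

    // Hypothetical output: the predicted category, converted back to text.
    public class CategoryPrediction
    {
        [ColumnName("PredictedLabel")]
        public string Category { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up training rows, for illustration only.
        Transaction[] transactions =
        {
            new Transaction { Description = "Coffee shop main street", Category = "Food" },
            new Transaction { Description = "Supermarket groceries", Category = "Food" },
            new Transaction { Description = "Monthly train pass", Category = "Transport" },
            new Transaction { Description = "Taxi to airport", Category = "Transport" }
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(transactions);

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Category")
            // Turns the description into numeric word and character n-gram features.
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "Description"))
            // Trains a linear multiclass model on the "Label" and "Features" columns.
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy())
            // Converts the predicted key back into the original category string.
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        var model = pipeline.Fit(trainingData);

        var engine = mlContext.Model.CreatePredictionEngine<Transaction, CategoryPrediction>(model);
        var prediction = engine.Predict(new Transaction { Description = "Coffee and cake" });
        Console.WriteLine($"Predicted category: {prediction.Category}");
    }
}
```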

Both the house price model and the text classification model are linear models. Depending on the nature of your data and the problem you are solving, you can also use decision tree models, generalized additive models, and others. You can find out more about the models in Tasks.

Data preparation

In most cases, the data that you have available isn't suitable to be used directly to train a machine learning model. The raw data needs to be prepared, or pre-processed, before it can be used to find the parameters of your model. Your data may need to be converted from string values to a numerical representation. You might have redundant information in your input data. You may need to reduce or expand the dimensions of your input data. Your data might need to be normalized or scaled.
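As a rough sketch of two of these steps, the following program converts a string column to a numerical representation with one-hot encoding and scales a numeric column to a fixed range. The HouseRecord class, the Neighborhood column, and the output column names are hypothetical additions to the house price scenario.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class DataPreparationSketch
{
    // Hypothetical raw record with one numeric and one string column.
    public class HouseRecord
    {
        public float Size { get; set; }
        public string Neighborhood { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        // Made-up rows, for illustration only.
        HouseRecord[] records =
        {
            new HouseRecord { Size = 1.1F, Neighborhood = "North" },
            new HouseRecord { Size = 1.9F, Neighborhood = "South" },
            new HouseRecord { Size = 3.4F, Neighborhood = "North" }
        };
        IDataView rawData = mlContext.Data.LoadFromEnumerable(records);

        var preparation = mlContext.Transforms.Categorical
            // String values become a numeric indicator vector.
            .OneHotEncoding("NeighborhoodEncoded", "Neighborhood")
            // Numeric values are rescaled based on the observed minimum and maximum.
            .Append(mlContext.Transforms.NormalizeMinMax("SizeScaled", "Size"))
            // All prepared columns are combined into a single feature vector.
            .Append(mlContext.Transforms.Concatenate("Features", "SizeScaled", "NeighborhoodEncoded"));

        IDataView prepared = preparation.Fit(rawData).Transform(rawData);

        // Print the prepared feature vector of each row.
        foreach (float[] features in prepared.GetColumn<float[]>("Features"))
        {
            Console.WriteLine(string.Join(", ", features));
        }
    }
}
```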

The ML.NET tutorials teach you about different data processing pipelines for text, image, numerical, and time-series data used for specific machine learning tasks.

How to prepare your data shows you how to apply data preparation more generally.

You can find an appendix of all of the available transformations in the resources section.

Model evaluation

Once you have trained your model, how do you know how well it will make future predictions? With ML.NET, you can evaluate your model against some new test data.

Each type of machine learning task has metrics used to evaluate the accuracy and precision of the model against the test data set.

For our house price example, we used the Regression task. To evaluate the model, add the following code to the original sample.

        HouseData[] testHouseData =
        {
            new HouseData() { Size = 1.1F, Price = 0.98F },
            new HouseData() { Size = 1.9F, Price = 2.1F },
            new HouseData() { Size = 2.8F, Price = 2.9F },
            new HouseData() { Size = 3.4F, Price = 3.6F }
        };

        var testHouseDataView = mlContext.Data.LoadFromEnumerable(testHouseData);
        var testPriceDataView = model.Transform(testHouseDataView);

        var metrics = mlContext.Regression.Evaluate(testPriceDataView, labelColumnName: "Price");

        Console.WriteLine($"R^2: {metrics.RSquared:0.##}");
        Console.WriteLine($"RMS error: {metrics.RootMeanSquaredError:0.##}");

        // R^2: 0.96
        // RMS error: 0.19

The evaluation metrics tell you that the error is fairly low, and that the correlation between the predicted output and the test output is high. That was easy! In real examples, it takes more tuning to achieve good model metrics.

ML.NET architecture

In this section, we go through the architectural patterns of ML.NET. If you are an experienced .NET developer, some of these patterns will be familiar to you, and some will be less familiar. Hold tight, while we dive in!

An ML.NET application starts with an MLContext object. This singleton object contains catalogs. A catalog is a factory for data loading and saving, transforms, trainers, and model operation components. Each catalog object has methods to create the different types of components:

  • Data loading and saving: DataOperationsCatalog
  • Data preparation: TransformsCatalog
  • Training algorithms:
      • Binary classification: BinaryClassificationCatalog
      • Multiclass classification: MulticlassClassificationCatalog
      • Anomaly detection: AnomalyDetectionCatalog
      • Clustering: ClusteringCatalog
      • Forecasting: ForecastingCatalog
      • Ranking: RankingCatalog
      • Regression: RegressionCatalog
      • Recommendation: RecommendationCatalog (add the Microsoft.ML.Recommender NuGet package)
      • TimeSeries: TimeSeriesCatalog (add the Microsoft.ML.TimeSeries NuGet package)
  • Model usage: ModelOperationsCatalog

You can navigate to the creation methods in each of the above categories. Using Visual Studio, the catalogs show up via IntelliSense.

Figure: IntelliSense for regression trainers

Build the pipeline

Inside each catalog is a set of extension methods. Let's look at how extension methods are used to create a training pipeline.

    var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
        .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

In the snippet, Concatenate and Sdca are both methods in the catalog. They each create an IEstimator object that is appended to the pipeline.

At this point, the objects have only been created. Nothing has been executed yet.

Train the model

Once the objects in the pipeline have been created, data can be used to train the model.

    var model = pipeline.Fit(trainingData);

Calling Fit() uses the input training data to estimate the parameters of the model. This is known as training the model. Remember, the linear regression model above had two model parameters: bias and weight. After the Fit() call, the values of the parameters are known. Most models will have many more parameters than this.
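One way to look at the learned parameters after Fit() is sketched below. It relies on pipeline being declared with var, so that the fitted chain keeps its static type and LastTransformer.Model exposes the linear model's Bias and Weights. The class name is illustrative, and no output values are shown because they depend on the trainer's run.

```csharp
using System;
using Microsoft.ML;

class InspectParametersSketch
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData =
        {
            new HouseData { Size = 1.1F, Price = 1.2F },
            new HouseData { Size = 1.9F, Price = 2.3F },
            new HouseData { Size = 2.8F, Price = 3.0F },
            new HouseData { Size = 3.4F, Price = 3.7F }
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));

        // Fit() estimates the parameters; before this call nothing has been computed.
        var model = pipeline.Fit(trainingData);

        // The last transformer in the chain wraps the trained linear model.
        var linearModel = model.LastTransformer.Model;
        Console.WriteLine($"Bias (b): {linearModel.Bias}");
        Console.WriteLine($"Weight (w): {linearModel.Weights[0]}");
    }
}
```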

You can learn more about model training in How to train your model.

The resulting model object implements the ITransformer interface. That is, the model transforms input data into predictions.

   IDataView predictions = model.Transform(inputData);
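The Transform() call returns another data view rather than plain objects. A minimal sketch of reading those bulk predictions back into .NET objects with CreateEnumerable is shown here; the inputData rows are made up for illustration, and the HouseData and Prediction classes mirror the ones in the sample.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class BulkPredictionSketch
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData =
        {
            new HouseData { Size = 1.1F, Price = 1.2F },
            new HouseData { Size = 1.9F, Price = 2.3F },
            new HouseData { Size = 2.8F, Price = 3.0F },
            new HouseData { Size = 3.4F, Price = 3.7F }
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(trainingData);

        // Made-up inputs to score in bulk; Price is left at its default value.
        HouseData[] newHouses =
        {
            new HouseData { Size = 1.5F },
            new HouseData { Size = 2.5F }
        };
        IDataView inputData = mlContext.Data.LoadFromEnumerable(newHouses);

        // Bulk transform: every row gains a Score column.
        IDataView predictions = model.Transform(inputData);

        // Materialize the Score column as Prediction objects.
        foreach (Prediction p in mlContext.Data.CreateEnumerable<Prediction>(predictions, reuseRowObject: false))
        {
            Console.WriteLine($"Predicted price: {p.Price}");
        }
    }
}
```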

Use the model

You can transform input data into predictions in bulk, or one input at a time. In the house price example, we did both: in bulk for the purpose of evaluating the model, and one at a time to make a new prediction. Let's look at making single predictions.

    var size = new HouseData() { Size = 2.5F };
    var predEngine = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(model);
    var price = predEngine.Predict(size);

The CreatePredictionEngine() method takes an input class and an output class. The field names and/or code attributes determine the names of the data columns used during model training and prediction. You can read about How to make a single prediction in the How-to section.

Data models and schema

At the core of an ML.NET machine learning pipeline are DataView objects.

Each transformation in the pipeline has an input schema (the data names, types, and sizes that the transform expects to see on its input) and an output schema (the data names, types, and sizes that the transform produces). A separate document provides an in-depth explanation of the IDataView interface and its type system.

If the output schema from one transform in the pipeline doesn't match the input schema of the next transform, ML.NET will throw an exception.

A data view object has columns and rows. Each column has a name, a type, and a length. For example, the input columns in the house price example are Size and Price. They are both of type Single, and they are scalar quantities rather than vector ones.

Figure: Example ML.NET data view with house price prediction data
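To see how a schema changes as data moves through the pipeline, you could list the columns before and after the model is applied, along the lines of this sketch. It uses the Schema property of the data view and the GetOutputSchema() method of the fitted model; the PrintSchema helper is a hypothetical name introduced here.

```csharp
using System;
using Microsoft.ML;

class SchemaSketch
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    // Hypothetical helper: prints the name and type of every column in a schema.
    static void PrintSchema(string title, DataViewSchema schema)
    {
        Console.WriteLine(title);
        foreach (DataViewSchema.Column column in schema)
        {
            Console.WriteLine($"  {column.Name}: {column.Type}");
        }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        HouseData[] houseData =
        {
            new HouseData { Size = 1.1F, Price = 1.2F },
            new HouseData { Size = 1.9F, Price = 2.3F },
            new HouseData { Size = 2.8F, Price = 3.0F },
            new HouseData { Size = 3.4F, Price = 3.7F }
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);

        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(trainingData);

        // Input schema: the columns derived from the HouseData properties.
        PrintSchema("Input schema:", trainingData.Schema);

        // Output schema: the input columns plus the columns the pipeline adds.
        // No rows are processed to compute it.
        PrintSchema("Output schema:", model.GetOutputSchema(trainingData.Schema));
    }
}
```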

All ML.NET algorithms look for an input column that is a vector. By default, this vector column is called Features. This is why we concatenated the Size column into a new column called Features in our house price example.

   var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })

All algorithms also create new columns after they have performed a prediction. The fixed names of these new columns depend on the type of machine learning algorithm. For the regression task, one of the new columns is called Score. This is why we attributed our price data with this name.

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

You can find out more about output columns of different machine learning tasks in the Machine Learning Tasks guide.

An important property of DataView objects is that they are evaluated lazily. Data views are only loaded and operated on during model training and evaluation, and data prediction. While you are writing and testing your ML.NET application, you can use the Visual Studio debugger to take a peek at any data view object by calling the Preview method.

    var debug = testPriceDataView.Preview();

You can watch the debug variable in the debugger and examine its contents. Do not use the Preview method in production code, as it significantly degrades performance.

Model deployment

In real-life applications, your model training and evaluation code will be separate from your prediction code. In fact, these two activities are often performed by separate teams. Your model development team can save the model for use in the prediction application.

   mlContext.Model.Save(model, trainingData.Schema,"model.zip");
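On the consuming side, the saved file can be loaded back into an ITransformer with mlContext.Model.Load. The sketch below trains, saves, and reloads in a single program only so that it is self-contained; in practice the load and predict steps would live in the separate prediction application. The "model.zip" path matches the save call above, and the class name is illustrative.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class SaveAndLoadSketch
{
    public class HouseData
    {
        public float Size { get; set; }
        public float Price { get; set; }
    }

    public class Prediction
    {
        [ColumnName("Score")]
        public float Price { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        // Training side: fit the model and save it to disk.
        HouseData[] houseData =
        {
            new HouseData { Size = 1.1F, Price = 1.2F },
            new HouseData { Size = 1.9F, Price = 2.3F },
            new HouseData { Size = 2.8F, Price = 3.0F },
            new HouseData { Size = 3.4F, Price = 3.7F }
        };
        IDataView trainingData = mlContext.Data.LoadFromEnumerable(houseData);
        var pipeline = mlContext.Transforms.Concatenate("Features", new[] { "Size" })
            .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: "Price", maximumNumberOfIterations: 100));
        var model = pipeline.Fit(trainingData);
        mlContext.Model.Save(model, trainingData.Schema, "model.zip");

        // Prediction side: load the model and the input schema it was saved with.
        ITransformer loadedModel = mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);

        var predEngine = mlContext.Model.CreatePredictionEngine<HouseData, Prediction>(loadedModel);
        var price = predEngine.Predict(new HouseData { Size = 2.5F });
        Console.WriteLine($"Predicted price from loaded model: {price.Price}");
    }
}
```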

Next steps

  • Learn how to build applications using different machine learning tasks with more realistic data sets in the tutorials.

  • Learn about specific topics in more depth in the How To Guides.

  • If you're super keen, you can dive straight into the API Reference documentation.