Machine learning tasks in ML.NET

A machine learning task is the type of prediction or inference being made, based on the problem or question being asked and the available data. For example, the classification task assigns data to categories, and the clustering task groups data according to similarity.

Machine learning tasks rely on patterns in the data rather than being explicitly programmed.

This article describes the different machine learning tasks that you can choose from in ML.NET and some common use cases.

Once you have decided which task fits your scenario, you need to choose the best algorithm to train your model. The available algorithms are listed in the section for each task.

Binary classification

A supervised machine learning task that is used to predict which of two classes (categories) an instance of data belongs to. The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of binary classification scenarios include:

  • Understanding sentiment of Twitter comments as either "positive" or "negative".
  • Diagnosing whether a patient has a certain disease or not.
  • Making a decision to mark an email as "spam" or not.
  • Determining if a photo contains a particular item or not, such as a dog or fruit.

For more information, see the Binary classification article on Wikipedia.

Binary classification trainers

You can train a binary classification model using the following algorithms:

Binary classification inputs and outputs

For best results with binary classification, the training data should be balanced (that is, equal numbers of positive and negative training examples). Missing values should be handled before training.

The input label column data must be Boolean. The input features column data must be a fixed-size vector of Single.

These trainers output the following columns:

| Output Column Name | Column Type | Description |
| --- | --- | --- |
| Score | Single | The raw score that was calculated by the model. |
| PredictedLabel | Boolean | The predicted label, based on the sign of the score. A negative score maps to false and a positive score maps to true. |
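
The sign rule in the output columns above can be sketched in a few lines. This is a plain Python illustration, not ML.NET code: the function name `predicted_label` and the sample scores are hypothetical, and treating a score of exactly zero as false is an assumption, since the description only covers negative and positive scores.

```python
def predicted_label(score: float) -> bool:
    """Return True for a positive raw score and False otherwise.

    Assumption: a score of exactly zero is treated as False; the
    description only specifies negative and positive scores.
    """
    return score > 0.0


# Hypothetical raw scores, as a binary classifier might emit them.
scores = [-2.3, 0.7, 4.1, -0.1]
labels = [predicted_label(s) for s in scores]
print(labels)
```

The raw score carries more information than the label alone: its magnitude indicates how far the instance lies from the decision boundary.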

Multiclass classification

A supervised machine learning task that is used to predict the class (category) of an instance of data. The input of a classification algorithm is a set of labeled examples. Each label normally starts as text. It is then run through the TermTransform, which converts it to the Key (numeric) type. The output of a classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of multiclass classification scenarios include:

  • Determining the breed of a dog as a "Siberian Husky", "Golden Retriever", "Poodle", etc.
  • Understanding movie reviews as "positive", "neutral", or "negative".
  • Categorizing hotel reviews as "location", "price", "cleanliness", etc.

For more information, see the Multiclass classification article on Wikipedia.

Note

One-vs-all upgrades any binary classification learner to act on multiclass datasets. For more information, see [Wikipedia](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest).
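
To make the one-vs-all idea concrete, here is a minimal Python sketch, not ML.NET code. It trains one binary scorer per class ("this class" versus "everything else") and predicts the class whose scorer returns the highest score. The binary learner is a simple perceptron chosen only for illustration; the names `train_perceptron`, `train_one_vs_all`, and `predict`, and the toy data, are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def train_perceptron(xs: List[Vector], ys: List[bool], epochs: int = 20) -> Scorer:
    """Train a tiny perceptron (a stand-in binary learner) and return its scoring function."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            target = 1.0 if y else -1.0
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if target * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + target * xi for wi, xi in zip(w, x)]
                b += target
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def train_one_vs_all(xs: List[Vector], labels: List[int], num_classes: int) -> List[Scorer]:
    """Train one binary scorer per class, each separating that class from the rest."""
    return [
        train_perceptron(xs, [label == k for label in labels])
        for k in range(num_classes)
    ]


def predict(scorers: List[Scorer], x: Vector) -> int:
    """Return the zero-based index of the class whose scorer gives the highest score."""
    scores = [scorer(x) for scorer in scorers]
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy, well-separated 2-D data with three classes (illustrative only).
xs = [(2.0, 0.0), (3.0, 0.5), (0.0, 2.0), (0.5, 3.0), (-2.0, -2.0), (-3.0, -2.5)]
labels = [0, 0, 1, 1, 2, 2]

scorers = train_one_vs_all(xs, labels, num_classes=3)
predictions = [predict(scorers, p) for p in [(2.5, 0.2), (0.2, 2.5), (-2.5, -2.0)]]
print(predictions)
```

Comparing raw scores across independently trained binary models is a simplification; a production implementation may calibrate the scores first.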

Multiclass classification trainers

You can train a multiclass classification model using the following training algorithms:

Multiclass classification inputs and outputs

The input label column data must be key type. The feature column must be a fixed-size vector of Single.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Vector of Single | The scores of all classes. A higher value means a higher probability of falling into the associated class. If the i-th element has the largest value, the predicted label index is i. Note that i is a zero-based index. |
| PredictedLabel | key type | The predicted label's index. If its value is i, the actual label is the i-th category in the key-valued input label type. |
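
The relationship between the Score vector and PredictedLabel can be sketched as follows. This is a plain Python illustration, not ML.NET code. The helper names `build_key_map` and `predicted_index` are hypothetical, and assigning indices in order of first appearance is an assumption about how text labels become keys.

```python
from typing import Dict, Sequence


def build_key_map(labels: Sequence[str]) -> Dict[str, int]:
    """Give each distinct text label a zero-based index.

    Assumption: indices are assigned in order of first appearance.
    """
    key_map: Dict[str, int] = {}
    for label in labels:
        if label not in key_map:
            key_map[label] = len(key_map)
    return key_map


def predicted_index(scores: Sequence[float]) -> int:
    """Return the zero-based index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical training labels and a hypothetical score vector for one instance.
training_labels = ["location", "price", "location", "cleanliness"]
key_map = build_key_map(training_labels)
categories = sorted(key_map, key=key_map.get)  # index -> category name

scores = [0.1, 0.7, 0.2]
index = predicted_index(scores)
print(index, categories[index])
```

Mapping the index back through `categories` recovers the original text label, which is usually what an application needs to display.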

Regression

A supervised machine learning task that is used to predict the value of the label from a set of related features. The label can be any real value and is not drawn from a finite set of values as in classification tasks. Regression algorithms model the dependency of the label on its related features to determine how the label will change as the values of the features are varied. The input of a regression algorithm is a set of examples with labels of known values. The output of a regression algorithm is a function, which you can use to predict the label value for any new set of input features. Examples of regression scenarios include:

  • Predicting house prices based on house attributes such as number of bedrooms, location, or size.
  • Predicting future stock prices based on historical data and current market trends.
  • Predicting sales of a product based on advertising budgets.

Regression trainers

You can train a regression model using the following algorithms:

Regression inputs and outputs

The input label column data must be Single.

The trainers for this task output the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The raw score that was predicted by the model. |
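
As an illustration of "the output of a regression algorithm is a function", the sketch below fits a one-feature line by ordinary least squares and returns a predictor. It is plain Python, not an ML.NET trainer. The function `fit_line` and the budget/sales numbers are hypothetical, and a single feature is assumed for brevity.

```python
from typing import Callable, Sequence


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Callable[[float], float]:
    """Fit y = slope * x + intercept by ordinary least squares and return the predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Hypothetical data: advertising budget (feature) versus product sales (label).
budgets = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

predict_sales = fit_line(budgets, sales)
score = predict_sales(5.0)  # predicted label for a new, unseen feature value
print(score)
```

The returned value plays the role of the Score column: a real number rather than a category.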

Clustering

An unsupervised machine learning task that is used to group instances of data into clusters that contain similar characteristics. Clustering can also be used to identify relationships in a dataset that you might not logically derive by browsing or simple observation. The inputs and outputs of a clustering algorithm depend on the methodology chosen. You can take a distribution, centroid, connectivity, or density-based approach. ML.NET currently supports a centroid-based approach using K-Means clustering. Examples of clustering scenarios include:

  • Understanding segments of hotel guests based on habits and characteristics of hotel choices.
  • Identifying customer segments and demographics to help build targeted advertising campaigns.
  • Categorizing inventory based on manufacturing metrics.

Clustering trainer

You can train a clustering model using the following algorithm:

Clustering inputs and outputs

The input features data must be Single. No labels are needed.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Vector of Single | The distances of the given data point to all clusters' centroids. |
| PredictedLabel | key type | The closest cluster's index predicted by the model. |
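
The two clustering outputs can be sketched like this in plain Python (not ML.NET code). The centroids and the data point are hypothetical. Plain Euclidean distance is an assumption; an actual trainer may report a different measure, such as squared distance.

```python
import math
from typing import List, Sequence


def cluster_scores(point: Sequence[float], centroids: Sequence[Sequence[float]]) -> List[float]:
    """Return the distance from the point to every centroid (Euclidean, by assumption)."""
    return [math.dist(point, centroid) for centroid in centroids]


def closest_cluster(scores: Sequence[float]) -> int:
    """Return the zero-based index of the smallest distance."""
    return min(range(len(scores)), key=lambda i: scores[i])


# Hypothetical centroids, as if learned by K-Means, and one new data point.
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
point = (3.0, 3.0)

scores = cluster_scores(point, centroids)
label = closest_cluster(scores)
print(scores, label)
```

Unlike the classification tasks, a smaller score is better here, because the score is a distance rather than a confidence.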

Anomaly detection

This task creates an anomaly detection model by using Principal Component Analysis (PCA). PCA-based anomaly detection helps you build a model in scenarios where it is easy to obtain training data from one class, such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.

An established technique in machine learning, PCA is frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data. PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.

Anomaly detection encompasses many important tasks in machine learning:

  • Identifying transactions that are potentially fraudulent.
  • Learning patterns that indicate that a network intrusion has occurred.
  • Finding abnormal clusters of patients.
  • Checking values entered into a system.

Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been specifically designed to address the core challenges of building and training models with imbalanced data sets.

Anomaly detection trainer

You can train an anomaly detection model using the following algorithm:

Anomaly detection inputs and outputs

The input features must be a fixed-size vector of Single.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The non-negative, unbounded score that was calculated by the anomaly detection model. |
| PredictedLabel | Boolean | A true/false value representing whether the input is an anomaly (PredictedLabel=true) or not (PredictedLabel=false). |

Ranking

A ranking task constructs a ranker from a set of labeled examples. This example set consists of instance groups that can be scored with given criteria. The ranking labels are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups with unknown scores for each instance. ML.NET ranking learners are based on machine-learned ranking.

Ranking training algorithms

You can train a ranking model with the following algorithms:

Ranking inputs and outputs

The input label data type must be key type or Single. The value of the label determines relevance, where higher values indicate higher relevance. If the label is a key type, then the key index is the relevance value, where the smallest index is the least relevant. If the label is a Single, larger values indicate higher relevance.

The feature data must be a fixed-size vector of Single, and the input row group column must be key type.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The unbounded score that was calculated by the model to determine the prediction. |
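
A ranker's Score is typically consumed by sorting the instances within each group from highest to lowest score. The sketch below shows that step in plain Python, not ML.NET code. The function `rank_by_group`, the group identifiers, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_by_group(rows: List[Tuple[str, str, float]]) -> Dict[str, List[str]]:
    """Order the items of each group by descending score.

    Each row is (group_id, item, score); the group_id plays the role of
    the row group column.
    """
    groups: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for group_id, item, score in rows:
        groups[group_id].append((item, score))
    return {
        group_id: [item for item, _ in sorted(items, key=lambda p: p[1], reverse=True)]
        for group_id, items in groups.items()
    }


# Hypothetical scored rows: two query groups with a few documents each.
rows = [
    ("q1", "docA", 0.2),
    ("q1", "docB", 1.5),
    ("q2", "docC", -0.3),
    ("q1", "docD", 0.9),
    ("q2", "docE", 0.4),
]

ranking = rank_by_group(rows)
print(ranking)
```

Scores are only comparable within a group, which is why the sort is applied per group rather than across all rows.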

Recommendation

A recommendation task enables producing a list of recommended products or services. ML.NET uses Matrix factorization (MF), a collaborative filtering algorithm for recommendations when you have historical product rating data in your catalog. For example, you have historical movie rating data for your users and want to recommend other movies they are likely to watch next.

Recommendation training algorithms

You can train a recommendation model with the following algorithm:
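
The scoring side of matrix factorization can be sketched as follows. In this family of models, each user and each item is represented by a short vector of latent factors, and a predicted rating is their dot product. The plain Python below is not ML.NET code and skips training entirely. The factor values, movie names, and the functions `predict_rating` and `recommend` are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def predict_rating(user_factors: Sequence[float], item_factors: Sequence[float]) -> float:
    """Predicted rating = dot product of the user's and the item's latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def recommend(
    user_factors: Sequence[float],
    items: Dict[str, Sequence[float]],
    already_rated: Set[str],
    top_n: int = 2,
) -> List[str]:
    """Return the top_n unrated items, ordered by descending predicted rating."""
    candidates = {
        name: predict_rating(user_factors, factors)
        for name, factors in items.items()
        if name not in already_rated
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:top_n]


# Hypothetical latent factors, as if learned from historical ratings.
user = [1.0, 0.5]
items = {
    "Movie A": [1.0, 1.0],
    "Movie B": [3.0, 0.0],
    "Movie C": [0.0, 2.0],
    "Movie D": [4.0, 4.0],
}
already_rated = {"Movie D"}

top = recommend(user, items, already_rated)
print(top)
```

Training consists of finding factor vectors whose dot products reproduce the known ratings as closely as possible; that step is omitted here.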

Forecasting

The forecasting task uses past time-series data to make predictions about future behavior. Scenarios applicable to forecasting include weather forecasting, seasonal sales predictions, and predictive maintenance.

Forecasting trainers

You can train a forecasting model with the following algorithm:

ForecastBySsa