How to choose an ML.NET algorithm

For each ML.NET task, there are multiple training algorithms to choose from. Which one to choose depends on the problem you are trying to solve, the characteristics of your data, and the compute and storage resources you have available. Training a machine learning model is an iterative process: you might need to try several algorithms to find the one that works best.

Algorithms operate on features. Features are numerical values computed from your input data, and they are the optimal inputs for machine learning algorithms. You transform your raw input data into features using one or more data transforms. For example, text data is transformed into a set of word counts and word-combination counts. Once features have been extracted from a raw data type using data transforms, the data is referred to as featurized: for example, featurized text or featurized image data.
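
The text example above can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python, not the ML.NET text transform; the function name `featurize_text` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def featurize_text(text):
    """Count single words and adjacent word pairs in a piece of text.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    words = text.lower().split()
    word_counts = Counter(words)
    # zip pairs each word with its successor, giving two-word combinations.
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts


word_counts, pair_counts = featurize_text("the cat sat on the mat")
print(word_counts["the"])           # 2
print(pair_counts[("the", "cat")])  # 1
```

The two counters together are the numerical features; a real pipeline would also map each word or pair to a fixed position in a feature vector.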

Trainer = Algorithm + Task

An algorithm is the math that executes to produce a model. Different algorithms produce models with different characteristics.

With ML.NET, the same algorithm can be applied to different tasks. For example, Stochastic Dual Coordinate Ascent (SDCA) can be used for binary classification, multiclass classification, and regression. The difference is in how the output of the algorithm is interpreted to match the task.
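
To make "interpreting the output" concrete, here is a rough Python sketch of how one raw score (or one score per class) could be read for each of the three tasks. The function names, the sigmoid and softmax mappings, and the 0.5 threshold are assumptions chosen for illustration; the source does not describe how ML.NET performs this step internally.

```python
import math


def as_regression(score):
    """Regression: the raw score is the predicted value."""
    return score


def as_binary(score):
    """Binary classification: squash the score to a probability, then threshold."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= 0.5, probability


def as_multiclass(scores):
    """Multiclass: one score per class, softmax to probabilities, pick the largest."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    return probabilities.index(max(probabilities)), probabilities


print(as_regression(2.5))
print(as_binary(-2.0))
print(as_multiclass([0.1, 2.0, -1.0]))
```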

For each algorithm/task combination, ML.NET provides a component that executes the training algorithm and does the interpretation. These components are called trainers. For example, the SdcaRegressionTrainer uses the Stochastic Dual Coordinate Ascent algorithm applied to the regression task.

Linear algorithms

Linear algorithms produce a model that calculates scores from a linear combination of the input data and a set of weights. The weights are parameters of the model estimated during training.
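
As a minimal sketch of what "a linear combination of the input data and a set of weights" means, the following Python function computes such a score. The `bias` term is an assumption added for completeness (the text above mentions only weights), and the numbers are made up for the example.

```python
def linear_score(features, weights, bias=0.0):
    """Return the weighted sum of the features, plus an optional bias.

    `bias` is an assumed extra parameter, not mentioned in the source text.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 + 0.25
print(linear_score([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.25))  # 4.75
```

Training a linear model amounts to choosing the weights; prediction is just this sum, which is why linear models are cheap to evaluate.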

Linear algorithms work well for features that are linearly separable.

Before training with a linear algorithm, the features should be normalized. This prevents one feature from having more influence over the result than the others simply because it is measured on a larger scale.
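
One common way to normalize is min-max scaling, which maps each feature column onto the range 0 to 1. The sketch below is an illustration in plain Python; the source does not say which normalizer to use, and the column names and values are invented for the example.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so the smallest is 0.0 and the largest 1.0."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# Hypothetical house data: the raw scales differ by three orders of magnitude.
area_sq_ft = [500, 1500, 2500]
rooms = [1, 2, 3]

print(min_max_normalize(area_sq_ft))  # [0.0, 0.5, 1.0]
print(min_max_normalize(rooms))       # [0.0, 0.5, 1.0]
```

After scaling, both columns span the same range, so a weight of a given size has a comparable effect on either feature.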

In general, linear algorithms are scalable, fast, and cheap both to train and to predict with. They scale with the number of features and approximately with the size of the training data set.

Linear algorithms make multiple passes over the training data. If your dataset fits into memory, adding a cache checkpoint to your ML.NET pipeline before appending the trainer makes training run faster.

Linear trainers

| Algorithm | Properties | Trainers |
|---|---|---|
| Averaged perceptron | Best for text classification | AveragedPerceptronTrainer |
| Stochastic dual coordinate ascent | Tuning not needed for good default performance | SdcaLogisticRegressionBinaryTrainer, SdcaNonCalibratedBinaryTrainer, SdcaMaximumEntropyMulticlassTrainer, SdcaNonCalibratedMulticlassTrainer, SdcaRegressionTrainer |
| L-BFGS | Use when the number of features is large. Produces logistic regression training statistics, but doesn't scale as well as the AveragedPerceptronTrainer | LbfgsLogisticRegressionBinaryTrainer, LbfgsMaximumEntropyMulticlassTrainer, LbfgsPoissonRegressionTrainer |
| Symbolic stochastic gradient descent | Fastest and most accurate linear binary classification trainer. Scales well with the number of processors | SymbolicSgdLogisticRegressionBinaryTrainer |

Decision tree algorithms

Decision tree algorithms create a model that contains a series of decisions: effectively a flow chart through the data values.

Features do not need to be linearly separable to use this type of algorithm. Features also do not need to be normalized, because the individual values in the feature vector are used independently in the decision process.

Decision tree algorithms are generally very accurate.

Except for Generalized Additive Models (GAMs), tree models can lack explainability when the number of features is large.

Decision tree algorithms take more resources and do not scale as well as linear ones do. They do perform well on datasets that fit into memory.

Boosted decision trees are an ensemble of small trees in which each tree scores the input data and passes the score on to the next tree, which produces a better score, and so on. Each tree in the ensemble improves on the previous one.
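
The boosting idea can be sketched with the smallest possible trees: one-split "stumps" on a single feature, each fitted to the error left by the trees before it. This is generic gradient boosting for squared error written in plain Python, not the LightGBM or FastTree implementation; all names, the learning rate, and the toy data are assumptions for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean, then let each stump correct the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((predict_boosted(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
for rounds in (0, 1, 20):
    print(rounds, mean_squared_error(fit_boosted(xs, ys, rounds=rounds), xs, ys))
```

The printed training error shrinks as trees are added, which is the sense in which each tree improves on the ones before it.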

Decision tree trainers

| Algorithm | Properties | Trainers |
|---|---|---|
| Light gradient boosted machine | Fastest and most accurate of the binary classification tree trainers. Highly tunable | LightGbmBinaryTrainer, LightGbmMulticlassTrainer, LightGbmRegressionTrainer, LightGbmRankingTrainer |
| Fast tree | Use for featurized image data. Resilient to unbalanced data. Highly tunable | FastTreeBinaryTrainer, FastTreeRegressionTrainer, FastTreeTweedieTrainer, FastTreeRankingTrainer |
| Fast forest | Works well with noisy data | FastForestBinaryTrainer, FastForestRegressionTrainer |
| Generalized additive model (GAM) | Best for problems that perform well with tree algorithms but where explainability is a priority | GamBinaryTrainer, GamRegressionTrainer |

Matrix factorization

| Properties | Trainers |
|---|---|
| Best for sparse categorical data, with large datasets | FieldAwareFactorizationMachineTrainer |

Meta algorithms

These trainers create a multiclass trainer from a binary trainer. Use them with AveragedPerceptronTrainer, LbfgsLogisticRegressionBinaryTrainer, SymbolicSgdLogisticRegressionBinaryTrainer, LightGbmBinaryTrainer, FastTreeBinaryTrainer, FastForestBinaryTrainer, and GamBinaryTrainer.

| Algorithm | Properties | Trainers |
|---|---|---|
| One versus all | This multiclass classifier trains one binary classifier for each class, which distinguishes that class from all other classes. Is limited in scale by the number of classes to categorize | OneVersusAllTrainer<BinaryClassificationTrainer> |
| Pairwise coupling | This multiclass classifier trains a binary classification algorithm on each pair of classes. Is limited in scale by the number of classes, as each combination of two classes must be trained | PairwiseCouplingTrainer<BinaryClassificationTrainer> |
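
The scaling limits of the two meta algorithms follow from how many binary problems each one has to train. The Python sketch below only enumerates those problems; it trains nothing, and the function names and class labels are invented for the example.

```python
from itertools import combinations


def one_versus_all_tasks(classes):
    """One binary problem per class: that class against everything else."""
    return [(c, "rest") for c in classes]


def pairwise_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


classes = ["cat", "dog", "bird", "fish"]
print(len(one_versus_all_tasks(classes)))  # 4
print(len(pairwise_tasks(classes)))        # 6
```

With K classes, one versus all needs K binary classifiers while pairwise coupling needs K(K-1)/2, so the pairwise count grows much faster: 10 classes already mean 45 pairs.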

K-Means

| Properties | Trainers |
|---|---|
| Use for clustering | KMeansTrainer |

Principal component analysis

| Properties | Trainers |
|---|---|
| Use for anomaly detection | RandomizedPcaTrainer |

Naive Bayes

| Properties | Trainers |
|---|---|
| Use this multiclass classification trainer when the features are independent and the training dataset is small | NaiveBayesMulticlassTrainer |

Prior trainer

| Properties | Trainers |
|---|---|
| Use this binary classification trainer to baseline the performance of other trainers. To be effective, other trainers should achieve better metrics than the prior trainer | PriorTrainer |
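
To see why such a baseline matters, consider a sketch of a prior-style predictor that ignores the features entirely and only learns the proportion of positive labels. This is an illustration in plain Python under that assumption, not a description of how PriorTrainer is implemented; the names and the toy labels are made up.

```python
def fit_prior(labels):
    """Return the fraction of positive labels in the training data."""
    return sum(1 for label in labels if label) / len(labels)


def prior_accuracy(prior, labels):
    """Accuracy obtained by always guessing the more common class."""
    guess = prior >= 0.5
    return sum(1 for label in labels if label == guess) / len(labels)


# Hypothetical unbalanced data: 9 positives, 1 negative.
labels = [True] * 9 + [False]
prior = fit_prior(labels)
print(prior)                          # 0.9
print(prior_accuracy(prior, labels))  # 0.9
```

On this data, a trainer that reports 90% accuracy has learned nothing beyond the class balance; only results above the baseline show that the features are being used.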