
Machine learning algorithm cheat sheet for Azure Machine Learning Studio

The Azure Machine Learning Studio Algorithm Cheat Sheet helps you choose the right algorithm for a predictive analytics model.

Azure Machine Learning Studio has a large library of algorithms from the regression, classification, clustering, and anomaly detection families. Each family is designed to address a different type of machine learning problem.

Download: Machine learning algorithm cheat sheet

Download the cheat sheet here: Machine Learning Algorithm Cheat Sheet (11x17 in.)


Download the Machine Learning Studio Algorithm Cheat Sheet and print it in tabloid size to keep it handy when choosing an algorithm.


For help in using this cheat sheet to choose the right algorithm, plus a deeper discussion of the different types of machine learning algorithms and how they're used, see How to choose algorithms for Microsoft Azure Machine Learning Studio.

Notes and terminology definitions for the Machine Learning Studio algorithm cheat sheet

  • The suggestions offered in this algorithm cheat sheet are approximate rules of thumb. Some can be bent, and some can be flagrantly violated. The cheat sheet is intended to suggest a starting point. Don't be afraid to run a head-to-head competition between several algorithms on your data. There is simply no substitute for understanding the principles of each algorithm and the system that generated your data.

  • Every machine learning algorithm has its own style or inductive bias. For a specific problem, several algorithms may be appropriate, and one may be a better fit than the others. But it's not always possible to know beforehand which is the best fit. In cases like these, several algorithms are listed together in the cheat sheet. An appropriate strategy is to try one algorithm and, if the results are not yet satisfactory, try the others. Here's an example from the Azure AI Gallery of an experiment that tries several algorithms against the same data and compares the results: Compare Multi-class Classifiers: Letter recognition.

  • There are three main categories of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

    • In supervised learning, each data point is labeled or associated with a category or value of interest. An example of a categorical label is assigning an image as either a 'cat' or a 'dog'. An example of a value label is the sale price associated with a used car. The goal of supervised learning is to study many labeled examples like these, and then to be able to make predictions about future data points. For example, a trained model might label new photos with the correct animal or assign accurate sale prices to other used cars. This is a popular and useful type of machine learning. All of the modules in Azure Machine Learning Studio are supervised learning algorithms except for K-Means Clustering.

    • In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure. This can mean grouping it into clusters, as K-means does, or finding different ways of looking at complex data so that it appears simpler.

    • In reinforcement learning, the algorithm gets to choose an action in response to each data point. It is a common approach in robotics, where the set of sensor readings at one point in time is a data point, and the algorithm must choose the robot's next action. It's also a natural fit for Internet of Things applications. The learning algorithm also receives a reward signal a short time later, indicating how good the decision was. Based on this signal, the algorithm modifies its strategy in order to achieve the highest reward. Currently there are no reinforcement learning algorithm modules in Azure Machine Learning Studio.

  • Bayesian methods make the assumption of statistically independent data points. This means that the unmodeled variability in one data point is uncorrelated with that in others; that is, it can't be predicted from them. For example, if the data being recorded is the number of minutes until the next subway train arrives, two measurements taken a day apart are statistically independent. However, two measurements taken a minute apart are not statistically independent: the value of one is highly predictive of the value of the other.

  • Boosted decision tree regression takes advantage of feature overlap or interaction among features. That means that, in any given data point, the value of one feature is somewhat predictive of the value of another. For example, in daily high/low temperature data, knowing the low temperature for the day allows you to make a reasonable guess at the high. The information contained in the two features is somewhat redundant.

  • Classifying data into more than two categories can be done either by using an inherently multi-class classifier, or by combining a set of two-class classifiers into an ensemble. In the ensemble approach, there is a separate two-class classifier for each class. Each one separates the data into two categories: "this class" and "not this class." Then these classifiers vote on the correct assignment of the data point. This is the operational principle behind One-vs-All Multiclass.

  • Several methods, including logistic regression and the Bayes point machine, assume linear class boundaries. That is, they assume that the boundaries between classes are approximately straight lines (or hyperplanes in the more general case). Often this is a characteristic of the data that you don't know until after you've tried to separate it, but it's something that typically can be learned by visualizing the data beforehand. If the class boundaries look very irregular, stick with decision trees, decision jungles, support vector machines, or neural networks.

  • Neural networks can be used with categorical variables by creating a dummy variable for each category and setting it to 1 in cases where the category applies and 0 where it doesn't.
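
The dummy-variable encoding described in the last note can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Azure Machine Learning Studio implements it; the function name `one_hot_encode` and the sample `colors` data are hypothetical and chosen for this example.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 dummy variable per category."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    # A dummy variable is 1 where its category applies and 0 where it doesn't.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature, invented for illustration.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so a network that receives these columns as inputs sees every category as a separate numeric feature rather than as an arbitrary ordering of labels.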
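
The one-vs-all scheme mentioned in the notes above can also be sketched in plain Python. The sketch below makes several assumptions that are not from the cheat sheet: it uses a simple perceptron as a stand-in two-class learner, it resolves the "vote" by picking the class whose classifier reports the highest score, and the toy data and all function names are hypothetical. It is not the implementation behind the One-vs-All Multiclass module.

```python
def score(weights, bias, x):
    """Linear score; a positive value means "this class"."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(points, targets, max_epochs=1000):
    """Train a two-class perceptron; targets are 1 ("this class") or 0."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(points, targets):
            predicted = 1 if score(weights, bias, x) > 0 else 0
            if predicted != target:
                step = target - predicted  # +1 or -1
                weights = [w + step * xi for w, xi in zip(weights, x)]
                bias += step
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return weights, bias


def train_one_vs_all(points, labels):
    """Train one "this class" vs "not this class" model per class."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else 0 for label in labels]
        models[cls] = train_perceptron(points, targets)
    return models


def predict(models, x):
    """Assign x to the class whose two-class model scores it highest."""
    return max(models, key=lambda cls: score(*models[cls], x))


# Hypothetical, linearly separable toy data: three clusters in 2-D.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (5, 0), (6, 1), (5, 1), (6, 0),
          (0, 5), (1, 6), (0, 6), (1, 5)]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

models = train_one_vs_all(points, labels)
print([predict(models, p) for p in [(0.5, 0.5), (5.5, 0.5), (0.5, 5.5)]])
```

Any two-class learner could take the place of the perceptron here; the perceptron was chosen only because it fits in a few lines and is guaranteed to converge on linearly separable data like this toy set.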

Next steps