How to select algorithms for Azure Machine Learning

A common question is "Which machine learning algorithm should I use?" The algorithm you select depends primarily on two different aspects of your data science scenario:

  • What do you want to do with your data? Specifically, what is the business question you want to answer by learning from your past data?

  • What are the requirements of your data science scenario? Specifically, what are the accuracy, training time, linearity, number of parameters, and number of features that your solution supports?

[Figure: Considerations for choosing algorithms: What do you want to know?]

Business scenarios and the Machine Learning Algorithm Cheat Sheet

The Azure Machine Learning Algorithm Cheat Sheet helps you with the first consideration: What do you want to do with your data? On the Machine Learning Algorithm Cheat Sheet, look for the task you want to do, and then find an Azure Machine Learning designer algorithm for the predictive analytics solution.

Machine Learning designer provides a comprehensive portfolio of algorithms, such as Multiclass Decision Forest, Recommendation systems, Neural Network Regression, Multiclass Neural Network, and K-Means Clustering. Each algorithm is designed to address a different type of machine learning problem. See the Machine Learning designer algorithm and module reference for a complete list, along with documentation about how each algorithm works and how to tune parameters to optimize the algorithm.

Note

To download the machine learning algorithm cheat sheet, go to the Azure Machine Learning Algorithm Cheat Sheet page.

Along with the guidance in the Azure Machine Learning Algorithm Cheat Sheet, keep in mind other requirements when choosing a machine learning algorithm for your solution. The following are additional factors to consider: accuracy, training time, linearity, number of parameters, and number of features.

Comparison of machine learning algorithms

Some learning algorithms make particular assumptions about the structure of the data or the desired results. If you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster training times.

The following table summarizes some of the most important characteristics of algorithms from the classification, regression, and clustering families:

| Algorithm | Accuracy | Training time | Linearity | Parameters | Notes |
|---|---|---|---|---|---|
| Classification family | | | | | |
| Two-class logistic regression | Good | Fast | Yes | 4 | |
| Two-class decision forest | Excellent | Moderate | No | 5 | Shows slower scoring times. Not recommended with One-vs-All Multiclass, because of slower scoring times caused by thread locking in accumulating tree predictions |
| Two-class boosted decision tree | Excellent | Moderate | No | 6 | Large memory footprint |
| Two-class neural network | Good | Moderate | No | 8 | |
| Two-class averaged perceptron | Good | Moderate | Yes | 4 | |
| Two-class support vector machine | Good | Fast | Yes | 5 | Good for large feature sets |
| Multiclass logistic regression | Good | Fast | Yes | 4 | |
| Multiclass decision forest | Excellent | Moderate | No | 5 | Shows slower scoring times |
| Multiclass boosted decision tree | Excellent | Moderate | No | 6 | Tends to improve accuracy with some small risk of less coverage |
| Multiclass neural network | Good | Moderate | No | 8 | |
| One-vs-all multiclass | - | - | - | - | See properties of the two-class method selected |
| Regression family | | | | | |
| Linear regression | Good | Fast | Yes | 4 | |
| Decision forest regression | Excellent | Moderate | No | 5 | |
| Boosted decision tree regression | Excellent | Moderate | No | 6 | Large memory footprint |
| Neural network regression | Good | Moderate | No | 8 | |
| Clustering family | | | | | |
| K-means clustering | Excellent | Moderate | Yes | 8 | A clustering algorithm |

Requirements for a data science scenario

Once you know what you want to do with your data, you need to determine additional requirements for your solution.

Make choices and possibly trade-offs for the following requirements:

  • Accuracy
  • Training time
  • Linearity
  • Number of parameters
  • Number of features

Accuracy

Accuracy in machine learning measures the effectiveness of a model as the proportion of true results to total cases. In Machine Learning designer, the Evaluate Model module computes a set of industry-standard evaluation metrics. You can use this module to measure the accuracy of a trained model.
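As a minimal sketch of the definition above (not the Evaluate Model module itself), accuracy can be computed as the number of correct predictions divided by the total number of cases. The function name `accuracy` and the sample labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels and predictions: 6 of the 8 predictions are correct.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(labels, predictions)
print(score)  # 0.75
```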

Getting the most accurate answer possible isn't always necessary. Sometimes an approximation is adequate, depending on what you want to use it for. If that is the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Approximate methods also naturally tend to avoid overfitting.

There are three ways to use the Evaluate Model module:

  • Generate scores over your training data in order to evaluate the model
  • Generate scores on the model, but compare those scores to scores on a reserved testing set
  • Compare scores for two different but related models, using the same set of data
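The difference between the first two approaches can be sketched in a few lines. The example below is an illustration only: `MemorizingClassifier` is a hypothetical toy model that memorizes its training rows, and the datasets are invented. It scores perfectly on its own training data but noticeably worse on a reserved testing set, which is why the second approach usually gives a more honest estimate.

```python
from collections import Counter


class MemorizingClassifier:
    """Toy model (hypothetical): looks up rows seen in training,
    otherwise falls back to the most common training label."""

    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(tuple(x), self.default) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented training data (4 rows of class 0, 3 rows of class 1).
train_X = [(1, 1), (2, 1), (3, 2), (2, 3), (6, 5), (7, 7), (8, 6)]
train_y = [0, 0, 0, 0, 1, 1, 1]
# Reserved testing set: only (1, 1) also appears in the training data.
test_X = [(2, 2), (7, 6), (1, 1), (9, 9)]
test_y = [0, 1, 0, 1]

model = MemorizingClassifier().fit(train_X, train_y)
train_acc = accuracy(train_y, model.predict(train_X))
test_acc = accuracy(test_y, model.predict(test_X))
print(train_acc, test_acc)  # 1.0 0.5
```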

For a complete list of metrics and approaches you can use to evaluate the accuracy of machine learning models, see Evaluate Model module.

Training time

In supervised learning, training means using historical data to build a machine learning model that minimizes errors. The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy; one typically accompanies the other.

In addition, some algorithms are more sensitive to the number of data points than others. You might choose a specific algorithm because you have a time limitation, especially when the data set is large.
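One way to get a feel for this sensitivity is to time the same training routine on datasets of increasing size. The sketch below uses only the standard library; `train_mean_model` is a made-up stand-in for a real training algorithm, and the measured times will vary from machine to machine.

```python
import time


def train_mean_model(values, epochs=50, lr=0.1):
    """Toy 'training' (hypothetical): estimate the mean of the data
    by gradient descent on the squared error."""
    estimate = 0.0
    for _ in range(epochs):
        grad = sum(estimate - v for v in values) / len(values)
        estimate -= lr * grad
    return estimate


def timed_training(values):
    """Return the trained model and the wall-clock training time in seconds."""
    start = time.perf_counter()
    model = train_mean_model(values)
    return model, time.perf_counter() - start


timings = {}
models = {}
for n in (1_000, 10_000, 50_000):
    data = [float(i % 10) for i in range(n)]  # synthetic data with mean 4.5
    models[n], timings[n] = timed_training(data)
    print(f"{n:>6} data points: {timings[n]:.4f} s")
```

Because every epoch touches every data point, the training time of this toy routine grows roughly in proportion to the dataset size. Other algorithms scale differently.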

In Machine Learning designer, creating and using a machine learning model is typically a three-step process:

  1. Configure a model by choosing a particular type of algorithm, and then defining its parameters or hyperparameters.

  2. Provide a dataset that is labeled and has data compatible with the algorithm. Connect both the data and the model to the Train Model module.

  3. After training is completed, use the trained model with one of the scoring modules to make predictions on new data.
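The three steps above can be mirrored in plain code. The sketch below is not the designer's API: the `Perceptron` class, its `train` and `score` methods, and the data are all hypothetical, and it implements a plain perceptron rather than the designer's averaged perceptron module.

```python
class Perceptron:
    """Plain two-class perceptron (illustrative only)."""

    def __init__(self, learning_rate=1.0, max_iterations=20):
        # Step 1: configure the model by fixing its hyperparameters.
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = 0.0

    def _predict_one(self, x):
        activation = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return 1 if activation > 0 else 0

    def train(self, X, y):
        # Step 2: fit the configured model to a labeled dataset.
        self.weights = [0.0] * len(X[0])
        self.bias = 0.0
        for _ in range(self.max_iterations):
            errors = 0
            for x, label in zip(X, y):
                error = label - self._predict_one(x)
                if error:
                    self.weights = [w + self.learning_rate * error * xi
                                    for w, xi in zip(self.weights, x)]
                    self.bias += self.learning_rate * error
                    errors += 1
            if errors == 0:  # every training row is classified correctly
                break
        return self

    def score(self, X):
        # Step 3: use the trained model to predict labels for new data.
        return [self._predict_one(x) for x in X]


# Invented, linearly separable training data.
train_X = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
train_y = [0, 0, 0, 1, 1, 1]

model = Perceptron(learning_rate=1.0, max_iterations=20).train(train_X, train_y)
new_predictions = model.score([(0, 1), (7, 7)])
print(new_predictions)  # [0, 1]
```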

Linearity

Linearity in statistics and machine learning means that there is a linear relationship between a variable and a constant in your dataset. For example, linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog).

Many machine learning algorithms make use of linearity. In Azure Machine Learning designer, they include logistic regression (two-class and multiclass) and support vector machines.

Linear regression algorithms assume that data trends follow a straight line. This assumption isn't bad for some problems, but for others it reduces accuracy. Despite their drawbacks, linear algorithms are popular as a first strategy. They tend to be algorithmically simple and fast to train.

[Figure: Nonlinear class boundary]

Nonlinear class boundary: Relying on a linear classification algorithm would result in low accuracy.

[Figure: Data with a nonlinear trend]

Data with a nonlinear trend: Using a linear regression method would generate much larger errors than necessary.
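A small worked example makes the cost of a wrong linearity assumption concrete. The sketch below fits an ordinary least-squares line (an assumption chosen for illustration, not the designer's Linear Regression module) to two invented datasets: one that really is linear, and one that follows y = x². The helper names `fit_line` and `mean_squared_error` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

# Linear trend: the straight-line assumption holds, so the error is zero.
linear_ys = [2 * x + 1 for x in xs]
lin_slope, lin_intercept = fit_line(xs, linear_ys)
lin_mse = mean_squared_error(xs, linear_ys, lin_slope, lin_intercept)

# Nonlinear trend: the best straight line is flat (y = 4) and misses the curve.
curved_ys = [x ** 2 for x in xs]
cur_slope, cur_intercept = fit_line(xs, curved_ys)
cur_mse = mean_squared_error(xs, curved_ys, cur_slope, cur_intercept)

print(lin_mse, cur_mse)  # 0.0 12.0
```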

Number of parameters

Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm's behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.

Alternatively, there is the Tune Model Hyperparameters module in Machine Learning designer. The goal of this module is to determine the optimum hyperparameters for a machine learning model. The module builds and tests multiple models by using different combinations of settings. It compares metrics over all models to find the best combination of settings.

While this is a great way to make sure you've spanned the parameter space, the time required to train a model increases exponentially with the number of parameters. The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings.
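The exponential growth is easy to see by counting the combinations in an exhaustive grid. The sketch below is an illustration, not how the Tune Model Hyperparameters module works internally: the parameter names and candidate values are invented, and each combination would correspond to one model to train.

```python
from itertools import product


def parameter_grid(param_values):
    """Yield every combination of the candidate values as a dict."""
    names = list(param_values)
    for combo in product(*(param_values[name] for name in names)):
        yield dict(zip(names, combo))


# Hypothetical parameters with three candidate values each.
grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "iterations": [10, 100, 1000],
    "tolerance": [1e-3, 1e-4, 1e-5],
    "regularization": [0.0, 0.1, 1.0],
}

names = list(grid)
counts = []
for k in range(1, len(names) + 1):
    subset = {name: grid[name] for name in names[:k]}
    counts.append(sum(1 for _ in parameter_grid(subset)))

print(counts)  # [3, 9, 27, 81]
```

With three candidate values per parameter, each additional parameter triples the number of models to train.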

Number of features

In machine learning, a feature is a quantifiable variable of the phenomenon you are trying to analyze. For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetics or textual data.

A large number of features can bog down some learning algorithms, making training time unfeasibly long. Support vector machines are particularly well suited to scenarios with a high number of features. For this reason, they have been used in many applications, from information retrieval to text and image classification. Support vector machines can be used for both classification and regression tasks.

Feature selection refers to the process of applying statistical tests to inputs, given a specified output. The goal is to determine which columns are more predictive of the output. The Filter Based Feature Selection module in Machine Learning designer provides multiple feature selection algorithms to choose from. The module includes correlation methods such as Pearson correlation and chi-squared values.
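As a rough sketch of the filter-based idea (not the module's implementation), each column can be scored by the absolute value of its Pearson correlation with the output, and the highest-scoring columns kept. The function names, column names, and data below are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(columns, target, top_k):
    """Rank columns by |Pearson r| against the target and keep the top_k names."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


# Invented dataset: "a" tracks the target exactly, "b" is negatively
# correlated with it, and "c" is uncorrelated.
target = [1, 2, 3, 4, 5]
columns = {
    "a": [2, 4, 6, 8, 10],
    "b": [5, 3, 4, 1, 2],
    "c": [1, -1, 0, -1, 1],
}

selected, scores = select_features(columns, target, top_k=2)
print(selected)  # ['a', 'b']
```

Ranking by absolute value keeps strongly negative correlations such as column "b", which are as predictive as positive ones.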

You can also use the Permutation Feature Importance module to compute a set of feature importance scores for your dataset. You can then use these scores to help you determine the best features to use in a model.
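The general idea behind permutation importance can be sketched as follows: shuffle one feature column at a time and measure how much the model's accuracy drops. This is a simplified illustration under assumed names (`permutation_importance`, `threshold_model`) and invented data, not the module's actual procedure.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, feature_names, repeats=20, seed=0):
    """Average drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for col, name in enumerate(feature_names):
        drops = []
        for _ in range(repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [row[:col] + (value,) + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[name] = sum(drops) / repeats
    return importances


def threshold_model(row):
    """Hypothetical trained model: uses only the first feature."""
    return 1 if row[0] > 5 else 0


# Invented rows of (signal, noise); the labels depend on "signal" only.
rows = [(1, 9), (2, 3), (3, 7), (4, 1), (6, 8), (7, 2), (8, 6), (9, 4)]
labels = [threshold_model(row) for row in rows]

importances = permutation_importance(threshold_model, rows, labels, ["signal", "noise"])
print(importances)
```

Shuffling the "noise" column never changes a prediction, so its importance is exactly zero, while shuffling "signal" can only lower the accuracy.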

Next steps