Machine learning tasks in ML.NET

When building a machine learning model, you first need to define what you hope to achieve with your data. This lets you choose the right machine learning task for your situation. The following list describes the machine learning tasks that you can choose from, along with some common use cases.

Once you have decided which task fits your scenario, you need to choose the best algorithm to train your model. The available algorithms are listed in the section for each task.

Binary classification

A supervised machine learning task that is used to predict which of two classes (categories) an instance of data belongs to. The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of binary classification scenarios include:

  • Understanding the sentiment of Twitter comments as either "positive" or "negative".
  • Diagnosing whether a patient has a certain disease or not.
  • Deciding whether to mark an email as "spam" or not.
  • Determining whether a photo contains a given item, such as a dog or fruit.

For more information, see the Binary classification article on Wikipedia.

Binary classification trainers

You can train a binary classification model using the following algorithms:

Binary classification inputs and outputs

For best results with binary classification, the training data should be balanced (that is, equal numbers of positive and negative training examples). Missing values should be handled before training.

The input label column data must be Boolean. The input features column data must be a fixed-size vector of Single.

These trainers output the following columns:

| Output Column Name | Column Type | Description |
| --- | --- | --- |
| Score | Single | The raw score that was calculated by the model. |
| PredictedLabel | Boolean | The predicted label, based on the sign of the score. A negative score maps to false and a positive score maps to true. |
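
The sign rule that links Score to PredictedLabel can be illustrated with a minimal Python sketch. This is not ML.NET code. The function name `predicted_label` and the sample scores are hypothetical, and treating a score of exactly 0 as false is an assumption, since the description above only covers strictly negative and strictly positive scores.

```python
def predicted_label(score: float) -> bool:
    """Map a raw model score to a Boolean label using the sign of the score."""
    # Assumption: a score of exactly 0 maps to False. The text only states
    # that negative scores map to false and positive scores map to true.
    return score > 0


# Hypothetical raw scores, as a binary classifier might emit them.
scores = [-2.5, 0.3, 4.1]
labels = [predicted_label(s) for s in scores]
print(labels)
```

The magnitude of the score is ignored here; only its sign decides the label.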

Multiclass classification

A supervised machine learning task that is used to predict the class (category) of an instance of data. The input of a classification algorithm is a set of labeled examples. Each label normally starts as text. It is then run through the TermTransform, which converts it to the Key (numeric) type. The output of a classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of multiclass classification scenarios include:

  • Determining the breed of a dog as a "Siberian Husky", "Golden Retriever", "Poodle", etc.
  • Understanding movie reviews as "positive", "neutral", or "negative".
  • Categorizing hotel reviews as "location", "price", "cleanliness", etc.

For more information, see the Multiclass classification article on Wikipedia.

Note

One-vs-all upgrades any binary classification learner to act on multiclass datasets. For more information, see Wikipedia (https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest).

Multiclass classification trainers

You can train a multiclass classification model using the following training algorithms:

Multiclass classification inputs and outputs

The input label column data must be key type. The feature column must be a fixed-size vector of Single.

These trainers output the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Vector of Single | The scores of all classes. A higher value means a higher probability of falling into the associated class. If the i-th element has the largest value, the predicted label index is i. Note that i is a zero-based index. |
| PredictedLabel | key type | The predicted label's index. If its value is i, the actual label is the i-th category in the key-valued input label type. |
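
The two ideas in this section, converting text labels to numeric keys and reading the predicted class off the score vector, can be sketched in plain Python. This is an illustration only and does not use ML.NET. The helper names `build_key_map` and `predicted_index` are hypothetical, and assigning keys in order of first appearance is an assumption about how a text-to-key mapping might be built.

```python
def build_key_map(text_labels):
    """Assign each distinct text label an integer key.

    Assumption: keys are zero-based and assigned in order of first appearance.
    """
    key_map = {}
    for label in text_labels:
        if label not in key_map:
            key_map[label] = len(key_map)
    return key_map


def predicted_index(scores):
    """Return the zero-based index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical training labels, which start as text.
training_labels = ["Poodle", "Siberian Husky", "Poodle", "Golden Retriever"]
key_map = build_key_map(training_labels)
categories = list(key_map)  # position i holds the label whose key is i

# Hypothetical score vector with one score per class.
scores = [0.1, 0.7, 0.2]
index = predicted_index(scores)
breed = categories[index]
print(index, breed)
```

Looking up `categories[index]` is the step that turns the predicted index back into the original text label.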

Regression

A supervised machine learning task that is used to predict the value of the label from a set of related features. The label can be any real value and is not drawn from a finite set of values as in classification tasks. Regression algorithms model the dependency of the label on its related features to determine how the label will change as the values of the features vary. The input of a regression algorithm is a set of examples with labels of known values. The output of a regression algorithm is a function, which you can use to predict the label value for any new set of input features. Examples of regression scenarios include:

  • Predicting house prices based on house attributes such as number of bedrooms, location, or size.
  • Predicting future stock prices based on historical data and current market trends.
  • Predicting sales of a product based on advertising budgets.

Regression trainers

You can train a regression model using the following algorithms:

Regression inputs and outputs

The input label column data must be Single.

The trainers for this task output the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The raw score that was predicted by the model. |
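
To make "the output of a regression algorithm is a function" concrete, here is a minimal sketch that fits a straight line by ordinary least squares on a single feature. It is plain Python rather than an ML.NET trainer. The name `fit_line` and the advertising figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns a function that predicts the label for a new feature value.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Hypothetical examples with known labels: advertising budget -> sales.
budgets = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

predict = fit_line(budgets, sales)
score = predict(5.0)  # predicted label for an unseen budget
print(score)
```

The returned `predict` function plays the role of the trained model, and its return value corresponds to the Score column.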

Clustering

An unsupervised machine learning task that is used to group instances of data into clusters that contain similar characteristics. Clustering can also be used to identify relationships in a dataset that you might not logically derive by browsing or simple observation. The inputs and outputs of a clustering algorithm depend on the methodology chosen. You can take a distribution, centroid, connectivity, or density-based approach. ML.NET currently supports a centroid-based approach using K-Means clustering. Examples of clustering scenarios include:

  • Understanding segments of hotel guests based on habits and characteristics of hotel choices.
  • Identifying customer segments and demographics to help build targeted advertising campaigns.
  • Categorizing inventory based on manufacturing metrics.

Clustering trainer

You can train a clustering model using the following algorithm:

Clustering inputs and outputs

The input features data must be Single. No labels are needed.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Vector of Single | The distances of the given data point to all clusters' centroids. |
| PredictedLabel | key type | The closest cluster's index predicted by the model. |
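
The relationship between the two clustering outputs can be sketched as follows, assuming the centroids are already known. This is plain Python, not ML.NET. The centroids and the data point are made up, and Euclidean distance and a zero-based cluster index are assumptions made for the illustration.

```python
import math


def distances_to_centroids(point, centroids):
    """Return the distance from the point to every centroid.

    Assumption: plain Euclidean distance.
    """
    return [math.dist(point, c) for c in centroids]


def closest_cluster(distances):
    """Return the index of the smallest distance (zero-based here)."""
    return min(range(len(distances)), key=lambda i: distances[i])


# Hypothetical centroids, as a K-Means model with three clusters might hold.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
point = (4.0, 4.0)

score = distances_to_centroids(point, centroids)
label = closest_cluster(score)
print(score, label)
```

Here `score` has one entry per cluster, and `label` is the position of the smallest entry.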

Anomaly detection

This task creates an anomaly detection model by using Principal Component Analysis (PCA). PCA-based anomaly detection helps you build a model in scenarios where it is easy to obtain training data from one class, such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.

An established technique in machine learning, PCA is frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data. PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.
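
One common way to turn PCA into an anomaly score is to measure how far a point lies from the subspace spanned by the principal components. The sketch below assumes that approach, and the ML.NET trainer may compute its score differently. It is plain Python, keeps a single principal component, finds it by power iteration, and uses invented data. The names `top_component` and `anomaly_score` are hypothetical.

```python
import math


def top_component(points, iterations=100):
    """Estimate the mean and first principal component by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    # Assumption: the all-ones start vector is not orthogonal to the top
    # component, and the data is not constant (otherwise norm would be 0).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def anomaly_score(point, mean, component):
    """Distance from the point to the line through the mean along the component."""
    c = [x - m for x, m in zip(point, mean)]
    projection = sum(a * b for a, b in zip(c, component))
    residual = [a - projection * b for a, b in zip(c, component)]
    return math.sqrt(sum(r * r for r in residual))


# Hypothetical "normal" training data lying on the line y = x.
normal = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
mean, component = top_component(normal)

typical = anomaly_score((2.5, 2.5), mean, component)
unusual = anomaly_score((1.0, 4.0), mean, component)
print(typical, unusual)
```

Only normal examples are used to fit the component, which matches the scenario described above where samples of anomalies are scarce. A point that fits the learned structure scores near zero, and the score grows without bound as a point moves away from it.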

Anomaly detection encompasses many important tasks in machine learning:

  • Identifying transactions that are potentially fraudulent.
  • Learning patterns that indicate that a network intrusion has occurred.
  • Finding abnormal clusters of patients.
  • Checking values entered into a system.

Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been specially designed to address the core challenges of building and training models with imbalanced data sets.

Anomaly detection trainer

You can train an anomaly detection model using the following algorithm:

Anomaly detection inputs and outputs

The input features must be a fixed-size vector of Single.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The non-negative, unbounded score that was calculated by the anomaly detection model. |

Ranking

A ranking task constructs a ranker from a set of labeled examples. This example set consists of instance groups that can be scored against a given criterion. The ranking labels are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups in which the score of each instance is unknown. ML.NET ranking learners are based on machine-learned ranking.

Ranking training algorithms

You can train a ranking model with the following algorithms:

Ranking inputs and outputs

The input label data type must be key type or Single. The value of the label determines relevance, where higher values indicate higher relevance. If the label is a key type, then the key index is the relevance value, where the smallest index is the least relevant. If the label is a Single, larger values indicate higher relevance.

The feature data must be a fixed-size vector of Single, and the input row group column must be key type.

This trainer outputs the following:

| Output Name | Type | Description |
| --- | --- | --- |
| Score | Single | The unbounded score that was calculated by the model to determine the prediction. |
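
A ranker's scores are typically consumed by ordering the instances within each group from highest to lowest score. The sketch below shows that step only. It is plain Python rather than ML.NET, and the group ids, item names, and scores are invented. The helper name `rank_groups` is hypothetical.

```python
from collections import defaultdict


def rank_groups(rows):
    """Order the items of each group by descending score.

    rows: iterable of (group_id, item, score) tuples.
    """
    groups = defaultdict(list)
    for group_id, item, score in rows:
        groups[group_id].append((item, score))
    return {
        group_id: [item for item, _ in
                   sorted(items, key=lambda pair: pair[1], reverse=True)]
        for group_id, items in groups.items()
    }


# Hypothetical scored rows; the first field stands in for the row group column.
rows = [
    ("q1", "doc_a", 0.2),
    ("q1", "doc_b", 1.7),
    ("q2", "doc_c", -0.4),
    ("q1", "doc_d", 0.9),
    ("q2", "doc_e", 0.1),
]
ranking = rank_groups(rows)
print(ranking)
```

Because the score is unbounded, only its order within a group is used here; scores from different groups are never compared.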

Recommendation

A recommendation task produces a list of recommended products or services. ML.NET uses matrix factorization (MF), a collaborative filtering algorithm, for recommendations when you have historical product rating data in your catalog. For example, you have historical movie rating data for your users and want to recommend other movies they are likely to watch next.
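
In matrix factorization, each user and each item is represented by a short vector of latent factors, and a predicted rating is the dot product of the two. The sketch below shows only that prediction step, with factor values that are made up rather than learned from rating data. It is plain Python, not ML.NET, and `predict_rating` is a hypothetical name.

```python
def predict_rating(user_factors, item_factors):
    """Predict a rating as the dot product of user and item latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


# Hypothetical latent factors. A real model would learn these by factorizing
# the matrix of historical user-movie ratings.
user = [1.0, 0.5]
movies = {
    "movie_a": [4.0, 0.0],
    "movie_b": [1.0, 2.0],
    "movie_c": [3.0, 4.0],
}

predicted = {name: predict_rating(user, factors)
             for name, factors in movies.items()}
best = max(predicted, key=predicted.get)
print(predicted, best)
```

Sorting the movies by predicted rating, and dropping those the user has already rated, would give the recommendation list.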

Recommendation training algorithms

You can train a recommendation model with the following algorithm: