Machine learning tasks in ML.NET

When building a machine learning model, you first need to define what you are hoping to achieve with your data. Afterwards, you can pick the right machine learning task for your situation. The following list describes the different machine learning tasks that you can choose from and some common use cases.

Note

ML.NET is currently in Preview. Not all machine learning tasks are currently supported. To submit a request for a certain task, open an issue in the dotnet/machinelearning repository.

Binary classification

A supervised machine learning task that is used to predict which of two classes (categories) an instance of data belongs to. The input of a classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1. The output of a binary classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of binary classification scenarios include:

  • Understanding sentiment of Twitter comments as either "positive" or "negative".
  • Diagnosing whether a patient has a certain disease or not.
  • Making a decision to mark an email as "spam" or not.
  • Determining if a photo contains a dog or fruit.

For more information, see the Binary classification article on Wikipedia.

Recommended learners for binary classification:

  • AveragedPerceptronTrainer
  • StochasticGradientDescentClassificationTrainer
  • LightGbmBinaryTrainer
  • FastTreeBinaryClassificationTrainer
  • SymSgdClassificationTrainer

Binary Classification Learners

The following learners are available for binary classification tasks:

Multiclass classification

A supervised machine learning task that is used to predict the class (category) of an instance of data. The input of a classification algorithm is a set of labeled examples. Each label normally starts as text. It is then run through the TermTransform, which converts it to the Key (numeric) type. The output of a classification algorithm is a classifier, which you can use to predict the class of new unlabeled instances. Examples of multi-class classification scenarios include:

  • Determining the breed of a dog as a "Siberian Husky", "Golden Retriever", "Poodle", etc.
  • Understanding movie reviews as "positive", "neutral", or "negative".
  • Categorizing hotel reviews as "location", "price", "cleanliness", etc.

For more information, see the Multiclass classification article on Wikipedia.

Recommended learners for Multi-class:

  • OVA-AveragedPerceptronTrainer
  • SdcaMultiClassTrainer
  • LightGbmMulticlassTrainer
  • OVA-FastTreeBinaryClassificationTrainer

Note

OVA and PKPD upgrades any binary classification learner to act on multiclass datasets. More information on [Wikipedia] (https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest).

Multiclass Classification Learners

The following learners are available for multiclass classification tasks:

Regression

A supervised machine learning task that is used to predict the value of the label from a set of related features. The label can be of any real value and is not from a finite set of values as in classification tasks. Regression algorithms model the dependency of the label on its related features to determine how the label will change as the values of the features are varied. The input of a regression algorithm is a set of examples with labels of known values. The output of a regression algorithm is a function, which you can use to predict the label value for any new set of input features. Examples of regression scenarios include:

  • Predicting house prices based on house attributes such as number of bedrooms, location, or size.
  • Predicting future stock prices based on historical data and current market trends.
  • Predicting sales of a product based on advertising budgets.

Recommended learners for regression:

  • FastTreeTweedieTrainer
  • LightGbmRegressorTrainer
  • SdcaRegressionTrainer
  • FastTreeRegressionTrainer

Regression Learners

The following learners are available for regression tasks:

Clustering

An unsupervised machine learning task that is used to group instances of data into clusters that contain similar characteristics. Clustering can also be used to identify relationships in a dataset that you might not logically derive by browsing or simple observation. The inputs and outputs of a clustering algorithm depends on the methodology chosen. You can take a distribution, centroid, connectivity, or density-based approach. ML.NET currently supports a centroid-based approach using K-Means clustering. Examples of clustering scenarios include:

  • Understanding segments of hotel guests based on habits and characteristics of hotel choices.
  • Identifying customer segments and demographics to help build targeted advertising campaigns.
  • Categorizing inventory based on manufacturing metrics.

Clustering Learners

The following learners are available for clustering tasks:

Anomaly detection

This task creates an anomaly detection model by using Principal Component Analysis (PCA). PCA-Based Anomaly Detection helps you build a model in scenarios where it is easy to obtain training data from one class, such as valid transactions, but difficult to obtain sufficient samples of the targeted anomalies.

An established technique in machine learning, PCA is frequently used in exploratory data analysis because it reveals the inner structure of the data and explains the variance in the data. PCA works by analyzing data that contains multiple variables. It looks for correlations among the variables and determines the combination of values that best captures differences in outcomes. These combined feature values are used to create a more compact feature space called the principal components.

Anomaly detection encompasses many important tasks in machine learning:

  • Identifying transactions that are potentially fraudulent.
  • Learning patterns that indicate that a network intrusion has occurred.
  • Finding abnormal clusters of patients.
  • Checking values entered into a system.

Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been especially designed to address the core challenges of building and training models by using imbalanced data sets.

Anomaly detection learners

The following learners are available for anomaly detection tasks:

Ranking

A ranking task constructs a ranker from a set of labeled examples. This example set consists of instance groups that can be scored with a given criteria. The ranking labels are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups with unknown scores for each instance. ML.NET ranking learners are machine learned ranking based.

Ranking learners

The following learners are available for ranking tasks:

Recommendation

A recommendation task enables producing a list of recommended products or services. ML.NET uses Matrix factorization (MF), a collaborative filtering algorithm for recommendations when you have historical product rating data in your catalog. For example, you have historical movie rating data for your users and want to recommend other movies they are likely to watch next.

Recommendation learners

The following learners are available for recommendation tasks: