# KMeansTrainer Class

## Definition

Important

Some information relates to prerelease product that may be substantially modified before it's released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

The IEstimator&lt;TTransformer&gt; for training a K-means clusterer.

```csharp
public class KMeansTrainer : Microsoft.ML.Trainers.TrainerEstimatorBase<Microsoft.ML.Data.ClusteringPredictionTransformer<Microsoft.ML.Trainers.KMeansModelParameters>,Microsoft.ML.Trainers.KMeansModelParameters>
```

```fsharp
type KMeansTrainer = class
    inherit TrainerEstimatorBase<ClusteringPredictionTransformer<KMeansModelParameters>, KMeansModelParameters>
```

```vb
Public Class KMeansTrainer
Inherits TrainerEstimatorBase(Of ClusteringPredictionTransformer(Of KMeansModelParameters), KMeansModelParameters)
```

- Inheritance: Object → TrainerEstimatorBase&lt;ClusteringPredictionTransformer&lt;KMeansModelParameters&gt;, KMeansModelParameters&gt; → KMeansTrainer

## Remarks

To create this trainer, use KMeans or KMeans(Options).
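As a minimal sketch of the simple factory overload, the trainer can be created through the clustering catalog and fit on an `IDataView`. The `DataPoint` class, the `Features` column name, and the toy data below are illustrative assumptions, not part of this reference:

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative input type: a fixed-size vector of Single named "Features".
public class DataPoint
{
    [VectorType(2)]
    public float[] Features { get; set; }
}

public static class KMeansDemo
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Toy data: two well-separated groups of 2-D points.
        var points = new List<DataPoint>
        {
            new DataPoint { Features = new[] { 0.0f, 0.0f } },
            new DataPoint { Features = new[] { 0.1f, 0.1f } },
            new DataPoint { Features = new[] { 5.0f, 5.0f } },
            new DataPoint { Features = new[] { 5.1f, 5.1f } },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(points);

        // Simple overload: feature column name and number of clusters.
        var trainer = mlContext.Clustering.Trainers.KMeans(
            featureColumnName: "Features", numberOfClusters: 2);

        // Fit returns a ClusteringPredictionTransformer<KMeansModelParameters>.
        var model = trainer.Fit(data);
    }
}
```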

### Input and Output Columns

The input features column data must be a known-sized vector of Single. No label column is needed. This trainer outputs the following columns:

Output Column Name | Column Type | Description
---|---|---
`Score` | Vector of Single | The distances of the given data point to all clusters' centroids.
`PredictedLabel` | Key type | The index of the closest cluster predicted by the model.

### Trainer Characteristics

Machine learning task | Clustering
---|---
Is normalization required? | Yes
Is caching required? | Yes
Required NuGet in addition to Microsoft.ML | None
Exportable to ONNX | Yes

### Training Algorithm Details

K-means is a popular clustering algorithm. With K-means, the data is clustered into a specified number of clusters in order to minimize the within-cluster sum of squared distances. This implementation follows the Yinyang K-means method. For choosing the initial cluster centroids, one of three options can be used:

- Random initialization. This might lead to potentially bad approximations of the optimal clustering.
- The K-means++ method. This improved initialization algorithm, introduced by Arthur and Vassilvitskii, guarantees a solution that is $O(\log K)$-competitive with the optimal K-means solution.
- The K-means|| method. This method, introduced by Bahmani et al., uses a parallel initialization that drastically reduces the number of passes needed to obtain a good initialization.

K-means|| is the default initialization method. The other methods can be specified in the Options when creating the trainer using KMeans(Options).
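Selecting an initialization method might look like the following sketch. The option and enum member names (`NumberOfClusters`, `InitializationAlgorithm`, `KMeansPlusPlus`, `MaximumNumberOfIterations`) reflect the ML.NET API as best understood here; verify them against the KMeansTrainer.Options reference before relying on them:

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

public static class OptionsDemo
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Full-control overload: configure the trainer via Options.
        var options = new KMeansTrainer.Options
        {
            FeatureColumnName = "Features",
            NumberOfClusters = 4,
            // Override the default K-means|| initialization with K-means++.
            InitializationAlgorithm =
                KMeansTrainer.InitializationAlgorithm.KMeansPlusPlus,
            MaximumNumberOfIterations = 100
        };

        var trainer = mlContext.Clustering.Trainers.KMeans(options);
    }
}
```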

### Scoring Function

The output Score column contains the square of the $L_2$-norm distance (i.e., Euclidean distance) of the given input vector $\textbf{x}\in \mathbb{R}^n$ to each cluster's centroid. Assume that the centroid of the $c$-th cluster is $\textbf{m}_c \in \mathbb{R}^n$. The $c$-th value in the Score column would be $d_c = || \textbf{x} - \textbf{m}_c ||_2^2$. The predicted label is the index with the smallest value in a $K$ dimensional vector $[d_{0}, \dots, d_{K-1}]$, where $K$ is the number of clusters.
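The scoring rule above can be worked through by hand. This standalone sketch uses made-up numbers and a zero-based index for illustration (in ML.NET the `PredictedLabel` column is a key type rather than a plain integer):

```csharp
using System;
using System.Linq;

public static class ScoreDemo
{
    // Squared Euclidean distance: d = ||x - m||_2^2.
    static float SquaredL2(float[] x, float[] m) =>
        x.Zip(m, (a, b) => (a - b) * (a - b)).Sum();

    public static void Main()
    {
        float[] x = { 1f, 1f };                      // input vector
        float[][] centroids =
        {
            new[] { 0f, 0f },                        // cluster 0 centroid
            new[] { 5f, 5f },                        // cluster 1 centroid
        };

        // Score vector [d_0, d_1]: distances to each centroid.
        float[] score = centroids.Select(m => SquaredL2(x, m)).ToArray();

        // Predicted label: index of the smallest distance.
        int predicted = Array.IndexOf(score, score.Min());

        // Prints: Score = [2, 32], PredictedLabel = 0
        Console.WriteLine(
            $"Score = [{string.Join(", ", score)}], PredictedLabel = {predicted}");
    }
}
```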

For more information on K-means and K-means++, see: K-means, K-means++.

Check the See Also section for links to usage examples.

## Fields

Field | Description
---|---
`FeatureColumn` | The feature column that the trainer expects. (Inherited from TrainerEstimatorBase&lt;TTransformer,TModel&gt;)
`LabelColumn` | The label column that the trainer expects. Can be null, which indicates that label is not used for training. (Inherited from TrainerEstimatorBase&lt;TTransformer,TModel&gt;)
`WeightColumn` | The weight column that the trainer expects. Can be null, which indicates that weight is not used for training. (Inherited from TrainerEstimatorBase&lt;TTransformer,TModel&gt;)

## Properties

Property | Description
---|---
`Info` | Auxiliary information about the trainer in terms of its capabilities and requirements.

## Methods

Method | Description
---|---
`Fit(IDataView)` | Trains and returns a ITransformer. (Inherited from TrainerEstimatorBase&lt;TTransformer,TModel&gt;)
`GetOutputSchema(SchemaShape)` | (Inherited from TrainerEstimatorBase&lt;TTransformer,TModel&gt;)

## Extension Methods

Extension Method | Description
---|---
`AppendCacheCheckpoint<TTrans>(IEstimator<TTrans>, IHostEnvironment)` | Append a 'caching checkpoint' to the estimator chain. This will ensure that the downstream estimators will be trained against cached data. It is helpful to have a caching checkpoint before trainers that take multiple data passes.
`WithOnFitDelegate<TTransformer>(IEstimator<TTransformer>, Action<TTransformer>)` | Given an estimator, return a wrapping object that will call a delegate once Fit(IDataView) is called. It is often important for an estimator to return information about what was fit, which is why the Fit(IDataView) method returns a specifically typed object, rather than just a general ITransformer. However, at the same time, IEstimator&lt;TTransformer&gt; are often formed into pipelines with many objects, so we may need to build a chain of estimators via EstimatorChain&lt;TLastTransformer&gt; where the estimator for which we want to get the transformer is buried somewhere in this chain. For that scenario, we can through this method attach a delegate that will be called once fit is called.
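As a hedged sketch of the `WithOnFitDelegate` pattern for this trainer: the delegate can capture the trained `KMeansModelParameters` from deep inside a pipeline, after which the learned centroids can be read back. The `GetClusterCentroids` call shape is an assumption; confirm it against the KMeansModelParameters reference page:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

public static class OnFitDemo
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        KMeansModelParameters kmeansModel = null;

        // Capture the trained model parameters when Fit is called,
        // even if this estimator is buried inside a longer chain.
        var pipeline = mlContext.Clustering.Trainers
            .KMeans(featureColumnName: "Features", numberOfClusters: 3)
            .WithOnFitDelegate(transformer => kmeansModel = transformer.Model);

        // After pipeline.Fit(dataView) runs, inspect the centroids:
        // VBuffer<float>[] centroids = default;
        // kmeansModel.GetClusterCentroids(ref centroids, out int k);
    }
}
```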