KMeansPlusPlus Class

Machine Learning KMeans clustering algorithm

Inheritance
nimbusml.internal.core.cluster._kmeansplusplus.KMeansPlusPlus
nimbusml.base_predictor.BasePredictor
sklearn.base.ClusterMixin
KMeansPlusPlus

Constructor

KMeansPlusPlus(normalize='Auto', caching='Auto', n_clusters=5, number_of_threads=None, initialization_algorithm='KMeansYinyang', opt_tol=1e-07, maximum_number_of_iterations=1000, accel_mem_budget_mb=4096, feature=None, weight=None, **params)

Parameters

feature

see Columns.

weight

see Columns.

normalize

Specifies the type of automatic normalization used:

  • "Auto": if normalization is needed, it is performed automatically. This is the default choice.

  • "No": no normalization is performed.

  • "Yes": normalization is performed.

  • "Warn": if normalization is needed, a warning message is displayed, but normalization is not performed.

Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures that the distances between data points are proportional and enables optimization methods such as gradient descent to converge much faster. If normalization is performed, a MaxMin normalizer is used. It normalizes values into an interval [a, b] where -1 <= a <= 0, 0 <= b <= 1, and b - a = 1. This normalizer preserves sparsity by mapping zero to zero.
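The interval properties described above can be illustrated with a minimal sketch. It is not the library's normalizer; it only assumes that the scaling divides by the width of the observed range (extended to include zero) and applies no offset. The function name scale_preserving_zero is hypothetical.

```python
def scale_preserving_zero(values):
    """Rescale values so that zero stays zero and the output range has width 1.

    Illustrative only: an assumed form of zero-preserving min-max scaling,
    not the implementation used by the library.
    """
    # Extend the observed range so that it always contains zero.
    low = min(min(values), 0.0)
    high = max(max(values), 0.0)
    span = high - low
    if span == 0.0:
        # Every value is zero, so there is nothing to rescale.
        return [0.0 for _ in values]
    # Dividing by the span without subtracting an offset keeps zero fixed.
    return [v / span for v in values]


scaled = scale_preserving_zero([-2.0, 0.0, 1.0, 6.0])
print(scaled)  # [-0.25, 0.0, 0.125, 0.75]
```

Here a = -0.25 and b = 0.75, so b - a = 1, and the zero entry is unchanged, which is what keeps sparse inputs sparse.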

caching

Whether the trainer should cache the input training data.

n_clusters

The number of clusters.

number_of_threads

Degree of lock-free parallelism. Defaults to automatic. Results are not guaranteed to be deterministic.

initialization_algorithm

Cluster initialization algorithm.

opt_tol

Tolerance parameter for trainer convergence. A lower value makes training slower but more accurate.

maximum_number_of_iterations

Maximum number of iterations.

accel_mem_budget_mb

Memory budget (in MB) to use for KMeans acceleration.

params

Additional arguments sent to the compute engine.
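To show how a convergence tolerance and an iteration cap such as opt_tol and maximum_number_of_iterations typically interact, here is a minimal sketch of a plain K-means refinement loop. This page does not state the trainer's exact stopping rule, so the criterion used here (largest squared center movement compared against the tolerance) is an assumption, and the names lloyd, tol, and max_iter are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, tol=1e-7, max_iter=1000):
    """Refine centers until they move by at most tol or max_iter is reached.

    Illustrative only: the stopping rule is an assumed one, not the trainer's.
    """
    centers = [tuple(c) for c in centers]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                dim = len(center)
                new_centers.append(tuple(
                    sum(p[d] for p in group) / len(group) for d in range(dim)))
            else:
                # Keep the center of an empty cluster where it is.
                new_centers.append(center)
        # Largest squared movement of any center in this iteration.
        shift = max(squared_distance(c, n)
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, iterations = lloyd(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)     # [(0.0, 0.5), (10.0, 10.5)]
print(iterations)  # 2
```

A smaller tolerance forces more iterations before the loop stops, while the iteration cap bounds the running time even if the tolerance is never reached.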

Examples


   ###############################################################################
   # KMeansPlusPlus
   from nimbusml import Pipeline, FileDataStream
   from nimbusml.cluster import KMeansPlusPlus
   from nimbusml.datasets import get_dataset
   from nimbusml.feature_extraction.categorical import OneHotVectorizer

   # data input (as a FileDataStream)
   path = get_dataset('infert').as_filepath()

   data = FileDataStream.read_csv(path)
   print(data.head())
   #    age  case education  induced  parity ... row_num  spontaneous  ...
   # 0   26     1    0-5yrs        1       6 ...       1            2  ...
   # 1   42     1    0-5yrs        1       1 ...       2            0  ...
   # 2   39     1    0-5yrs        2       6 ...       3            0  ...
   # 3   34     1    0-5yrs        2       4 ...       4            0  ...
   # 4   35     1   6-11yrs        1       3 ...       5            1  ...

   # define the training pipeline
   pipeline = Pipeline([
       OneHotVectorizer(columns={'edu': 'education'}),
       KMeansPlusPlus(n_clusters=5, feature=['induced', 'edu', 'parity'])
   ])

   # train, predict, and evaluate
   metrics, predictions = pipeline \
       .fit(data) \
       .test(data, 'induced', output_scores=True)

   # print predictions
   print(predictions.head())
   #   PredictedLabel   Score.0   Score.1   Score.2   Score.3   Score.4
   # 0               4  2.732253  2.667988  2.353899  2.339244  0.092014
   # 1               4  2.269290  2.120064  2.102576  2.222578  0.300347
   # 2               4  3.482253  3.253153  2.425328  2.269245  0.258680
   # 3               4  3.130401  2.867317  2.158132  2.055911  0.175347
   # 4               2  0.287809  2.172567  0.036439  2.102578  2.050347

   # print evaluation metrics
   print(metrics)
   #        NMI  AvgMinScore
   # 0  0.521177     0.074859

Remarks

K-means is a popular clustering algorithm. With K-means++, the data is clustered into a specified number of clusters in order to minimize the within-cluster sum of squares. K-means++ improves upon K-means by using a better method for choosing the initial cluster centers.
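The seeding idea behind K-means++ can be sketched as follows. This is a generic illustration of the standard D^2 sampling scheme and of the within-cluster sum of squares objective, not the code path used by this class; the function names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng):
    """Choose k initial centers using D^2 sampling (K-means++ seeding)."""
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0.0:
            # All points coincide with a center; fall back to a uniform draw.
            centers.append(rng.choice(points))
            continue
        # Points far from every existing center are more likely to be picked.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def within_cluster_ss(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (-10.0, 10.0), (-10.0, 11.0), (-11.0, 10.0)]
seeds = kmeans_pp_seed(points, k=3, rng=random.Random(0))
print(seeds)
print(within_cluster_ss(points, seeds))
```

Because an already chosen point has zero distance to itself, it receives zero sampling weight, so the seeds tend to spread across the data instead of clustering together as uniformly random seeds can.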

Reference

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf

Methods

get_params

Get the parameters for this operator.

get_params(deep=False)

Parameters

deep
default value: False