KMeansPlusPlus Class
Machine Learning KMeans clustering algorithm
- Inheritance
  nimbusml.internal.core.cluster._kmeansplusplus.KMeansPlusPlus → KMeansPlusPlus
  nimbusml.base_predictor.BasePredictor → KMeansPlusPlus
  sklearn.base.ClusterMixin → KMeansPlusPlus
Constructor
KMeansPlusPlus(normalize='Auto', caching='Auto', n_clusters=5, number_of_threads=None, initialization_algorithm='KMeansYinyang', opt_tol=1e-07, maximum_number_of_iterations=1000, accel_mem_budget_mb=4096, feature=None, weight=None, **params)
Parameters
- feature
see Columns.
- weight
see Columns.
- normalize
Specifies the type of automatic normalization used:
  "Auto": if normalization is needed, it is performed automatically. This is the default choice.
  "No": no normalization is performed.
  "Yes": normalization is performed.
  "Warn": if normalization is needed, a warning message is displayed, but normalization is not performed.
Normalization rescales disparate data ranges to a standard scale. Feature scaling ensures that the distances between data points are proportional and enables various optimization methods such as gradient descent to converge much faster. If normalization is performed, a MaxMin normalizer is used. It normalizes values into an interval [a, b] where -1 <= a <= 0, 0 <= b <= 1, and b - a = 1. This normalizer preserves sparsity by mapping zero to zero.
- caching
Whether trainer should cache input training data.
- n_clusters
The number of clusters.
- number_of_threads
Degree of lock-free parallelism. Defaults to automatic. Determinism is not guaranteed.
- initialization_algorithm
Cluster initialization algorithm.
- opt_tol
Tolerance parameter for trainer convergence. Lower values result in slower but more accurate training.
- maximum_number_of_iterations
Maximum number of iterations.
- accel_mem_budget_mb
Memory budget (in MB) to use for KMeans acceleration.
- params
Additional arguments sent to compute engine.
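The MaxMin rescaling described under normalize can be sketched in plain Python. This is an illustration of the documented contract (zero maps to zero, and the mapped interval [a, b] satisfies b - a = 1), not nimbusml's internal implementation:

```python
def max_min_normalize(values):
    # Scale by 1 / (max - min): the mapped interval [a, b] then has
    # b - a = 1, and zero maps to zero, which preserves sparsity.
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate column with no spread: map everything to zero.
        return [0.0 for _ in values]
    scale = 1.0 / (hi - lo)
    return [v * scale for v in values]

print(max_min_normalize([0.0, 2.0, -1.0, 4.0]))
# -> [0.0, 0.4, -0.2, 0.8]  (a = -0.2, b = 0.8, b - a = 1)
```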
Examples
###############################################################################
# KMeansPlusPlus
from nimbusml import Pipeline, FileDataStream
from nimbusml.cluster import KMeansPlusPlus
from nimbusml.datasets import get_dataset
from nimbusml.feature_extraction.categorical import OneHotVectorizer
# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()
data = FileDataStream.read_csv(path)
print(data.head())
# age case education induced parity ... row_num spontaneous ...
# 0 26 1 0-5yrs 1 6 ... 1 2 ...
# 1 42 1 0-5yrs 1 1 ... 2 0 ...
# 2 39 1 0-5yrs 2 6 ... 3 0 ...
# 3 34 1 0-5yrs 2 4 ... 4 0 ...
# 4 35 1 6-11yrs 1 3 ... 5 1 ...
# define the training pipeline
pipeline = Pipeline([
    OneHotVectorizer(columns={'edu': 'education'}),
    KMeansPlusPlus(n_clusters=5, feature=['induced', 'edu', 'parity'])
])
# train, predict, and evaluate
metrics, predictions = pipeline \
    .fit(data) \
    .test(data, 'induced', output_scores=True)
# print predictions
print(predictions.head())
# PredictedLabel Score.0 Score.1 Score.2 Score.3 Score.4
# 0 4 2.732253 2.667988 2.353899 2.339244 0.092014
# 1 4 2.269290 2.120064 2.102576 2.222578 0.300347
# 2 4 3.482253 3.253153 2.425328 2.269245 0.258680
# 3 4 3.130401 2.867317 2.158132 2.055911 0.175347
# 4 2 0.287809 2.172567 0.036439 2.102578 2.050347
# print evaluation metrics
print(metrics)
# NMI AvgMinScore
# 0 0.521177 0.074859
Remarks
K-means is a popular clustering algorithm. With K-means++, the data is clustered into a specified number of clusters in order to minimize the within-cluster sum of squares. K-means++ improves upon K-means by using a better method for choosing the initial cluster centers.
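The seeding step that distinguishes K-means++ from plain K-means can be sketched as follows. This is an illustrative reimplementation of the standard k-means++ seeding scheme, not nimbusml's internal code: each new center is drawn with probability proportional to its squared distance from the nearest center already chosen, which spreads the initial centers apart.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    # Illustrative k-means++ seeding: sample each new center with
    # probability proportional to its squared distance from the
    # nearest center chosen so far.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers)
              for pt in points]
        # Weighted sampling proportional to d2.
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
    return centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(kmeans_pp_init(points, 2))
```

Because distant points carry much larger weights, the two chosen centers tend to land in different natural clusters, which is why k-means++ typically converges faster and to better solutions than random seeding.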
Reference
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf
Methods
get_params | Get the parameters for this operator.
get_params
Get the parameters for this operator.
get_params(deep=False)
Parameters
- deep