kmeans_fl()

The function kmeans_fl() clusterizes a dataset using the k-means algorithm.

Note

Syntax

T | invoke kmeans_fl(k, features_cols, cluster_col)

Arguments

  • k: The required number of clusters.
  • features_cols: Dynamic array containing the names of the features columns that are used for clustering.
  • cluster_col: The name of the column that stores the output cluster ID for each record.

Usage

kmeans_fl() is a user-defined tabular function to be applied using the invoke operator. You can either embed its code in your query, or install it in your database. There are two usage options: ad hoc and persistent usage. See the below tabs for examples.

For ad hoc usage, embed the code using the let statement. No permission is required.

let kmeans_fl=(tbl:(*), k:int, features:dynamic, cluster_col:string)
{
    let kwargs = pack('k', k, 'features', features, 'cluster_col', cluster_col);
    let code = ```if 1:
        
        from sklearn.cluster import KMeans
        
        k = kargs["k"]
        features = kargs["features"]
        cluster_col = kargs["cluster_col"]
        
        km = KMeans(n_clusters=k)
        df1 = df[features]
        km.fit(df1)
        result = df
        result[cluster_col] = km.labels_
    ```;
    tbl
    | evaluate python(typeof(*), code, kwargs)
};
//
// Clusterize room occupancy from sensors measurements.
//
// Occupancy Detection is an open dataset from UCI Repository at https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
// It contains experimental data for binary classification of room occupancy from Temperature, Humidity, Light, and CO2.
//
OccupancyDetection
| extend cluster_id=double(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| sample 10
Timestamp	                Temperature Humidity	Light  CO2         HumidityRatio Occupancy Test	 cluster_id
2015-02-02 14:38:00.0000000	23.64       27.1        473    908.8       0.00489763    TRUE	   TRUE  1
2015-02-03 01:47:00.0000000	20.575      22.125      0      446         0.00330878	 FALSE	   TRUE  0
2015-02-10 08:47:00.0000000	20.42666667 33.56       405    494.3333333 0.004986493	 TRUE	   FALSE 4
2015-02-10 09:15:00.0000000	20.85666667 35.09666667 433    665.3333333 0.005358055	 TRUE	   FALSE 4
2015-02-11 16:13:00.0000000	21.89       30.0225     429    771.75      0.004879358	 TRUE	   TRUE  4
2015-02-13 14:06:00.0000000	23.4175     26.5225     608    599.75      0.004728116	 TRUE	   TRUE  4
2015-02-13 23:09:00.0000000	20.13333333 32.2        0      502.6666667 0.004696278	 FALSE	   TRUE  0
2015-02-15 18:30:00.0000000	20.5        32.79       0      666.5       0.004893459	 FALSE	   TRUE  3
2015-02-17 13:43:00.0000000	21.7        33.9        454    1167        0.005450924	 TRUE	   TRUE  1
2015-02-17 18:17:00.0000000	22.025      34.2225     0      1538.25     0.005614538	 FALSE	   TRUE  2

Extract the centroids and size of each cluster, with the function already installed:

OccupancyDetection
| extend cluster_id=double(null)
| invoke kmeans_fl(5, pack_array("Temperature", "Humidity", "Light", "CO2", "HumidityRatio"), "cluster_id")
| summarize Temperature=avg(Temperature), Humidity=avg(Humidity), Light=avg(Light), CO2=avg(CO2), HumidityRatio=avg(HumidityRatio), num=count() by cluster_id
| order by num

cluster_id	Temperature        Humidity            Light            CO2                HumidityRatio        num
0	        20.3507186863278   27.1521395395781    10.1995789883291	486.804272186974   0.00400132147662714	11124
4	        20.9247315268427   28.7971160082823    20.7311894656536	748.965771574469   0.00440412568299058	3063
1	        22.0284137970445   27.8953334469889    481.872136037748	1020.70779349773   0.00456692559904535	2514
3	        22.0344177115763   25.1151053429273    462.358969056434	656.310608696507   0.00411782436443015	2176
2	        21.4091216082295   31.8363099552939    174.614816229606	1482.05062388414   0.00504573022875817	1683