Cluster classification in RevoScaleR

Clustering is the general name for a large family of classification techniques that assign each observation to one of two or more clusters on the basis of some distance metric.

K-means Clustering

K-means clustering is a classification technique that groups observations of numeric data using one of several iterative relocation algorithms. Starting from some initial classification, which may be random, points are moved from one cluster to another so as to minimize the within-cluster sums of squares. In RevoScaleR, the algorithm used is that of Lloyd.
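
The iterative relocation idea can be sketched in a few lines of base R. This is only an illustration of a Lloyd-style iteration on a small in-memory matrix, not the RevoScaleR implementation; the function name lloydKmeans, its arguments, and the synthetic data are assumptions made for this example.

```r
# Illustrative Lloyd-style k-means in base R (hypothetical helper, not rxKmeans).
lloydKmeans <- function(x, k, maxIter = 100, seed = 1) {
  x <- as.matrix(x)
  set.seed(seed)
  # Initial centers: k distinct rows chosen at random.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(maxIter)) {
    # Assignment step: squared Euclidean distance from each point to each center.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    newCluster <- max.col(-d2, ties.method = "first")
    if (all(newCluster == cluster)) break   # no point moved: converged
    cluster <- newCluster
    # Update step: each center becomes the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  # Within-cluster sum of squares, one value per cluster.
  withinss <- sapply(seq_len(k), function(j)
    sum(sweep(x[cluster == j, , drop = FALSE], 2, centers[j, ])^2))
  list(centers = centers, cluster = cluster, withinss = withinss,
       numIterations = iter)
}

# Synthetic example: two well-separated groups of 50 points each.
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
fit <- lloydKmeans(pts, k = 2)
table(fit$cluster)
fit$centers
```

Each pass alternates an assignment step and an update step, and neither step can increase the total within-cluster sum of squares, which is why the loop can stop as soon as no point changes cluster.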

To perform k-means clustering with RevoScaleR, use the rxKmeans function.

Clustering the Airline Data

As a first example of k-means clustering, we cluster the arrival delay and scheduled departure time in the 7% subsample of the airline data. To start, we extract the variables of interest into a new working data set, to which we will later write additional information:

#  K-means Clustering

#   Clustering the Airline Data  
bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
rxDataStep(inData = sampleAirData, outFile = "AirlineDataClusterVars.xdf",
  varsToKeep=c("DayOfWeek", "ArrDelay", "CRSDepTime", "DepDelay"))

We specify the variables to cluster as a formula, and specify the number of clusters we’d like. Initial centers for these clusters are then chosen at random.

kclusts1 <- rxKmeans(formula = ~ArrDelay + CRSDepTime,
    data = "AirlineDataClusterVars.xdf",
    seed = 10,
    outFile = "AirlineDataClusterVars.xdf", numClusters = 5)
kclusts1

This produces the following output (because the initial centers are chosen at random, your output will probably look different):

Call:
rxKmeans(formula = ~ArrDelay + CRSDepTime, data = "AirlineDataClusterVars.xdf", 
    outFile = "AirlineDataClusterVars.xdf", numClusters = 5)

Data: "AirlineDataClusterVars.xdf"
Number of valid observations: 10186272
Number of missing observations: 213483 
Clustering algorithm:  
 
K-means clustering with 5 clusters of sizes 922985, 38192, 4772791, 261779, 4190525

Cluster means:
    ArrDelay CRSDepTime
1  45.258179   14.86596
2 275.363820   14.81432
3 -10.284426   13.08375
4 118.365205   15.52079
5   7.803893   13.53811

Within cluster sum of squares by cluster:
        1         2         3         4         5 
223220709 501736748 354763376 233533349 312403604 

Available components:
 [1] "centers"       "size"          "withinss"      "valid.obs"    
 [5] "missing.obs"   "numIterations" "tot.withinss"  "totss"        
 [9] "betweenss"     "cluster"       "params"        "formula"      
[13] "call"     

The value returned by rxKmeans is a list similar to the list returned by the standard R kmeans function. The printed output shows a subset of this information, including the number of valid and missing observations, the cluster sizes, the cluster centers, and the within-cluster sums of squares.
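
For comparison, the corresponding components of the standard R kmeans result can be inspected as in the sketch below. It uses stats::kmeans on a small synthetic data frame rather than rxKmeans on the airline file; the data values are invented for illustration, and only the column names are borrowed from the example above.

```r
# Base R analogue: stats::kmeans on synthetic in-memory data (not rxKmeans).
set.seed(10)
toy <- data.frame(
  ArrDelay   = c(rnorm(50, mean = -5, sd = 5), rnorm(50, mean = 60, sd = 10)),
  CRSDepTime = runif(100, min = 6, max = 22)
)
fit <- kmeans(toy, centers = 2, algorithm = "Lloyd", iter.max = 50)

fit$centers    # cluster means, one row per cluster
fit$size       # number of observations in each cluster
fit$withinss   # within-cluster sum of squares, one value per cluster

# The total sum of squares splits into within- and between-cluster parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```

The names centers, size, withinss, tot.withinss, totss, and betweenss also appear in the list of available components printed above, so the same style of access should carry over to the rxKmeans result.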

The cluster membership component is returned if the input is a data frame. If the input is an .xdf file, cluster membership is returned only if outFile is specified, and then not as part of the return object but as a column in the specified file. In our example, we specified an outFile, and we see the cluster membership variable when we look at the file with rxGetInfo:

rxGetInfo("AirlineDataClusterVars.xdf", getVarInfo=TRUE)
 File name: AirlineDataClusterVars.xdf 
 Number of observations: 10399755 
 Number of variables: 5 
 Number of blocks: 19 
 Compression type: zlib 
 Variable information: 
 Var 1: DayOfWeek
        7 factor levels: Mon Tues Wed Thur Fri Sat Sun
 Var 2: ArrDelay, Type: integer, Low/High: (-1233, 2453)
 Var 3: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
 Var 4: DepDelay, Type: integer, Low/High: (-1199, 2467)
 Var 5: .rxCluster, Type: integer, Low/High: (1, 5)

Using the Cluster Membership Information

A common follow-up to clustering is to use the cluster membership information to see whether a given model varies appreciably from cluster to cluster. Since we can use the rowSelection argument to extract a single cluster on the fly, there is no need to sort the data first. As an example, we fit our original linear model of ArrDelay by DayOfWeek for two of the clusters:

#   Using the Cluster Membership Information
  
clust1Lm <- rxLinMod(ArrDelay ~ DayOfWeek, "AirlineDataClusterVars.xdf",
	rowSelection = .rxCluster == 1)
clust5Lm <- rxLinMod(ArrDelay ~ DayOfWeek, "AirlineDataClusterVars.xdf", 
	rowSelection = .rxCluster == 5)
summary(clust1Lm)
summary(clust5Lm)

Looking at the summary for clust1Lm shows the following:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = "AirlineDataClusterVars.xdf", 
    rowSelection = .rxCluster == 1)

Linear Regression Results for: ArrDelay ~ DayOfWeek
File name: AirlineDataClusterVars.xdf
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 922985
Number of missing observations: 0 
 
Coefficients: (1 not defined because of singularities)
               Estimate Std. Error  t value Pr(>|t|)    
(Intercept)    45.21591    0.04237 1067.199 2.22e-16 ***
DayOfWeek=Mon   0.23053    0.05893    3.912 9.16e-05 ***
DayOfWeek=Tues -0.06496    0.05968   -1.089   0.2764    
DayOfWeek=Wed   0.10139    0.05869    1.727   0.0841 .  
DayOfWeek=Thur  0.06098    0.05708    1.068   0.2854    
DayOfWeek=Fri   0.23222    0.05660    4.103 4.08e-05 ***
DayOfWeek=Sat  -0.43444    0.06364   -6.827 8.68e-12 ***
DayOfWeek=Sun   Dropped    Dropped  Dropped  Dropped    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.89 on 922978 degrees of freedom
Multiple R-squared: 0.0001705 
Adjusted R-squared: 0.000164 
F-statistic: 26.24 on 6 and 922978 DF,  p-value: < 2.2e-16 
Condition number: 12.8655   

Similarly, the summary for clust5Lm shows the following:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = "AirlineDataClusterVars.xdf", 
    rowSelection = .rxCluster == 5)

Linear Regression Results for: ArrDelay ~ DayOfWeek
File name: AirlineDataClusterVars.xdf
Dependent variable(s): ArrDelay
Total independent variables: 8 (Including number dropped: 1)
Number of valid observations: 4190525
Number of missing observations: 0 
 
Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.808093   0.009593 813.960 2.22e-16 ***
DayOfWeek=Mon  -0.131001   0.013320  -9.835 2.22e-16 ***
DayOfWeek=Tues -0.228087   0.013374 -17.055 2.22e-16 ***
DayOfWeek=Wed  -0.035954   0.013292  -2.705  0.00683 ** 
DayOfWeek=Thur  0.231958   0.013170  17.613 2.22e-16 ***
DayOfWeek=Fri   0.313961   0.013171  23.838 2.22e-16 ***
DayOfWeek=Sat  -0.257716   0.014036 -18.361 2.22e-16 ***
DayOfWeek=Sun    Dropped    Dropped Dropped  Dropped    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.238 on 4190518 degrees of freedom
Multiple R-squared: 0.0007911 
Adjusted R-squared: 0.0007897 
F-statistic:   553 on 6 and 4190518 DF,  p-value: < 2.2e-16 
Condition number: 12.0006
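
The same per-cluster comparison can be sketched in base R for data that fits in memory. The example below is an assumption-laden illustration: the data are synthetic, the cluster column is a stand-in for .rxCluster, and lm replaces rxLinMod, so the numbers have no relation to the airline results above.

```r
# Base R analogue of per-cluster model fitting (synthetic data, not the airline file).
set.seed(10)
n <- 300
days <- c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun")
toy <- data.frame(
  DayOfWeek = factor(sample(days, n, replace = TRUE), levels = days),
  ArrDelay  = c(rnorm(n / 2, mean = 5, sd = 5), rnorm(n / 2, mean = 45, sd = 15)),
  cluster   = rep(1:2, each = n / 2)   # stand-in for the .rxCluster column
)

# Fit the same model on each cluster's rows; row subsetting plays the
# role that rowSelection plays in the rxLinMod calls.
fits <- lapply(sort(unique(toy$cluster)), function(k)
  lm(ArrDelay ~ DayOfWeek, data = toy[toy$cluster == k, ]))

# Compare intercepts and residual standard errors across clusters.
sapply(fits, function(m) coef(m)[["(Intercept)"]])
sapply(fits, function(m) summary(m)$sigma)
```

Note that lm uses the first factor level as the reference category by default, whereas the rxLinMod output above drops the last level (Sun), so intercepts from the two functions are not directly comparable without adjusting the contrasts.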