How to summarize data using RevoScaleR

Summary statistics can help you understand the characteristics and shape of an unfamiliar data set. In RevoScaleR, you can use the rxGetVarInfo function to learn more about variables in the data set, and rxSummary for statistical measures. The rxSummary function also provides a count of observations, and the number of missing values (if any).

This article teaches by example, using built-in sample data sets so that you can practice each skill. It covers the following tasks:

  • List variable metadata including name, description, type, labels, Low/High values
  • Compute summary statistics: mean, standard deviation, minimum, maximum values
  • Compute summary statistics for factor variables
  • Create temporary factor variables used in computations
  • Compute summary statistics on a row subset
  • Execute in-flight data transformations
  • Compute and plot a Lorenz curve
  • Tips for wide data sets

List variable information

The rxGetVarInfo function returns information about the variables in a data frame or .xdf file, including variable names, descriptions, type, and high and low values. The following examples, based on the built-in sample Census data set, demonstrate function usage.

# Load data, store variable metadata as censusWorkerInfo, and return variable names.

readPath <- rxGetOption("sampleDataDir")
censusWorkers <- file.path(readPath, "CensusWorkers.xdf")
censusWorkerInfo <- rxGetVarInfo(censusWorkers)
names(censusWorkerInfo)

The names function is a base R function, which is called on the object to return the following variable names:

  [1] "age" "incwage" "perwt" "sex" "wkswork1" "state"

Once you have a variable name, you can drill down further to examine its metadata. This is the output for age:

names(censusWorkerInfo$age)

  [1] "description" "varType" "storage" "low" "high"

Numeric variables include Low/High values. The Low/High values do not necessarily indicate the minimum and maximum of a numeric or integer variable. Rather, they indicate the values RevoScaleR uses to establish the lowest and highest factor levels when treating the variable as a factor. For practical purposes, this is more helpful for data visualization than a true minimum or maximum on the variable. For example, suppose you want to create histograms or hexbin plots, treating the numerical data as categorical, with levels corresponding to the plotting bins. In this scenario, it is often convenient to cut off the highest and lowest data points, and the Low/High values provide this information.

This example demonstrates getting the High value from the age variable:

censusWorkerInfo$age$high

  [1] 65

Similarly, the Low value is obtained as follows:

censusWorkerInfo$age$low

  [1] 20

Suppose you are interested in workers between the ages of 30 and 50. You could create a copy of the data file in the working directory, set the High/Low fields as follows and then treat age as a factor in subsequent analysis:

outputDir <- rxGetOption("outDataPath")
tempCensusWorkers <- file.path(outputDir, "tempCensusWorkers.xdf")
file.copy(from = censusWorkers, to = tempCensusWorkers)
censusWorkerInfo$age$low <- 35
censusWorkerInfo$age$high <- 50
rxSetVarInfo(censusWorkerInfo, tempCensusWorkers)
rxSummary(~F(age), data = tempCensusWorkers)

Results (restricted to the 35 to 50 age range):

Call:
rxSummary(formula = ~F(age), data = tempCensusWorkers)

Summary Statistics Results for: ~F(age)
File name:
    C:\YourOutDir\tempCensusWorkers.xdf
Number of valid observations: 351121 

Category Counts for F_age
Number of categories: 16
Number of valid observations: 158309
Number of missing observations: 192812

 F_age Counts
 35     9743 
 36     9888 
 37     9860 
 38    10211 
 39    10378 
 40    10756 
 41    10503 
 42    10511 
 43    10296 
 44    10122 
 45    10074 
 46     9703 
 47     9527 
 48     9093 
 49     8776 
 50     8868

To reset the low and high values, repeat the previous steps with the original values:

censusWorkerInfo$age$low <- 20
censusWorkerInfo$age$high <- 65
rxSetVarInfo(censusWorkerInfo, "tempCensusWorkers.xdf")

Compute summary statistics

The rxSummary function provides descriptive statistics using a formula argument similar to that used in R’s modeling functions. The formula specifies the independent variables to summarize.

The basic structure of a formula is a tilde symbol "~" with one or more independent or right-hand variables, separated by "+". To include all of the variables, you can append a dot (.) to the tilde as "~.".

The rxSummary function also takes a data object as the source of the variables.

For example, returning to the CensusWorkers sample, run the following script to obtain a data summary of that file:

# Formulas in rxSummary
  
readPath <- rxGetOption("sampleDataDir")
censusWorkers <- file.path(readPath, "CensusWorkers.xdf")
rxSummary(~ age + incwage + perwt + sex + wkswork1, data = censusWorkers)

For each term in the formula, the mean, standard deviation, minimum, maximum, and number of valid observations is shown. If rxSummary includes byTerm=FALSE, the observations (rows) containing missing values for any of the specified variables are omitted in calculating the summary. Cell counts for categorical variables are included.

Results

Call:
rxSummary(formula = ~age + incwage + perwt + sex + wkswork1, 
    data = censusWorkers)

Summary Statistics Results for: ~age + incwage + perwt + sex + wkswork1
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Number of valid observations: 351121 
 
 Name     Mean        StdDev       Min Max    ValidObs MissingObs
 age         40.42814    11.385017 20      65 351121   0         
 incwage  35333.83894 40444.544084  0  354000 351121   0         
 perwt       20.34423     9.633100  2     168 351121   0         
 wkswork1    48.62566     6.953843 21      52 351121   0         

Category Counts for sex
Number of categories: 2
Number of valid observations: 351121
Number of missing observations: 0

 sex    Counts
 Male   189344
 Female 161777

Compute median values

You might have noticed that rxSummary does not compute a median value for each variable. This is because rxSummary only produces statistics that can be computed in chunks, and a median computation is not a chunkable calculation. Median calculations are predicated on a sort operation, which is expensive on large data sets, and thus excluded from rxSummary.

Assuming your data set fits in memory, you could get the median value of a given variable using the base R median function:

# Load data into a dataframe
mydataframe <- rxImport(censusWorkers)

# As verification, print values of a numeric variable, such as age
mydataframe$age

# Compute the median age
median(mydataframe$age)

Results

[1] 40

For larger disk-bound datasets, you should use visualization techniques to uncover skewness in the underlying data.

Compute summary statistics on factor variables

You can obtain summary statistics on numeric data that are specific to levels within a variable, such as days of the week, months in a year, age or income levels constructed from column values, and so forth.

To do this, specify an interaction between a numeric variable and a factor variable. For example, using the sample data set AirlineDemoSmall.xdf, use the following command to request a summary of arrival delay by day of week:

rxSummary(~ ArrDelay:DayOfWeek, data = file.path(readPath, "AirlineDemoSmall.xdf")) 

Results

The request produces summary statistics for a factor variable, with output for each level.

Call:
rxSummary(formula = ~ArrDelay:DayOfWeek, data = file.path(readPath, 
    "AirlineDemoSmall.xdf"))

Summary Statistics Results for: ~ArrDelay:DayOfWeek
Data: file.path(readPath, "AirlineDemoSmall.xdf") (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/AirlineDemoSmall.xdf
Number of valid observations: 6e+05 
 
 Name               Mean     StdDev   Min Max  ValidObs MissingObs
 ArrDelay:DayOfWeek 11.31794 40.68854 -86 1490 582628   17372     

Statistics by category (7 categories):

 Category                      DayOfWeek   Means    StdDev   Min Max  ValidObs
 ArrDelay for DayOfWeek=Monday    Monday    12.025604 40.02463 -76 1017 95298   
 ArrDelay for DayOfWeek=Tuesday   Tuesday   11.293808 43.66269 -70 1143 74011   
 ArrDelay for DayOfWeek=Wednesday Wednesday 10.156539 39.58803 -81 1166 76786   
 ArrDelay for DayOfWeek=Thursday  Thursday   8.658007 36.74724 -58 1053 79145   
 ArrDelay for DayOfWeek=Friday    Friday    14.804335 41.79260 -78 1490 80142   
 ArrDelay for DayOfWeek=Saturday  Saturday  11.875326 45.24540 -73 1370 83851   
 ArrDelay for DayOfWeek=Sunday    Sunday    10.331806 37.33348 -86 1202 93395 

Note

Interactions provide the one exception to the “responseless” formula mentioned preceding. If you want to obtain the interaction of a continuous variable with one or more factors, you can use a formula of the form y ~ x:z, where y is the continuous variable and x and z are factors. This has precisely the same effect as specifying the formula as ~y:x:z, but is more suggestive of the result: specifically, summary statistics for y at every combination of levels of x and z.

Create a temporary factor variable

You can force RevoScaleR to treat a variable as a factor (with a level for each integer value from the low to high value) by wrapping it with the function call syntax F(). Returning to the CensusWorkers.xdf file, in the following example a factor level is temporarily created for each age from 20 through 65:

rxSummary(~ incwage:F(age), data = censusWorkers)

Results

The results provide not only summary statistics for wage income in the overall data set, but summary statistics on wage income for each age:

Call:
rxSummary(formula = ~incwage:F(age), data = censusWorkers)

Summary Statistics Results for: ~incwage:F(age)
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Number of valid observations: 351121 
 
 Name           Mean     StdDev   Min Max    ValidObs MissingObs
 incwage:F(age) 35333.84 40444.54 0   354000 351121   0         

Statistics by category (46 categories):

 Category              F_age Means    StdDev   Min Max    ValidObs
 incwage for F(age)=20 20    12669.94 12396.99 0   336000  6500   
 incwage for F(age)=21 21    14114.23 12107.81 0   336000  6479   
 incwage for F(age)=22 22    15982.00 12374.14 0   336000  6676   
 incwage for F(age)=23 23    18503.92 15093.53 0   336000  6884   
 incwage for F(age)=24 24    20672.06 14315.67 0   354000  6931   
 incwage for F(age)=25 25    23856.25 17319.42 0   336000  7273   
 incwage for F(age)=26 26    25938.17 20707.39 0   354000  7116   
 incwage for F(age)=27 27    26902.97 20608.09 0   354000  7584   
 incwage for F(age)=28 28    28531.59 24185.48 0   354000  8184   
 incwage for F(age)=29 29    30153.10 25715.94 0   354000  8889   
 incwage for F(age)=30 30    30691.10 26955.27 0   354000  9055   
 . . .

If you include an interaction between two factors, the summary provides cell counts for all combinations of levels of the factors. Since the census data has probability weights, you can use the pweights argument to get weighted counts:

rxSummary(~ sex:state, pweights = "perwt", data = censusWorkers)

Results

Call:
rxSummary(formula = ~sex:state, data = censusWorkers, pweights = "perwt")

Summary Statistics Results for: ~sex:state
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Probability weights: perwt
Number of valid observations: 351121 
 
Category Counts for sex
Number of categories: 6

 sex    state       Counts 
 Male   Connecticut  843736
 Female Connecticut  755843
 Male   Indiana     1517966
 Female Indiana     1289412
 Male   Washington  1504840
 Female Washington  1231489

Compute summary statistics on a row subset

The following example shows how to restrict the analysis to specific rows (in this case, people aged 30 to 39). You can use the low and high arguments of the F() function to restrict the creation of on-the-fly factor levels to the same range:

rxSummary(~ sex:F(age, low = 30, high = 39), data = censusWorkers,
      pweights="perwt", rowSelection = age >= 30 & age < 40)

Results

Call:
rxSummary(formula = ~sex:F(age, low = 30, high = 39), data = censusWorkers, 
    pweights = "perwt", rowSelection = age >= 30 & age < 40)

Summary Statistics Results for: ~sex:F(age, low = 30, high = 39)
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Probability weights: perwt
Number of valid observations: 93896
 
Category Counts for sex
Number of categories: 20

 sex    F_age_30_39_T Counts
 Male   30            103242
 Female 30             84209
 Male   31            100234
 Female 31             77947
 Male   32             96325
 Female 32             75469
 Male   33             97734
 Female 33             77133
 Male   34            103380
 Female 34             81812
 Male   35            110358
 Female 35             89681
 Male   36            113444
 Female 36             91394
 Male   37            110828
 Female 37             91563
 Male   38            113838
 Female 38             95988
 Male   39            115552
 Female 39             97209

Note

The fourth argument in the function, exclude, defaults to TRUE. This is reflected in the T that appears in the output variable name.

Write summary statistics to XDF

By-group statistics are often computed for further analysis or plotting. It can be convenient to store these results in a .xdf file, especially when there are a large number of groups. In this example, compute the mean and standard deviation of wage income and number of weeks work for each year of age for both men and women using the CensusWorkers.xdf file with data from three states:

# Writing By-Group Summary Statistics to an .xdf File
  
readPath <- rxGetOption("sampleDataDir")
censusWorkers <- file.path(readPath, "CensusWorkers.xdf")
rxSummary(~ incwage:F(age):sex + wkswork1:F(age):sex, data = censusWorkers,
	byGroupOutFile = "ByAge.xdf", 
	summaryStats = c("Mean", "StdDev", "SumOfWeights"), 
	pweights = "perwt", overwrite = TRUE)

Results

Call:
rxSummary(formula = ~incwage:F(age):sex + wkswork1:F(age):sex, 
    data = censusWorkers, byGroupOutFile = "ByAge.xdf", summaryStats = c("Mean", 
        "StdDev", "SumOfWeights"), pweights = "perwt", overwrite = TRUE)

Summary Statistics Results for: ~incwage:F(age):sex + wkswork1:F(age):sex
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Probability weights: perwt
Number of valid observations: 351121 
 
 Name               Mean       StdDev      SumOfWeights
 incwage:F_age:sex  35788.4675 40605.12565 7143286     
 wkswork1:F_age:sex    48.6373     6.94423 7143286     

By-group statistics for incwage:F(age):sex contained in C:\YourDir\ByAge.xdf 
By-group statistics for wkswork1:F(age):sex contained in C:\YourDir\ByAge.xdf

You can take a quick look at the first five rows in the data set to see that the first variables are the two factor variables determining the groups: F_age and sex. The remaining variables are the computed by-group statistics.

rxGetInfo("ByAge.xdf", numRows = 5)

Results

File name: C:\YourDir\ByAge.xdf 
Number of observations: 92 
Number of variables: 8 
Number of blocks: 1 
Compression type: zlib
Data (5 rows starting with row 1):
  F_age  sex incwage_Mean incwage_StdDev incwage_SumOfWeights wkswork1_Mean
1    20 Male     14437.68       14118.49                71089      44.29758
2    21 Male     15981.48       13191.73                71150      45.27770
3    22 Male     18258.04       13919.44                75979      46.07166
4    23 Male     20739.91       16511.88                79663      46.75025
5    24 Male     22737.17       15345.41                81412      47.51487
  wkswork1_StdDev wkswork1_SumOfWeights
1        9.755092                 71089
2        9.110743                 71150
3        8.903617                 75979
4        8.451298                 79663
5        7.942226                 81412

You can plot directly from the .xdf file to visualize the results:

rxLinePlot(incwage_Mean~F_age, groups = sex, data = "ByAge.xdf")

Transform data in-flight

You can use the transforms argument to modify your data set before computing a summary. When used in this way, the original data is unmodified and no permanent copy of the modified data is written to disk. The data summaries returned, however, reflect the modified data.

You can also transform data in the formula itself, by specifying simple functions of the original variables. For example, you can get a summary based on the natural logarithm of a variable as follows:

# Transform data in rxSummary
rxSummary(~ log(incwage), data = censusWorkers)

Results

Call:
rxSummary(formula = ~log(incwage), data = censusWorkers)

Summary Statistics Results for: ~log(incwage)
Data: censusWorkers (RxXdfData Data Source)
File name: C:/Program Files/Microsoft/ML Server/R_SERVER/library/RevoScaleR/SampleData/CensusWorkers.xdf
Number of valid observations: 351121 

 Name         Mean     StdDev    Min      Max      ValidObs MissingObs
 log(incwage) 10.19694 0.8387598 1.386294 12.77705 331625   19496

Compute and plot Lorenz curves

The Lorenz curves were originally developed to illustrate income inequality. For example, it can show us what percentage of total income is attributed to the lowest earning 10% of the population. The rxLorenz function from RevoScaleR provides a "big data" version, using approximate quantiles to quickly compute the cumulative distribution by quantile in a single pass through the data.

The rxLorenz function requires an orderVarName, the name of the variable used to compute the quantiles. A separate valueVarName can also be specified. This is the name of the variable used to compute the mean values by quantile. By default, the same variable is used for both. We can continue to use the Census Workers data set as an example, computing a Lorenz curve for the distribution of income:

lorenzOut <- rxLorenz(orderVarName = "incwage", data = censusWorkers,
	pweights = "perwt")
head(lorenzOut)

    cumVals  percents
1 0.0000000  0.000000
2 0.3642878  9.126934
3 0.4005827  9.363072
4 0.5368416 10.181295
5 0.5666976 10.353428
6 0.6241598 10.667766

The returned object contains the cumulative values and the percentages. Using the plot method for rxLorenz, we can get a visual representation of the income distribution and compare it with the horizontal line representing perfect equality:

plot(lorenzOut)

The Gini coefficient is often used as a summary statistic for Lorenz curves. It is computed by estimating the ratio of the area between the line of equality and the Lorenz curve to the total area under the line of equality (using trapezoidal integration). The Gini coefficient can range from 0 to 1, with 0 representing perfect equality. We can compute it from using the output from rxLorenz:

giniCoef <- rxGini(lorenzOut)
giniCoef

  [1] 0.4491421

Tips for wide data sets

Building a better understanding of the distribution of each variable and the relationships between variables improves decision making during the modeling phase. This is especially true for wide data sets containing hundreds if not thousands of variables. For wide data sets, the main goal of data exploration should be to find any outliers or influential data points, identify redundant variables or correlated variables and transform or combine any variables that seem appropriate.

The functions rxGetVarInfo and rxSummary provide useful information can help in this effort, but these two functions may need to be used differently when the data contain many variables.

As a first step, import the data to an .xdf so that you can execute functions providing metadata and summary statistics. Recall that rxGetVarInfo returns metadata about the data object. After loading data into an XDF data source, the data type information can easily be accessed using the rxGetVarInfo function.

Because wide data has so many variables, printed output can be hard to read. As an alternative, save the variable information to an object that can serve as in informal data dictionary. We demonstrate this using the Claims data from the sample data directory:

readPath <- rxGetOption("sampleDataDir") 
censusWorkers <- file.path(readPath, "CensusWorkers.xdf") 
censusDataDictionary <-rxGetVarInfo(censusWorkers)

We can then obtain the information for an individual variable as follows:

censusDataDictionary$age

Age 
       Type: integer, Low/High: (20, 65)

The rxSummary function is a great way to look at the distribution of individual variables and identify outliers. With wide data, you want to store the results of this function into an object. This object contains a data frame with the results of the numeric variables, “sDataFrame”, and a list of data frames with the counts for each categorical variable,”categorical”:

readPath <- rxGetOption("sampleDataDir") 
censusWorkers <- file.path(readPath, "CensusWorkers.xdf") 
censusSummary <- rxSummary(~ age + incwage + perwt + sex + wkswork1, 
    data = censusWorkers)
names(censusSummary)

[1] "nobs.valid"       "nobs.missing"     "sDataFrame"       "categorical"     
[5] "params"           "formula"          "call"             "categorical.type"

Printing the rxSummary results to the console wouldn’t be useful with so many variables. Saving the results in an object allows us to not only access the summary results programmatically, but also to view the results separately for numeric and categorical variables. We access the sDataFrame (printed to show structure) as follows:

censusSummary$sDataFrame
      Name        Mean       StdDev Min    Max ValidObs MissingObs
1      age    40.42814    11.385017  20     65   351121          0
2  incwage 35333.83894 40444.544084   0 354000   351121          0
3    perwt    20.34423     9.633100   2    168   351121          0
4      sex          NA           NA  NA     NA   351121          0
5 wkswork1    48.62566     6.953843  21     52   351121          0

To view the categorical variables, we access the categorical component:

censusSummary$categorical
[[1]]
     sex Counts
1   Male 189344
2 Female 161777

Another key piece of data exploration for wide data is looking at the relationships between variables to find variables that are measuring the same information or variables that are correlated. With variables that are measuring the same information, perhaps one stands out as being representative of the group. During data exploration, you must rely on your domain knowledge to group variables into related sets or prioritize variables that are important based on the field or industry. Paring the data set down to related sets allow you to look more closely at redundancy and relatedness within each set.

When looking for correlation between variables the function rxCrosstabs is useful. In Crosstabs, you see how to use rxCrosstabs and rxLinePlot to graph the relationship between two variables. Graphs allow for a quick view of the relationship between two variables, which comes in handy when you have many variables to consider. For more information, see Visualizing Huge Data Sets: An Example from the U.S. Census.

See also