Fitting Linear Models using RevoScaleR

Linear regression models are fitted in RevoScaleR using the rxLinMod function. Like other RevoScaleR functions, rxLinMod uses an updating algorithm to compute the regression model. The R object returned by rxLinMod includes the estimated model coefficients and the call used to generate the model, together with other information that allows RevoScaleR to recompute the model—because rxLinMod is designed to work with arbitrarily large data sets, quantities such as residuals, and fitted values are not included in the return object, although these can be obtained easily once the model has been fitted.

As a simple example, let’s use the sample data set AirlineDemoSmall.xdf and fit the arrival delay by day of week:

# Fitting Linear Models

readPath <- rxGetOption("sampleDataDir")
airlineDemoSmall <- file.path(readPath, "AirlineDemoSmall.xdf")
rxLinMod(ArrDelay ~ DayOfWeek, data = airlineDemoSmall)

  Call:
  rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall)

  Linear Regression Results for: ArrDelay ~ DayOfWeek
  File name:
      C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
  Dependent variable(s): ArrDelay
  Total independent variables: 8 (Including number dropped: 1)
  Number of valid observations: 582628
  Number of missing observations: 17372

  Coefficients:
                        ArrDelay
  (Intercept)         10.3318058
  DayOfWeek=Monday     1.6937981
  DayOfWeek=Tuesday    0.9620019
  DayOfWeek=Wednesday -0.1752668
  DayOfWeek=Thursday  -1.6737983
  DayOfWeek=Friday     4.4725290
  DayOfWeek=Saturday   1.5435207
  DayOfWeek=Sunday       Dropped

Because our predictor is categorical, we can use the cube argument to rxLinMod to perform the regression using a partitioned inverse, which may be faster and may use less memory than the standard algorithm. The output object also includes a data frame with the averages or counts for each category:

arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, cube = TRUE,
    data = airlineDemoSmall)

Typing the name of the object arrDelayLm1 yields the following output:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall,
    cube = TRUE)

Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 582628
Number of missing observations: 17372

Coefficients:
                     ArrDelay
DayOfWeek=Monday    12.025604
DayOfWeek=Tuesday   11.293808
DayOfWeek=Wednesday 10.156539
DayOfWeek=Thursday   8.658007
DayOfWeek=Friday    14.804335
DayOfWeek=Saturday  11.875326
DayOfWeek=Sunday    10.331806

Obtaining a Summary of the Model

The print method for the rxLinMod model object shows only the call and the coefficients. You can obtain more information about the model by calling the summary method:

# Obtaining a Summary of a Model

summary(arrDelayLm1)

This produces the following output, which includes substantially more information about the model coefficients, together with the residual standard error, multiple R-squared, and adjusted R-squared:

Call:
rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airlineDemoSmall,
    cube = TRUE)

Cube Linear Regression Results for: ArrDelay ~ DayOfWeek
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\ RevoScaleR\SampleData\AirlineDemoSmall.xdf
Dependent variable(s): ArrDelay
Total independent variables: 7
Number of valid observations: 582628
Number of missing observations: 17372

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)     | Counts
DayOfWeek=Monday     12.0256     0.1317   91.32 2.22e-16 *** |  95298
DayOfWeek=Tuesday    11.2938     0.1494   75.58 2.22e-16 *** |  74011
DayOfWeek=Wednesday  10.1565     0.1467   69.23 2.22e-16 *** |  76786
DayOfWeek=Thursday    8.6580     0.1445   59.92 2.22e-16 *** |  79145
DayOfWeek=Friday     14.8043     0.1436  103.10 2.22e-16 *** |  80142
DayOfWeek=Saturday   11.8753     0.1404   84.59 2.22e-16 *** |  83851
DayOfWeek=Sunday     10.3318     0.1330   77.67 2.22e-16 *** |  93395
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.65 on 582621 degrees of freedom
Multiple R-squared: 0.001869 (as if intercept included)
Adjusted R-squared: 0.001858
F-statistic: 181.8 on 6 and 582621 DF,  p-value: < 2.2e-16
Condition number: 1

Using Probability Weights

Probability weights are common in survey data; they represent the probability that a case was selected into the sample from the population, and are calculated as the inverse of the sampling fraction. The variable perwt in the census data represents a probability weight. You pass probability weights to the RevoScaleR analysis functions using the pweights argument, as in the following example:

#  Using Probability Weights

rxLinMod(incwage ~ F(age), pweights = "perwt", data = censusWorkers)

Yields the following output:

Call:
rxLinMod(formula = incwage ~ F(age), data = censusWorkers, pweights = "perwt")

Linear Regression Results for: incwage ~ F(age)
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
Probability weights: perwt
Dependent variable(s): incwage
Total independent variables: 47 (Including number dropped: 1)
Number of valid observations: 351121
Number of missing observations: 0

Coefficients:
                incwage
(Intercept)  32111.3012
F_age=20    -19488.0874
F_age=21    -18043.7738
F_age=22    -15928.5538
F_age=23    -13498.2138
F_age=24    -11248.8583
F_age=25     -7948.6603
F_age=26     -5667.3599
F_age=27     -4682.8375
F_age=28     -3024.1648
F_age=29     -1480.8682
F_age=30      -971.4461
F_age=31       143.5648
F_age=32      2009.0019
F_age=33      2861.3031
F_age=34      2797.9986
F_age=35      4038.2131
F_age=36      5511.8633
F_age=37      5148.1841
F_age=38      5957.2190
F_age=39      7652.9870
F_age=40      6969.9630
F_age=41      8387.6821
F_age=42      9150.0006
F_age=43      8936.7590
F_age=44      9267.0820
F_age=45     10148.1702
F_age=46      9099.0659
F_age=47      9996.0450
F_age=48     10408.1712
F_age=49     10281.1324
F_age=50     12029.6751
F_age=51     10247.2529
F_age=52     12105.0654
F_age=53     10957.8211
F_age=54     11881.4307
F_age=55     11490.8457
F_age=56      9442.3856
F_age=57      9331.2347
F_age=58      9353.1980
F_age=59      7526.4912
F_age=60      6078.4463
F_age=61      4636.6756
F_age=62      3129.5531
F_age=63      2535.4858
F_age=64      1520.3707
F_age=65        Dropped

Using Frequency Weights

Frequency weights are useful when your data has a particular form: a set of data in which one or more cases is exactly replicated. If you then compact the data to remove duplicated observations and create a variable to store the number of replications of each observation, you can use that new variable as a frequency weights variable in the RevoScaleR analysis functions.

For example, the sample data set fourthgraders.xdf contains the following 44 rows:

   height eyecolor   male reps
14     44     Blue   Male    1
23     44    Brown   Male    3
9      44    Brown Female    1
38     44    Green Female    1
8      45     Blue   Male    1
31     45     Blue Female    5
2      45    Brown   Male    2
51     45    Brown Female    1
6      45    Green   Male    1
47     45    Hazel Female    1
37     46     Blue   Male    1
21     46     Blue Female    3
12     46    Brown   Male    2
5      46    Brown Female    4
19     46    Green   Male    1
80     46    Green Female    3
64     47     Blue   Male    1
69     47     Blue Female    3
39     47    Brown   Male    4
18     47    Brown Female    3
41     47    Green   Male    2
7      47    Green Female    4
26     47    Hazel   Male    3
17     48     Blue   Male    4
36     48     Blue Female    3
13     48    Brown   Male    6
62     48    Brown Female    4
61     48    Green Female    2
56     48    Hazel   Male    1
10     49     Blue   Male    1
15     49     Blue Female    5
22     49    Brown   Male    2
4      49    Brown Female    5
3      49    Green   Male    3
91     49    Green Female    1
94     50     Blue   Male    1
96     50     Blue Female    1
97     50    Brown   Male    1
1      50    Brown Female    1
28     50    Green   Male    4
52     50    Hazel   Male    1
98     51     Blue   Male    1
57     51    Brown Female    1
90     51    Green Female    1

The reps column shows the number of replications for each observation; the sum of the reps column indicates the total number of observations, in this case 100. We can fit a model (admittedly not very useful) of height on eye color with rxLinMod as follows:

#  Using Frequency Weights

fourthgraders <- file.path(rxGetOption("sampleDataDir"),
    "fourthgraders.xdf")
fourthgradersLm <- rxLinMod(height ~ eyecolor, data = fourthgraders,
    fweights="reps")

Typing the name of the object shows the following output:

Call:
rxLinMod(formula = height ~ eyecolor, data = fourthgraders, fweights = "reps")

Linear Regression Results for: height ~ eyecolor
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\fourthgraders.xdf
Linear Regression Results for: height ~ eyecolor
Frequency weights: reps
Dependent variable(s): height
Total independent variables: 5 (Including number dropped: 1)
Number of valid observations: 44
Number of missing observations: 0

Coefficients:
                    height
(Intercept)    47.33333333
eyecolor=Blue  -0.01075269
eyecolor=Brown -0.08333333
eyecolor=Green  0.40579710
eyecolor=Hazel     Dropped

Using rxLinMod with R Data Frames

While RevoScaleR is primarily designed to work with data on disk, it can also be used with in-memory R data frames. As an example, consider the R sample data frame “cars”, which contains data recorded in the 1920s on the speed of cars and the distances taken to stop.

#  Using rxLinMod with R Data Frames

rxLinMod(dist ~ speed, data = cars)

We get the following output (which matches the output given by the R function lm):

Call:
rxLinMod(formula = dist ~ speed, data = cars)

Linear Regression Results for: dist ~ speed
Data: cars
Dependent variable(s): dist
Total independent variables: 2
Number of valid observations: 50
Number of missing observations: 0

Coefficients:
                  dist
(Intercept) -17.579095
speed         3.932409

Using the Cube Option for Conditional Predictions

If the cube argument is set to TRUE and the first term of the independent variables is categorical, rxLinMod will compute and return a data frame with category counts. If there are no other independent variables, or if the cubePredictions argument is set to TRUE, the data frame will also contain predicted values. Let’s create a simple data frame to illustrate:

#  Using the Cube Option for Conditional Predictions

xfac1 <- factor(c(1,1,1,1,2,2,2,2,3,3,3,3), labels=c("One1", "Two1", "Three1"))
xfac2 <- factor(c(1,1,1,1,1,1,2,2,2,3,3,3), labels=c("One2", "Two2", "Three2"))
set.seed(100)
y <- as.integer(xfac1) + as.integer(xfac2)* 2 + rnorm(12)
myData <- data.frame(y, xfac1, xfac2)

If we estimate a simple linear regression of y on xfac1, the coefficients are equal to the within- group means:

myLinMod <- rxLinMod(y ~ xfac1, data = myData, cube = TRUE)
myLinMod

  Call:
  rxLinMod(formula = y ~ xfac1, data = myData, cube = TRUE)

  Cube Linear Regression Results for: y ~ xfac1
  Data: myData
  Dependent variable(s): y
  Total independent variables: 3
  Number of valid observations: 12
  Number of missing observations: 0

  Coefficients:
                      y
  xfac1=One1   3.109302
  xfac1=Two1   5.142086
  xfac1=Three1 8.250260

In addition to the standard output, the returned object contains a countDF data frame with information on within-group means and counts in each category. In this simple case the within-group means are the same as the coefficients:

myLinMod$countDF
   xfac1        y Counts
1   One1 3.109302      4
2   Two1 5.142086      4
3 Three1 8.250260      4

Using rxLinMod with the cube option also allows us to compute conditional within-group means. For example:

myLinMod1 <- rxLinMod(y~xfac1 + xfac2, data = myData,
    cube = TRUE, cubePredictions = TRUE)
myLinMod1

  Call:
  rxLinMod(formula = y ~ xfac1 + xfac2, data = myData, cube = TRUE,
  cubePredictions=TRUE)

  Cube Linear Regression Results for: y ~ xfac1 + xfac2
  Data: myData
  Dependent variable(s): y
  Total independent variables: 6 (Including number dropped: 1)
  Number of valid observations: 12
  Number of missing observations: 0

  Coefficients:
                       y
  xfac1=One1    7.725231
  xfac1=Two1    8.833730
  xfac1=Three1  8.942099
  xfac2=One2   -4.615929
  xfac2=Two2   -2.767359
  xfac2=Three2   Dropped

If we look at the countDF, we see the within-group means, conditional on the average value of the conditioning variable, xfac2:

myLinMod1$countDF
     xfac1        y Counts
  1   One1 4.725427      4
  2   Two1 5.833926      4
  3 Three1 5.942295      4

For the variable xfac2, 50% of the observations have the value “One2”, 25% of the observations have the value “Two2”, and 25% have the value “Three2”. We can compute the weighted average of the coefficients for xfac2 as:

myCoef <- coef(myLinMod1)
avexfac2c <- .5*myCoef[4] + .25*myCoef[5]

To compute the conditional within-group mean shown in the countDF for xfac1 equal to “One1”, we add this to the coefficient computed for “One1”:

condMean1 <- myCoef[1] + avexfac2c
condMean1

  xfac1=One1
    4.725427

Conditional within-group means can also be computed using additional continuous independent variables.

Fitted Values, Residuals, and Prediction

When you fit a model with lm or any of the other core R model-fitting functions, you get back an object that includes as components both the fitted values for the response variable and the model residuals. For models fit with rxLinMod or other RevoScaleR functions, it is usually impractical to include these components, as they can be many megabytes in size. Instead, they are computed on demand using the rxPredict function. This function takes an rxLinMod object as its first argument, an input data set as its second argument, and an output data set as its third argument. If the input data set is the same as the data set used to fit the rxLinMod object, the resulting predictions are the fitted values for the model. If the input data set is a different data set (but one containing the same variable names used in fitting the rxLinMod object), the resulting predictions are true predictions of the response for the new data from the original model. In either case, residuals for the predicted values can be obtained by setting the flag computeResiduals to TRUE.

For example, we can draw from the 7% sample of the large airline data set (available online) training and prediction data sets as follows (remember to customize the first line below for your own system):

bigDataDir <- "C:/MRS/Data"
sampleAirData <- file.path(bigDataDir, "AirOnTime7Pct.xdf")
trainingDataFile <- "AirlineData06to07.xdf"
targetInfile <- "AirlineData08.xdf"

rxDataStep(sampleAirData, trainingDataFile, rowSelection = Year == 1999 |
    Year == 2000 | Year == 2001 | Year == 2002 | Year == 2003 |
    Year == 2004 | Year == 2005 | Year == 2006 | Year == 2007)
rxDataStep(sampleAirData, targetInfile, rowSelection = Year == 2008)

We can then fit a linear model with the training data and compute predicted values on the prediction data set as follows:

arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier + Dest,
    data = trainingDataFile)
rxPredict(arrDelayLm2, data = targetInfile, outData = targetInfile)

To see the first few rows of the result, use rxGetInfo as follows:

rxGetInfo(targetInfile,  numRows = 5)

  File name: C:\MRS\Data\OutData\AirlineData08.xdf
  Number of observations: 489792
  Number of variables: 14
  Number of blocks: 2
  Compression type: zlib
  Data (5 rows starting with row 1):
    Year DayOfWeek UniqueCarrier Origin Dest CRSDepTime DepDelay TaxiOut TaxiIn
  1 2008       Mon            9E    ATL  HOU   7.250000       -2      14      5
  2 2008       Mon            9E    ATL  HOU   7.250000       -2      16      2
  3 2008       Sat            9E    ATL  HOU   7.250000        0      18      5
  4 2008       Wed            9E    ATL  SRQ  12.666667        6      20      4
  5 2008       Sat            9E    HOU  ATL   9.083333       -9      15      9
    ArrDelay ArrDel15 CRSElapsedTime Distance ArrDelay_Pred
  1      -15    FALSE            140      696      8.981548
  2        0    FALSE            140      696      8.981548
  3        0    FALSE            140      696      5.377572
  4        6    FALSE             86      445      7.377952
  5      -23    FALSE            120      696      6.552242

Prediction Standard Errors, Confidence Intervals, and Prediction Intervals

You can also use rxPredict to obtain prediction standard errors, provided you have included the variance-covariance matrix in the original rxLinMod fit. If you choose to compute the prediction standard errors, you can also obtain either of two kinds of intervals: confidence intervals that for a given confidence level tells us how confident we are that the expected value is within the given interval, and prediction intervals that specify, for a given confidence level, how likely future observations are to fall within the interval given what has already been observed. Standard error computations are computationally intensive, and they may become prohibitive on large data sets with a large number of predictors. To illustrate the computation, we start with a small data set, using the example on page 132 of Introduction to the Practice of Statistics, (5th Edition). The predictor, neaInc, is the increase in “non-exercise activity” in response to an increase in caloric intake. The response, fatGain, is the associated increase in fat. We first read in the data and create a data frame to use in our analysis:

# Standard Errors, Confidence Intervals, and Prediction Intervals

neaInc <- c(-94, -57, -29, 135, 143, 151, 245, 355, 392, 473, 486, 535, 571,
    580, 620, 690)
fatGain <- c( 4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3, 3.8, 1.7, 1.6, 2.2, 1.0,
    0.4, 2.3, 1.1)
ips132df <- data.frame(neaInc = neaInc, fatGain=fatGain)

Next we fit the linear model with rxLinMod, setting covCoef=TRUE to ensure we have the variance-covariance matrix in our model object:

ips132lm <- rxLinMod(fatGain ~ neaInc, data=ips132df, covCoef=TRUE)

Now we use rxPredict to obtain the fitted values, prediction standard errors, and confidence intervals. By setting writeModelVars to TRUE, the variables used in the model will also be included in the output data set. In this first example, we obtain confidence intervals:

ips132lmPred <- rxPredict(ips132lm, data=ips132df, computeStdErrors=TRUE,
    interval="confidence", writeModelVars = TRUE)

The standard errors are by default put into a variable named by concatenating the name of the response variable with an underscore and the string “StdErr”:

ips132lmPred$fatGain_StdErr

 [1] 0.3613863 0.3381111 0.3209344 0.2323853 0.2288433 0.2254018 0.1941840
 [8] 0.1863180 0.1915656 0.2151569 0.2202366 0.2418892 0.2598922 0.2646223
[15] 0.2865818 0.3279390

We can view the original data, the fitted prediction line, and the confidence intervals as follows:

rxLinePlot(fatGain + fatGain_Pred + fatGain_Upper + fatGain_Lower ~ neaInc,
    data = ips132lmPred, type = "b",
    lineStyle = c("blank", "solid", "dotted", "dotted"),
    lineColor = c(NA, "red", "black", "black"),
    symbolStyle = c("solid circle", "blank", "blank", "blank"),
    title = "Data, Predictions, and Confidence Bounds",
    xTitle = "Increase in Non-Exercise Activity",
    yTitle = "Increse in Fat", legend = FALSE)

The resulting plot is shown below:

The prediction intervals can be obtained and plotted as follows:

ips132lmPred2 <- rxPredict(ips132lm, data=ips132df, computeStdErrors=TRUE,
    interval="prediction", writeModelVars = TRUE)
rxLinePlot(fatGain + fatGain_Pred + fatGain_Upper + fatGain_Lower ~ neaInc,
    data = ips132lmPred2, type = "b",
    lineStyle = c("blank", "solid", "dotted", "dotted"),
    lineColor = c(NA, "red", "black", "black"),
    symbolStyle = c("solid circle", "blank", "blank", "blank"),
    title = "Prediction Intervals",
    xTitle = "Increase in Non-Exercise Activity",
    yTitle = "Increse in Fat", legend = FALSE)

The resulting plot is shown below:

We can fit the prediction standard errors on our large airline regression model if we first refit it with covCoef=TRUE:

arrDelayLmVC <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier + Dest,
    data = trainingDataFile, covCoef=TRUE)

We can then obtain the prediction standard errors and a confidence interval as before:

rxPredict(arrDelayLmVC, data = targetInfile, outData = targetInfile,
    computeStdErrors=TRUE, interval = "confidence", overwrite=TRUE)

We can then look at the first few lines of targetInfile to see the first few predictions and standard errors:

rxGetInfo(targetInfile, numRows=10)

  File name: C:\YourOutputPath\AirlineData08.xdf
  Number of observations: 489792
  Number of variables: 17
  Number of blocks: 2
  Compression type: zlib
  Data (10 rows starting with row 1):
     Year DayOfWeek UniqueCarrier Origin Dest CRSDepTime DepDelay TaxiOut TaxiIn
  1  2008       Mon            9E    ATL  HOU   7.250000       -2      14      5
  2  2008       Mon            9E    ATL  HOU   7.250000       -2      16      2
  3  2008       Sat            9E    ATL  HOU   7.250000        0      18      5
  4  2008       Wed            9E    ATL  SRQ  12.666667        6      20      4
  5  2008       Sat            9E    HOU  ATL   9.083333       -9      15      9
  6  2008       Mon            9E    ATL  IAH  16.416666       -3      25      4
  7  2008      Thur            9E    IAH  ATL  18.500000        9      12      7
  8  2008       Sat            9E    IAH  ATL  18.500000       -3      19      8
  9  2008       Mon            9E    IAH  ATL  18.500000       -5      16      6
  10 2008      Thur            9E    ATL  IAH  13.250000       -2      15      7
     ArrDelay ArrDel15 CRSElapsedTime Distance ArrDelay_Pred ArrDelay_StdErr
  1       -15    FALSE            140      696      8.981548       0.3287748
  2         0    FALSE            140      696      8.981548       0.3287748
  3         0    FALSE            140      696      5.377572       0.3293198
  4         6    FALSE             86      445      7.377952       0.6380760
  5       -23    FALSE            120      696      6.552242       0.2838803
  6       -11    FALSE            145      689      7.530244       0.2952031
  7        -8    FALSE            125      689     12.168922       0.2833356
  8       -16    FALSE            125      689      6.552242       0.2838803
  9       -24    FALSE            125      689     10.156218       0.2833283
  10       21     TRUE            130      689      9.542948       0.2952058
     ArrDelay_Lower ArrDelay_Upper
  1        8.337161       9.625935
  2        8.337161       9.625935
  3        4.732117       6.023028
  4        6.127346       8.628559
  5        5.995847       7.108638
  6        6.951657       8.108832
  7       11.613594      12.724249
  8        5.995847       7.108638
  9        9.600905      10.711532
  10       8.964355      10.121540

Stepwise Variable Selection

Stepwise linear regression is an algorithm that helps you determine which variables are most important to a regression model. You provide a minimal, or lower, model formula and a maximal, or upper, model formula, and using forward selection, backward elimination, or bidirectional search, the algorithm determines the model formula that provides the best fit based on an AIC selection criterion.

In SAS, stepwise linear regression is implemented through PROC REG. In open source R, it is implemented through the function step. The problem with using the function step in R is that the size of the data set that can be analyzed is severely limited by the requirement that all computations must be done in memory.

RevoScaleR provides an implementation of stepwise linear regression that is not constrained by the use of "in-memory" algorithms. Stepwise linear regression in RevoScaleR is implemented by the functions rxLinMod and rxStepControl.

Stepwise linear regression begins with an initial model of some sort. Consider, for example, the airline training data set AirlineData06to07.xdf we created in section 8.6:

#  Stepwise Linear Regression

rxGetVarInfo(trainingDataFile)
  Var 1: Year, Type: integer, Low/High: (1999, 2007)
  Var 2: DayOfWeek
         7 factor levels: Mon Tues Wed Thur Fri Sat Sun
  Var 3: UniqueCarrier
         30 factor levels: AA US AS CO DL ... OH F9 YV 9E VX
  Var 4: Origin
         373 factor levels: JFK LAX HNL OGG DFW ... IMT ISN AZA SHD LAR
  Var 5: Dest
         377 factor levels: LAX HNL JFK OGG DFW ... ESC IMT ISN AZA SHD
  Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 23.9833)
  Var 7: DepDelay, Type: integer, Low/High: (-1199, 1930)
  Var 8: TaxiOut, Type: integer, Low/High: (0, 1439)
  Var 9: TaxiIn, Type: integer, Low/High: (0, 1439)
  Var 10: ArrDelay, Type: integer, Low/High: (-926, 1925)
  Var 11: ArrDel15, Type: logical, Low/High: (0, 1)
  Var 12: CRSElapsedTime, Type: integer, Low/High: (-34, 1295)
  Var 13: Distance, Type: integer, Low/High: (11, 4962)

We are interested in fitting a model that will predict arrival delay (ArrDelay) as a function of some of the other variables. To keep things simple, we’ll start with our by now familiar model of arrival delay as a function of DayOfWeek and CRSDepTime:

initialModel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
    data = trainingDataFile)
initialModel

  Call:
  rxLinMod(formula = ArrDelay ~ DayOfWeek + CRSDepTime, data = trainingDataFile)

  Linear Regression Results for: ArrDelay ~ DayOfWeek + CRSDepTime
  File name:
      C:\MyWorkingDir\Documents\MRS\LMPrediction\AirlineData06to07.xdf
  Dependent variable(s): ArrDelay
  Total independent variables: 9 (Including number dropped: 1)
  Number of valid observations: 3945964
  Number of missing observations: 98378

  Coefficients:
                    ArrDelay
  (Intercept)    -4.91416806
  DayOfWeek=Mon   0.49172258
  DayOfWeek=Tues -1.41878850
  DayOfWeek=Wed   0.09677481
  DayOfWeek=Thur  2.52841304
  DayOfWeek=Fri   3.29474667
  DayOfWeek=Sat  -2.86217838
  DayOfWeek=Sun      Dropped
  CRSDepTime      0.86703378

The question is, can we improve this model by adding more predictors? If so, which ones? To use stepwise selection in RevoScaleR, you add the variableSelection argument to your call to rxLinMod. The variableSelection argument is a list, most conveniently created by using the rxStepControl function. Using rxStepControl, you specify the method (the default, "stepwise", specifies a bidirectional search), the scope (lower and upper formulas for the search), and various control parameters. With our model, we first want to try some more numeric predictors, so we specify our model as follows:

airlineStepModel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
    data = trainingDataFile,
    variableSelection = rxStepControl(method="stepwise",
        scope = ~ DayOfWeek + CRSDepTime + CRSElapsedTime +
            Distance + TaxiIn + TaxiOut ))
  Call:
  rxLinMod(formula = ArrDelay ~ DayOfWeek + CRSDepTime, data = trainingDataFile,
      variableSelection = rxStepControl(method = "stepwise", scope = ~DayOfWeek +
          CRSDepTime + CRSElapsedTime + Distance + TaxiIn + TaxiOut))

  Linear Regression Results for: ArrDelay ~ DayOfWeek + CRSDepTime +
      CRSElapsedTime + Distance + TaxiIn + TaxiOut
  File name:
      C:\MyWorkingDir\Documents\MRS\LMPrediction\AirlineData06to07.xdf
  Dependent variable(s): ArrDelay
  Total independent variables: 13 (Including number dropped: 1)
  Number of valid observations: 3945964
  Number of missing observations: 98378

  Coefficients:
                    ArrDelay
  (Intercept)    -9.87474950
  DayOfWeek=Mon   0.09830827
  DayOfWeek=Tues -1.81998781
  DayOfWeek=Wed  -0.62761606
  DayOfWeek=Thur  1.53941124
  DayOfWeek=Fri   2.45565510
  DayOfWeek=Sat  -2.20111789
  DayOfWeek=Sun      Dropped
  CRSDepTime      0.75995355
  CRSElapsedTime -0.20256102
  Distance        0.02135381
  TaxiIn          0.16194266
  TaxiOut         0.99814931

Methods of Variable Selection

Three methods of variable selection are supported by rxLinMod:

  • "forward": starting from the minimal model, variables are added one at a time until no additional variable satisfies the selection criterion, or until the maximal model is reached.

  • "backward": starting from the maximal model, variables are removed one at a time until the removal of another variable won’t satisfy the selection criterion, or until the minimal model is reached.

  • "stepwise" (the default): a combination of forward and backward selection, in which variables are added to the minimal model, but at each step, the model is reanalyzed to see if any variables that have been added are candidates for deletion from the current model.

You specify the desired method by supplying a named component "method" in the list supplied for the variableSelection argument, or by specifying the method argument in a call to rxStepControl that is then passed as the variableSelection argument.

Variable Selection with Wide Data

We’ve found that generalized linear models do not converge if the number of predictors is greater than the number of observations. If your data has more variables, it won’t be possible to include all of them in the maximal model for stepwise selection. We recommend you use domain experience and insights from initial data explorations to choose a subset of the variables to serve as the maximal model before performing stepwise selection.

There are a few things you can do to reduce the number of predictors. If there are many variables that measure the same quantitative or qualitative entity, try to select one variable that represents the entity best. For example, include a variable that identifies a person’s political party affiliation instead of including many variables representing how the person feels about individual issues. If your goal with linear modeling is the interpretation of individual predictors, you want to ensure that correlation between variables in the model is minimal to avoid multicollinearity. This means checking for these correlations before modeling. Sometimes it is useful to combine correlated variables into a composite variable. Height and weight are often correlated, but can be transformed into BMI. Combining variables allows you to reduce the number of variables without losing any information. Ultimately, the variables you select will depend on how you plan to use the results of your linear model.

Specifying Model Scope

You use the scope argument in rxStepControl (or a named component "scope" in the list supplied for the variableSelection argument) to specify which variables should be considered for inclusion in the final model selection and which should be ignored. Whether you specify a separate value for scope also determines which models the algorithm tries next.

You can specify the scope as a simple formula (which will be treated as the upper bound, or maximal model), or as a named list with components "lower" and "upper" (either of which may be missing). For example, to analyze the iris data with a minimal model involving the single predictor Sepal.Width and a maximal model involving Sepal.Width, Petal.Length, and the interaction between Petal.Width and Species, we can specify our variable selection model as follows:

#   Specifying Model Scope

form <- Sepal.Length ~ Sepal.Width + Petal.Length
scope <- list(
    lower = ~ Sepal.Width,
    upper = ~ Sepal.Width + Petal.Length + Petal.Width * Species)

varsel <- rxStepControl(method = "stepwise", scope = scope)
rxlm.step <- rxLinMod(form, data = iris, variableSelection = varsel,
    verbose = 1, dropMain = FALSE, coefLabelStyle = "R")

In general, the models considered are determined from the scope argument as follows:

  • "lower" scope only: All models considered will include this lower model up to the base model specified in the formula argument to rxLinMod.

  • "upper" scope only: All models considered will include the terms in the base model (specified by the formula argument to rxLinMod), plus additional terms as specified in the upper scope.

  • Both "lower" and "upper" scope supplied: All models considered will include at least the terms in the lower scope and additional terms are added using terms from the upper scope until the stopping criterion is reached or the full upper scope model is selected.

  • No scope supplied: The minimal model includes just the intercept term and the maximal model include at most the terms specified in the base model.

(This convention is identical to that used by R’s step function.)

Specifying the Selection Criterion

By default, variable selection is determined using the Akaike Information Criterion, or AIC; this is the R standard. If you want a stepwise selection that is more SAS-like, you can specify stepCriterion="SigLevel". If this is set, rxLinMod uses either an F-test (default) or Chi-square test to determine whether to add or drop terms from the model. You can specify which test to perform using the test argument. For example, we can refit our airline model using the SigLevel step criterion with an F-test as follows:

#  
# Specifying selection criterion
#  
airlineStepModelSigLevel <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
    data = trainingDataFile, variableSelection =
        rxStepControl( method = "stepwise", scope = ~ DayOfWeek +
            CRSDepTime + CRSElapsedTime + Distance + TaxiIn + TaxiOut,
            stepCriterion = "SigLevel" ))

In this case (and with many other well-behaved models), the results of using the SigLevel step criterion are identical to those using AIC.

You can control the significance levels for adding and dropping models using the maxSigLevelToAdd and minSigLevelToDrop. The maxSigLevelToAdd specifies a significance level below which a variable can be considered for inclusion in the model; the minSigLevelToDrop specifies a significance level above which a variable currently included can be considered for dropping. By default, for method="stepwise", both the levels are set to 0.15. We can tighten the selection criterion to the 0.10 level as follows:

airlineStepModelSigLevel.10 <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime,
    data = trainingDataFile, variableSelection =
        rxStepControl( method = "stepwise", scope = ~ DayOfWeek +
            CRSDepTime + CRSElapsedTime +Distance + TaxiIn + TaxiOut,
            stepCriterion = "SigLevel",
            maxSigLevelToAdd=.10, minSigLevelToDrop=.10))

Plotting Model Coefficients

By default, the values of the parameters at each step of the stepwise selection are not preserved. Using an additional argument, keepStepCoefs, in your rxStepControl statement saves the values of the coefficients from each step of the regression. This coefficient data can then be plotted using another function, rxStepPlot.

Consider the stepwise linear regression on the iris data from section 8.8.3:

#  
# Plottings Model Coefficients at Each Step
#  
form <- Sepal.Length ~ Sepal.Width + Petal.Length
scope <- list(
    lower = ~ Sepal.Width,
    upper = ~ Sepal.Width + Petal.Length + Petal.Width * Species)

varsel <- rxStepControl(method = "stepwise", scope = scope, keepStepCoefs=TRUE)
rxlm.step <- rxLinMod(form, data = iris, variableSelection = varsel,
    verbose = 1, dropMain = FALSE, coefLabelStyle = "R")

Notice the addition of the argument keepStepCoefs = TRUE to the rxStepContol call. This produces an extra piece of output in the rxLinMod object, a dataframe containing the values of the coefficients at each step of the regression. This dataframe, stepCoefs, can be accessed as follows:

rxlm.step$stepCoefs
                               0         1          2          3
  (Intercept)          2.2491402 0.9962913  1.1477685  0.7228228
  Sepal.Width          0.5955247 0.4322172  0.4958889  0.5050955
  Petal.Length         0.4719200 0.7756295  0.8292439  0.8702840
  Petal.Width          0.0000000 0.0000000 -0.3151552 -0.2313888
  Species1             0.0000000 1.3940979  1.0234978  1.2715542
  Species2             0.0000000 0.4382856  0.2999359  1.0781752
  Species3             0.0000000        NA         NA         NA
  Petal.Width:Species1 0.0000000 0.0000000  0.0000000  0.2630978
  Petal.Width:Species2 0.0000000 0.0000000  0.0000000 -0.5012827
  Petal.Width:Species3 0.0000000 0.0000000  0.0000000         NA

Trying to glean patterns and information from a table can be difficult. So we’ve added another function, rxStepPlot, which allows the user to plot the parameter values at each step. Using the iris model object, we plot the coefficients:

rxStepPlot(rxlm.step)

From this plot, we can tell when a variable enters the model by noting the step when it becomes non-zero. Lines are labeled with the numbers on the right axis to indicate the parameter. The numbers correspond to the order they appear in the data frame stepCoef. You’ll notice that the 7th and 10th parameters don’t show up in this plot because the species3 parameter is the reference category for species.

The function rxStepPlot is easily customized by using additional graphical parameters from the matplot function from base R. In the following example, we update our original call to rxStepPlot to demonstrate how to specify the colors used for each line, adjust the line width, and add a proper title to the plot:

colorSelection <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7")
rxStepPlot(rxlm.step, col = colorSelection, lwd = 2,
    main = "Step Plot – Iris Coefficients")

By default, the rxStepPlot function uses seven line colors. If the number of parameters exceeds the number of colors, they are reused in the same order. However, the line types are set to vary from 1 to 5, so lines that have the same color may differ in line type. The line types can also be specified using the lty argument in the rxStepPlot call.

Fixed-Effects Models

Fixed-effects models are commonly associated with studies in which multiple observations are recorded for each test subject, for example, yearly observations of median housing price by city, or measurements of tensile strength from samples of steel rods by batch. To fit such a model with rxLinMod, include a factor variable specifying the subject (the cities, or the batch identifier) as the first predictor, and specify cube=TRUE to use a partitioned inverse and omit the intercept term.

For example, the MASS library contains the data set petrol, which consists of measurements of the yield of a certain refining process with possible predictors including specific gravity, vapor pressure, ASTM 10% point, and volatility measured as the ASTM endpoint for 10 samples of crude oil. Following Venables and Ripley , we first scale the numeric predictors, then fit the fixed-effects model:

#  Fixed-Effects Models

library(MASS)
Petrol <- petrol
Petrol[,2:5] <- scale(Petrol[,2:5], scale=F)
rxLinMod(Y ~ No + EP, data=Petrol,cube=TRUE)

  Call:
  rxLinMod(formula = Y ~ No + EP, data = Petrol, cube = TRUE)

  Cube Linear Regression Results for: Y ~ No + EP
  Data: Petrol
  Dependent variable(s): Y
  Total independent variables: 11
  Number of valid observations: 32
  Number of missing observations: 0

  Coefficients:
                Y
  No=A 32.5493917
  No=B 24.2746407
  No=C 27.7820456
  No=D 21.1541642
  No=E 21.5191269
  No=F 20.4355218
  No=G 15.0359067
  No=H 13.0630467
  No=I  9.8053871
  No=J  4.4360767
  EP    0.1587296

Least Squares Dummy Variable (LSDV) Models

RevoScaleR is capable of estimating huge models where fixed effects are estimated by dummy variables, that is, binary variables set to 1 or TRUE if the observation is in a particular category. Creation of these dummy variables is often accomplished by interacting two or more factor variables using “:” in the formula. If the first term in an rxLinMod (or rxLogit) model is purely categorical and the “cube” argument is set to TRUE, the estimation uses a partitioned inverse to save on computation time and memory.

A Quick Review of Interacting Factors#

First, let’s do a quick, cautionary review of interacting factor variables by experimenting with a small made-up data set.

#  Least Squares Dummy Variable (LSDV) Models
#   A Quick Review of Interacting Factors

set.seed(50)
income <- rep(c(1000,1500,2500,4000), each=5) + 100*rnorm(20)
region <- rep(c("Rural","Urban"), each=10)
sex <- rep(c("Female", "Male"), each=5)
sex <- c(sex,sex)
myData <- data.frame(income, region, sex)

The data set has 20 observations with three variables: a numeric variable income and two factor variables representing region and sex. There are three easy ways to compute the within group means of every combination of age and region using RevoScaleR. First, we can use rxSummary:

rxSummary(income~region:sex, data=myData)

which includes in its output:

 Category                            region sex    Means     StdDev   Min      
 income for region=Rural, sex=Female Rural  Female  970.7522 98.01983  827.2396
 income for region=Urban, sex=Female Urban  Female 2501.8303 47.10861 2450.1364
 income for region=Rural, sex=Male   Rural  Male   1480.4378 92.29320 1355.4250
 income for region=Urban, sex=Male   Urban  Male   3944.5127 40.82421 3883.3983

Second, we could use rxCube:

rxCube(income~region:sex, data=myData)

which includes in its output:

 region    sex    income Counts
1 Rural Female  970.7522      5
2 Urban Female 2501.8303      5
3 Rural   Male 1480.4378      5
4 Urban   Male 3944.5127      5

Or, we can use rxLinMod with cube=TRUE. The intercept is automatically omitted, and four dummy variables are created from the two factors: one for each combination of region and sex. The coefficients are simply the within group means:

summary(rxLinMod(income~region:sex, cube=TRUE, data=myData))

which includes in its output:

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)     | Counts
region=Rural, sex=Female   970.75      33.18   29.26 2.66e-15 *** |      5
region=Urban, sex=Female  2501.83      33.18   75.41 2.22e-16 *** |      5
region=Rural, sex=Male    1480.44      33.18   44.62 2.22e-16 *** |      5
region=Urban, sex=Male    3944.51      33.18  118.90 2.22e-16 *** |      5

The same model could be estimated using: lm(income~region:sex + 0, data=myData). Below we refer to these within group means as MeanIncRuralFemale, MeanIncUrbanFemale, MeanIncRuralMale, and MeanIncUrbanMale.

If we add an intercept, we encounter perfect multicollinearity and one of the coefficients will be dropped. The intercept is then the mean of a reference group and the other coefficients represent the differences or contrasts between the within group means and the reference group. For example,

lm(income~region:sex, data=myData)

produces:

Coefficients:
        (Intercept)  region:Rural:sexFemale  regionUrban:sexFemale  
               3945                -2974                -1443  
  regionRural:sexMale    regionUrban:sexMale  
              -2464                   NA  

We can see that the dummy variable for urban males was dropped; urban males are the reference group and the Intercept is equal to MeanIncUrbanMale. The other coefficients represent contrasts from the reference group, so for example, regionRural:sexFemale is equal to MeanIncRuralFemale – MeanIncUrbanMale.

Another variation is to use “*”in the formula for factor crossing. Using a\b* is equivalent to using a + b + a:b. For example:

lm(income~region*sex, data = myData)

which results in:

Coefficients:
      (Intercept)          regionUrban            sexMale  regionUrban:sexMale  
            970.8             1531.1              509.7              933.0

The dropping of coefficients in rxLinMod can be controlled to obtain the same results:

rxLinMod(income~region*sex, data = myData, dropFirst = TRUE, dropMain = FALSE)

Coefficients using this model can be more difficult to interpret, and in fact are highly dependent on the order of the factor levels. In this case, we see the following relationship between the estimated coefficients and the within group means

(Intercept) MeanIncRuralFemale
regionUrban MeanIncUrbanFemale – MeanIncRuralFemale
sexMale MeanIncRuralMale – MeanIncRuralFemale
regionUrban:sexMale MeanIncUrbanMale – MeanIncUrbanFemale - MeanIncRuralMale + MeanIncRuralFemale

If we set up our data slightly differently, we get different results. Let’s use the same income data but a different naming convention for the factors:

region <- rep(c("Rural","Urban"), each=10)
sex <- rep(c("Woman", "Man"), each=5)
sex <- c(sex,sex)
myData1 <- data.frame(income, region, sex)

Using the same model, lm(income~region\sex, data=myData1)*, results in:

Coefficients:
           (Intercept)           regionUrban                sexWoman  
                1480.4                  2464.1                  -509.7  
regionUrban:sexWoman  
                -933.0  

With this superficial modification to the data, the coefficient for the dummy variable for regionUrban has jumped from $1531 to $2464. This is due to the change in reference group; in this model the Urban coefficient represents the difference between mean urban and rural male income, while in the previous model it was the difference between mean urban and rural female income. That is, the relationship between the within group means and the new coefficients are:

(Intercept) MeanIncRuralMale
regionUrban MeanIncUrbanMale – MeanIncRuralMale
sexWoman MeanIncRuralFemale – MeanIncRuralMale
regionUrban:sexMale MeanIncUrbanFemale – MeanIncUrbanMale - MeanIncRuralFemale + MeanIncRuralMale

Omitting the intercept provides yet another combination of results. For example, using rxLinMod with the original factor labeling:

rxLinMod(income~region*sex, data=myData, cube=TRUE)

results in:

Call:
rxLinMod(formula = income ~ region * sex, data = myData, cube = TRUE)

Cube Linear Regression Results for: income ~ region * sex
Data: myData
Dependent variable(s): income
Total independent variables: 8 (Including number dropped: 4)
Number of valid observations: 20
Number of missing observations: 0

Coefficients:
                           income
region=Rural            1480.4378
region=Urban            3944.5127
sex=Female             -1442.6824
sex=Male                  Dropped
region=Rural, sex=Female 932.9968
region=Urban, sex=Female  Dropped
region=Rural, sex=Male    Dropped
region=Urban, sex=Male    Dropped
region=Rural MeanIncRuralMale
region=Urban MeanIncUrbanMale
Sex=Female MeanIncUrbanFemale – MeanIncUrbanMale
region=Rural, sex=Female (MeanIncRuralFemale – MeanIncRuralMale) - (MeanIncUrbanFemale – MeanIncUrbanMale)

With large data sets it is common to estimate many interaction terms, and if some categories have zero counts, it may not even be obvious what the reference group is. Also note that setting cube=TRUE in the preceding model is of limited use: only the first term from the expanded expression (in this case region) is estimated using a partitioned inverse.

Using Dummy Variables in rxLinMod: Letting the Data Speak Example 2

In previous articles, we looked at the CensusWorkers.xdf data set and examined the relationship between wage income and age. Now let’s add another variable, and examine the relationship between wage income and sex and age.

We can start with a simple dummy variable model, computing the mean wage income by sex:

#   Using Dummy Variables in rxLinMod

censusWorkers <- file.path(rxGetOption("sampleDataDir"), "CensusWorkers.xdf")
rxLinMod(incwage~sex, data=censusWorkers, pweights="perwt", cube=TRUE)

which computes:

Call:
rxLinMod(formula = incwage ~ sex, data = censusWorkers, pweights = "perwt",
    cube = TRUE)

Cube Linear Regression Results for: incwage ~ sex
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
Probability weights: perwt
Dependent variable(s): incwage
Total independent variables: 2
Number of valid observations: 351121
Number of missing observations: 0

Coefficients:
            incwage
sex=Male   43472.71
sex=Female 26721.09

Similarly, we could look at a simple linear relationship between wage income and age:

linMod1 <- rxLinMod(incwage~age, data=censusWorkers, pweights="perwt")
summary(linMod1)

resulting in:

Call:
rxLinMod(formula = incwage ~ age, data = censusWorkers, pweights = "perwt")

Linear Regression Results for: incwage ~ age
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
Probability weights: perwt
Dependent variable(s): incwage
Total independent variables: 2
Number of valid observations: 351121
Number of missing observations: 0

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12802.980    247.963   51.63 2.22e-16 ***
age           572.980      5.947   96.35 2.22e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 180800 on 351119 degrees of freedom
Multiple R-squared: 0.02576
Adjusted R-squared: 0.02576
F-statistic:  9284 on 1 and 351119 DF,  p-value: < 2.2e-16
Condition number: 1

Computing the two end points on the regression line, we can plot it:

age <- c(20,65)
coefLinMod1 <- coef(linMod1)
incwage_Pred <- coefLinMod1[1] + age*coefLinMod1[2]
plotData1 <- data.frame(age, incwage_Pred)
rxLinePlot(incwage_Pred~age, data=plotData1)

The next typical step is to combine the two approaches by estimating separate intercepts for males and females:

linMod2 <- rxLinMod(incwage~sex+age, data = censusWorkers, pweights = "perwt",
cube=TRUE)
summary(linMod2)

which results in:

Call:
rxLinMod(formula = incwage ~ sex + age, data = censusWorkers,
    pweights = "perwt", cube = TRUE)

Cube Linear Regression Results for: incwage ~ sex + age
File name:
C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
Probability weights: perwt
Dependent variable(s): incwage
Total independent variables: 3
Number of valid observations: 351121
Number of missing observations: 0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)     |  Counts
sex=Male   20479.178    250.033   81.91 2.22e-16 *** | 3866542
sex=Female  3723.067    252.968   14.72 2.22e-16 *** | 3276744
age          573.232      5.816   98.56 2.22e-16 *** |        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 176800 on 351118 degrees of freedom
Multiple R-squared: 0.06804 (as if intercept included)
Adjusted R-squared: 0.06804
F-statistic: 1.282e+04 on 2 and 351118 DF,  p-value: < 2.2e-16
Condition number: 1

We create a small sample data set with the same variables we use in censusWorkers, but with only four observations representing the high and low value of age for both sexes. Using the rxPredict function, the predicted values for each one of these sample observations is computed, which we then plot:

age <- c(20,65,20,65)
sex <- factor(rep(c(1, 2), each=2), labels=c("Male", "Female"))
perwt <- rep(1, times=4)
incwage <- rep(0, times=4)
plotData2 <- data.frame(age, sex, perwt, incwage)
plotData2p <- rxPredict(linMod2, data=plotData2, outData=plotData2)
rxLinePlot(incwage_Pred~age, groups=sex, data=plotData2p)

These types of models are often relaxed further by allowing both the slope and intercept to vary by group:

linMod3 <- rxLinMod(incwage~sex+sex:age, data = censusWorkers,
    pweights = "perwt", cube=TRUE)
summary(linMod3)
Call:
rxLinMod(formula = incwage ~ sex + sex:age, data = censusWorkers,
    pweights = "perwt", cube = TRUE)

Cube Linear Regression Results for: incwage ~ sex + sex:age
File name:
    C:\Program Files\Microsoft\MRO-for-RRE\8.0\R-3.2.2\library\RevoScaleR\SampleData\CensusWorkers.xdf
Probability weights: perwt
Dependent variable(s): incwage
Total independent variables: 4
Number of valid observations: 351121
Number of missing observations: 0

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)     |  Counts
sex=Male           10783.449    328.871   32.79 2.22e-16 *** | 3866542
sex=Female         15131.422    356.806   42.41 2.22e-16 *** | 3276744
age for sex=Male     814.948      7.888  103.31 2.22e-16 *** |        
age for sex=Female   288.876      8.556   33.76 2.22e-16 *** |        
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 176300 on 351117 degrees of freedom
Multiple R-squared: 0.07343 (as if intercept included)
Adjusted R-squared: 0.07343
F-statistic:  9276 on 3 and 351117 DF,  p-value: < 2.2e-16
Condition number: 1.1764

Again getting predictions and plotting:

plotData3p <- rxPredict(linMod3, data=plotData2, outData=plotData2)
rxLinePlot(incwage_Pred~age, groups=sex, data=plotData3p)

We could continue the process, experimenting with functional forms for age. But, since we have many observations (and therefore many degrees of freedom), we can take advantage of the F() function available in revoScaleR to let the data speak for itself. The F() function creates a factor variable from a numeric variable “on-the-fly”, creating a level for every integer value. This allows us to compute and observe the shape of the functional form using a purely dummy variable model:

linMod4 <- rxLinMod(incwage~sex:F(age), data=censusWorkers, pweights="perwt",
cube=TRUE)

This model estimated a total of 92 coefficients, all for dummy variables representing every age in the range of 20 to 65 for each sex. To visually examine the coefficients we could add observations to our plotData data frame and use rxPredict to compute predicted values for each of the 92 groups represented, but since the model contains only dummy variables in an initial cube term, we can instead use the “counts” data frame returned with the rxLinMod object:

plotData4 <- linMod4$countDF
# Convert the age factor variable back to an integer
plotData4$age <- as.integer(levels(plotData4$F.age.))[plotData4$F.age.]
rxLinePlot(incwage~age, groups=sex, data=plotData4)

Intercept-Only Models

You may have seen intercept-only models fitted with R’s lm function, where the model formula is of the form response ~ 1. In RevoScaleR these models should be fitted using rxSummary, because the intercept-only model simply returns the mean of the response. For example:

airlineDF <- rxDataStep(inData =
    file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf"))
lm(ArrDelay ~ 1, data = airlineDF)
  Call:
  lm(formula = ArrDelay ~ 1, data = airlineDF)

  Coefficients:
  (Intercept)  
        11.32  

rxSummary(~ ArrDelay, data = airlineDF)
  Call:
  rxSummary(formula = ~ArrDelay, data = airlineDF)

  Summary Statistics Results for: ~ArrDelay
  Data: airlineDF
  Number of valid observations: 6e+05
  Number of missing observations: 0

     Name     Mean     StdDev   Min Max  ValidObs MissingObs
     ArrDelay 11.31794 40.68854 -86 1490 582628   17372