rxFormula: Formula Syntax for RevoScaleR Analysis Functions
Highlights of the similarities and differences in formulas between RevoScaleR and standard R functions.
The formula syntax used by the RevoScaleR analysis functions is similar, but not identical, to regular R formula syntax. The most important differences are:
With the exception of rxSummary, dot (
explanatory variable expansion is not supported.
* Multiple column producing in-line variable transformations, e.g. poly and bs, are not supported.
The original order of the explanatory variables are maintained, i.e.
the main effects are not forced to precede the interaction terms. (See
keep.order = TRUE setting in terms.formula for more
A formula typically consists of a response, which in most
RevoScaleR functions can be a single variable or multiple variables
combined using cbind, the
"~" operator, and one or
more predictors,typically separated by the
The rxSummary function typically requires a formula with no
Interactions are indicated using the
":" operator. The interaction of
two categorical variables results in a categorical variable containing the
full set of combinations of the two categories and adds a coefficient to the
model for each category. The interaction of two continuous variables is the
same as the multiplication of the two variables. The interaction of a
continuous and a categorical variable adds a coefficient for the continuous
variable for each level of the categorical variable. The asterisk operator
"*" between categorical variables adds all subsets of interactions of
the variables to the model.
In RevoScaleR, predictors must be single-column variables.
RevoScaleR formulas support two formula functions for managing categorical variables:
F(x, low, high, exclude)creates a categorical variable out of continuous variable
x. Additional arguments
excludecan be included to specify the value of the lowest category, the highest category, and how to handle values outside the specified range. For each integer in the range from
highinclusive, RevoScaleR creates a level and assigns values greater than or equal to an integer n, but less than n+1, to n's level. If
xis already a factor,
F(x, low, high, exclude)can be used to limit the range of levels used; in this case
highrepresent the indexes of the factor levels, and must be integers in the range from 1 to the number of levels.
N(x)creates a continuous variable from categorical variable
x. Note, however, that the value of this function is equivalent to the factor codes, and has no relation to any numeric values within the levels of the function. For this, use the construction
Microsoft Technical Support
# These two lines are set up for the examples sampleDataDir <- rxGetOption("sampleDataDir") censusWorkers <- file.path(sampleDataDir, "CensusWorkers.xdf") rxSummary(~ F(age) + sex, data = censusWorkers) rxSummary(~ F(age, low = 30, high = 45, exclude = FALSE), data = censusWorkers) rxCube(incwage ~ F(age) : sex, data = censusWorkers)