rxFactors: Factor Variable Recoding
Recodes a factor variable by mapping one set of factor levels and indices to a new set. Can also be used to convert a non-factor variable into a factor.
rxFactors(inData, factorInfo, sortLevels = FALSE, otherLevel = NULL, outFile = NULL, varsToKeep = NULL, varsToDrop = NULL, overwrite = FALSE, maxRowsByCols = NULL, blocksPerRead = rxGetOption("blocksPerRead"), reportProgress = rxGetOption("reportProgress"), verbose = 0, xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
inData - either an RxXdfData object, a character string specifying the .xdf file, or a data frame.
factorInfo - character vector of variable names, a list of named variable information lists, or empty or NULL. If sortLevels is set to TRUE, the levels of the variables named in the character vector will all be sorted; if sortLevels is TRUE and factorInfo is empty or NULL, all factors will be sorted. If a factorInfo list is provided, each variable information list contains one or more of the named elements given below.
Currently available properties for a column information list are:
levels - optional vector containing values to match when converting non-factor data to factor levels. If levels = NULL, all of the unique values in the data are converted to levels, in the order encountered. The user can override this behavior and sort the resulting levels alphabetically by setting sortLevels = TRUE. The user may also specify a subset of the data to convert to levels. In this case, if otherLevel = NULL, all data values not found in the levels subset will be converted to missing (NA) values. For example, if a variable x is comprised of integer data, then factorInfo = list(x = list(levels = 2:4, otherLevel = NULL)) converts x into a factor with levels "2", "3", and "4", and every value outside that subset becomes NA. Alternatively, the user may wish to place all of those unspecified values into a single category, say "other". In that case, use otherLevel = "other" along with the subset levels specification. Note that the levels vector may be of any type, e.g., 'integer', 'numeric', 'character'. However, behind the scenes, it is always converted to type 'character', as are the data values being converted. The resulting strings are matched with those of the data to populate the categories.
otherLevel - character string defining the level to assign to all factor values that are not listed in the newLevels field, if newLevels is specified. If otherLevel = NULL, the default, the factor levels that are not listed in newLevels will be left unchanged and in their original order. If specified, the value set here overrides the default argument of the same name in the primary argument list.
sortLevels - logical scalar. If TRUE, the resulting levels will be sorted alphabetically. If the input variable is not a factor and levels are not specified, this will be ignored and levels will be in the order in which they are encountered.
varName - character string defining the name of an existing data variable to recode. If this field is left unspecified, then the name of the corresponding list element in factorInfo will be used. For example, all of the following are acceptable and equivalent factorInfo specifications for alphabetically sorting the levels of an existing factor variable named "myFacVar":
factorInfo = list( myFacVar = list( sortLevels = TRUE ) )
factorInfo = list( list( sortLevels = TRUE, varName = "myFacVar" ) )
factorInfo = list( myFacVar = list( sortLevels = TRUE, varName = "myFacVar" ) )
However, if you wish to rename a variable after conversion (keeping the old variable intact), there is only one acceptable format: the variable to be recoded must appear in the varName field, while the new variable name for the converted data must appear as the name of the corresponding list element. For example, to sort the levels of an existing factor variable "myFacVar" and store the result in a new variable "myNewFacVar", you would issue:
factorInfo = list( myNewFacVar = list( sortLevels = TRUE, varName = "myFacVar" ) )
newLevels - a character vector or list, possibly with named elements, used to rename the levels of a factor. While levels provides a means of filtering the data the user wishes to import and convert to factor levels, newLevels is used to alter converted or existing levels by renaming, collapsing, or sorting them. See the Examples section below for typical use cases of newLevels.
description - character string defining a description for the recoded variable.
sortLevels - the default value to use for the sortLevels field in the factorInfo list.
otherLevel - the default value to use for the otherLevel field in the factorInfo list.
outFile - either an RxXdfData object, a character string specifying the .xdf file, or NULL. If outFile = NULL, a data frame is returned. When writing to HDFS, outFile must be an RxXdfData object representing a new composite XDF.
varsToKeep - character vector of variable names to include in the data file. If NULL, the argument is ignored. Cannot be used with varsToDrop.
varsToDrop - character vector of variable names to exclude from the data file. If NULL, the argument is ignored. Cannot be used with varsToKeep.
overwrite - logical value. If TRUE, an existing outFile will be overwritten. Ignored if a data frame is returned.
maxRowsByCols - this argument is used only when inData refers to an .xdf file (a character string defining a path to an existing .xdf file or an RxXdfData object) and the output is to be returned as a data frame (outFile = NULL). In this case, behind the scenes, the output is written to a temporary .xdf file and rxDataStep is subsequently called to convert the output into a data frame. The maxRowsByCols argument is passed directly to the rxDataStep call, giving the user some control over the conversion. See rxDataStep for more details on the maxRowsByCols argument.
blocksPerRead - number of blocks to read for each chunk of data read from the data source. If inData and outFile are the same file, blocksPerRead must be 1.
reportProgress - integer value with options:
0: no progress is reported.
1: the number of processed rows is printed and updated.
2: rows processed and timings are reported.
3: rows processed and all timings are reported.
verbose - integer value. If 0, no additional output is printed. If 1, additional summary information is printed.
xdfCompressionLevel - integer in the range of -1 to 9. The higher the value, the greater the amount of compression, resulting in smaller files but a longer time to create them. If xdfCompressionLevel is set to 0, there will be no compression and files will be compatible with the 6.0 release of Revolution R Enterprise. If set to -1, a default level of compression will be used.
... - additional arguments to be passed directly to the Microsoft R Services Compute Engine.
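The levels and otherLevel fields described under factorInfo can be illustrated with a rough base R analogue. This is only a sketch under stated assumptions: it uses factor() rather than rxFactors, the input vector x is hypothetical, and placing "other" after the matched levels is an assumption rather than documented rxFactors behavior.

```r
# Hypothetical integer data; not taken from the rxFactors documentation.
x <- c(1L, 2L, 3L, 4L, 5L)

# Both the requested levels and the data are compared as character strings.
keep  <- as.character(2:4)
xChar <- as.character(x)

# Analogue of otherLevel = NULL: values outside the subset become NA.
fNA <- factor(xChar, levels = keep)

# Analogue of otherLevel = "other": unmatched values share one category.
# Appending "other" as the last level is an assumption made for this sketch.
xOther <- ifelse(xChar %in% keep, xChar, "other")
fOther <- factor(xOther, levels = c(keep, "other"))

print(fNA)
print(fOther)
```

The string comparison mirrors the note that both the levels vector and the data values are converted to type 'character' before matching.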
Factors are variables that represent categories. An example is a variable named "state" whose values are levels such as "Alabama" and "Washington". There are two parts to a factor variable:
1. a vector of N (number of observations) integer indexes with values in the range of 1:K, where K is the number of categories.
2. a vector of K strings (characters) that are used when the vector is displayed and in some other situations.
For instance, when state levels are alphabetical, all observations for which state == "Alabama" will have the index 1, state == "Washington" values correspond to index 47, and so on.
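The two parts of a factor can be inspected directly in base R. The following is a small sketch, assuming the built-in state.name vector (from the 'datasets' package, already in alphabetical order) as the set of levels; the observations are hypothetical.

```r
# state.name lists the 50 US state names in alphabetical order.
state <- factor(c("Washington", "Alabama", "Alabama", "Wyoming"),
                levels = state.name)

# Part 1: one integer index per observation, each in the range 1:K.
idx <- as.integer(state)

# Part 2: the K level strings used when the factor is displayed.
lev <- levels(state)

print(idx)
print(lev[idx])  # indexing the levels by idx recovers the original labels
```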
Recoding a factor means changing from one set of indices to another. For instance, if the levels for "state" are currently arranged in the order in which they were encountered when importing a .csv file, and it is desired to put them in alphabetical order, then it is necessary to change the index for every observation.
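As a minimal base R sketch of what recoding involves (this does not call rxFactors, and the observations are hypothetical), the example below starts with levels in order of first appearance and then re-expresses the same data against alphabetically sorted levels. The labels stay the same while the integer indices change.

```r
# Levels in order of first appearance, as they might be after an import.
obs <- c("Texas", "Alabama", "Texas", "Ohio")
encountered <- factor(obs, levels = unique(obs))
oldIdx <- as.integer(encountered)

# Recode against alphabetically sorted levels; labels are unchanged.
sorted <- factor(encountered, levels = sort(levels(encountered)))
newIdx <- as.integer(sorted)

print(rbind(oldIdx, newIdx))
print(identical(as.character(encountered), as.character(sorted)))
```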
If numeric data is converted to a factor, a maximum precision of 6 is used. So, for example, the values 7.123456 and 7.12346 would be placed in the same category.
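To see why those two values collide, the sketch below formats them with six significant digits in base R. Treating "precision of 6" as six significant digits is an assumption that is consistent with the example values but is not confirmed by the text, and sprintf() is only a stand-in for whatever conversion rxFactors performs internally.

```r
# The two values named in the text above.
vals <- c(7.123456, 7.12346)

# Assumption: "precision of 6" behaves like six significant digits.
labels6 <- sprintf("%.6g", vals)

print(labels6)
print(length(unique(labels6)))  # number of distinct categories
```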
To recode a categorical or factor variable into a continuous variable within a formula, use N(). To recode a continuous variable into a categorical or factor variable within a formula, use F(). See rxFormula.
To rename the levels of a factor variable in an .xdf file (without changing the levels themselves), use rxSetVarInfoXdf.
If outFile is NULL, then a data frame is returned. Otherwise, the results are written to the specified outFile file and an RxXdfData object corresponding to the output file is returned invisibly.
Microsoft Technical Support
###
# Example 1: Recoding levels in alphabetical order
###

# Use the 'warpbreaks' data frame found in the 'datasets' package
# Note that the 'tension' factor variable has levels that are not
# alphabetically ordered.
rxGetVarInfo( warpbreaks )

# Reorder all factor levels that are not in alphabetical order
recodedDF1 <- rxFactors(inData = warpbreaks, sortLevels = TRUE)
rxGetVarInfo( recodedDF1 )

# Specify that only 'tension' levels should be reordered alphabetically
recodedDF2 <- rxFactors(inData = warpbreaks, sortLevels = TRUE,
                        factorInfo = c("tension"))
rxGetVarInfo( recodedDF2 )

# Specify that only 'tension' levels should be reordered alphabetically using a list
recodedDF3 <- rxFactors(inData = warpbreaks,
                        factorInfo = list(tension = list(sortLevels = TRUE)))
rxGetVarInfo( recodedDF3 )

# Write the data frame to an .xdf file and perform similar recoding,
# but write the recoded factor to a new variable. Compare the
# original with the recoded factor.
inXDF <- file.path(tempdir(), "warpbreaks.xdf")
outXDF <- file.path(tempdir(), "warpbreaksRecoded.xdf")
rxDataStep(warpbreaks, outFile = inXDF, overwrite = TRUE)
outDS <- rxFactors(inData = inXDF, outFile = outXDF, overwrite = TRUE,
                   factorInfo = list(recodedTension = list(sortLevels = TRUE,
                                                           varName = "tension")))
DF <- rxDataStep(outDS)
rxGetVarInfo( DF )

# clean up
if (file.exists(inXDF)) unlink(inXDF)
if (file.exists(outXDF)) unlink(outXDF)

###
# Example 2: Recoding levels and indexes, saving recoding to a new factor variable
###

# Create an .xdf file with a factor variable named 'sex' with levels 'M' and 'F'
set.seed(100)
sex <- factor(sample(c("M","F"), size = 10, replace = TRUE), levels = c("M", "F"))
DF <- data.frame(sex = sex, score = rnorm(10))
DF[["sex"]]
XDF <- file.path(tempdir(), "sex.xdf")
XDF2 <- file.path(tempdir(), "newSex.xdf")
rxDataStep(DF, outFile = XDF, overwrite = TRUE)

# Assume that we change our minds and now wish to
# rename the levels to "Female" and "Male".
# Let us do the recoding and store the result into a new
# variable named "Gender", keeping the old variable in place.
outDS <- rxFactors(inData = XDF, outFile = XDF2, overwrite = TRUE,
                   factorInfo = list(Gender = list(newLevels = c(Female = "F", Male = "M"),
                                                   varName = "sex")))
newDF <- rxDataStep(outDS)
print(newDF)

# clean up
if (file.exists(XDF)) unlink(XDF)
if (file.exists(XDF2)) unlink(XDF2)

###
# Example 3: Combining subsets of factor levels into single levels
###

# Create a data set that contains a factor variable 'Month'
# Note that the levels are not in alphabetical order.
set.seed(100)
DF <- data.frame(Month = factor(sample(month.name, size = 20, replace = TRUE),
                                levels = rev(month.name)))

# Recode the months into quarters and store result into new variable named "Quarter"
recodedDF <- rxFactors(inData = DF,
                       factorInfo = list(Quarter = list(newLevels = list(Q1 = month.name[1:3],
                                                                         Q2 = month.name[4:6],
                                                                         Q3 = month.name[7:9],
                                                                         Q4 = month.name[10:12]),
                                                        varName = "Month")))
head(recodedDF)
recodedDF$Quarter

###
# Example 4: Coding and recoding combinations using a single factorInfo list
###

set.seed(100)
size <- 10
months <- factor(sample(month.name, size = size, replace = TRUE), levels = rev(month.name))
states <- factor(sample(state.name, size = size, replace = TRUE), levels = state.name)
animalFarm <- c("cow","horse","pig","goat","chicken", "dog", "cat")
animals <- factor(sample(animalFarm, size = size, replace = TRUE), levels = animalFarm)
values <- sample.int(100, size = size, replace = TRUE)
dblValues <- c(1, 2.1, 3.12, 4.123, 5.1234, 6.12345, 7.123456, 7.12346, 81234.56789, 91234567.8)
DF <- data.frame(Month = months, State = states, Animal = animals, VarInt = values,
                 VarDbl = dblValues, NotUsed1 = seq(size), NotUsed2 = rev(seq(size)))

factorInfo <- list(
    # Convert months to quarters
    Quarter = list(newLevels = list(Q1 = month.name[1:3], Q2 = month.name[4:6],
                                    Q3 = month.name[7:9], Q4 = month.name[10:12]),
                   varName = "Month"),

    # Sort animal levels
    Animal = list(sortLevels = TRUE),

    # Convert integer data to factor and do not sort levels
    VarIntFac = list(varName = "VarInt", sortLevels = FALSE),

    # Convert double data to factor; it will use a precision up to 6
    VarDblFac = list(varName = "VarDbl"),

    # In-place arbitrary grouping of state names using indexMap
    StateSide = list(newLevels = c(LeftState = "1", RightState = "2"),
                     indexMap = c(rep(1, 25), rep(2, 25)),
                     varName = "State")
)
rxFactors(DF, factorInfo)

###
# Example 5: Using 'newLevels' to rename, reorder, or collapse existing factor levels.
# All of these examples make use of the iris data set, which contains levels
# "setosa", "versicolor", and "virginica", in that order.
###

# Renaming factor levels:
#
# "setosa" to "Seto"
# "versicolor" to "Vers"
# "virginica" to "Virg"
newLevels <- list(Seto = "setosa", Vers = "versicolor", Virg = "virginica")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species

# Reordering:
newLevels <- c("versicolor", "setosa", "virginica")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species

# Collapsing: order does matter here, so the resulting order of the levels will
# be "V" then "S". The 'sortLevels' argument is a quick means of alphabetically
# sorting the resultant level names.
newLevels <- list(V = "setosa", S = c("versicolor", "virginica"))
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels,
                                                 sortLevels = TRUE)))$Species

# Subset collapsing with renaming: accomplish with the use of 'otherLevel'
newLevels <- list(S = "setosa")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels,
                                                 otherLevel = "otherSpecies")))$Species

# Superset specification: adding new species for a future study
newLevels <- c("setosa", "versicolor", "virginica", "pumila", "narbuti", "camillae")
rxFactors(iris, factorInfo = list(Species = list(newLevels = newLevels)))$Species