RevoScaleR package

The RevoScaleR library provides a set of over one hundred portable, scalable, and distributable data analysis R functions that run on the RevoScaleR interpreter, which is built on open source R and extended to accommodate high performance computing (HPC) and high performance analytics (HPA).

HPA algorithms include descriptive statistics, cross-tabulations, linear regression, covariance and correlation matrices, logistic regression, generalized linear models, k-means clustering, classification and regression trees, and decision forests. HPC functionality is enabled on Hadoop processing frameworks (Spark and MapReduce) for distributed execution of essentially any R function across cores and nodes, delivering the results back to the user.

Package details
Version: 9.2.1
Runs on: Machine Learning Server 9.2.1
Microsoft R Client (Windows and Linux)
Microsoft R Server 9.1 and earlier
SQL Server 2016 and later (Windows only)
Azure HDInsight
Azure Data Science Virtual Machines
Built on: R 3.3.x (included when you install a product that provides this package).

How to use RevoScaleR

The RevoScaleR library is installed with all Microsoft R products. You can use any R IDE to write R scripts that call RevoScaleR functions, but the script must run on a computer that has Microsoft R installed.

RevoScaleR is often preloaded into tools that integrate with R Server, which means you can call functions without having to load the library.

If the library is not loaded, you can load RevoScaleR from the command line by typing library(RevoScaleR).
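As a minimal sketch, a script can check whether RevoScaleR is already attached before loading it. The variable name `revo_loaded` and the fallback message are illustrative only; they assume the script might be opened on a machine without Microsoft R.

```r
# Attach RevoScaleR only when it is not already on the search path.
if (!"package:RevoScaleR" %in% search()) {
  if (requireNamespace("RevoScaleR", quietly = TRUE)) {
    library(RevoScaleR)
  } else {
    message("RevoScaleR is not installed; run this script where Microsoft R is available.")
  }
}

# TRUE when RevoScaleR functions can be called without a package prefix.
revo_loaded <- "package:RevoScaleR" %in% search()
```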

Some functions in RevoScaleR are specific to particular compute contexts.

Note

Some function names begin with rx and others with Rx. The Rx function name prefix is used to distinguish the class constructors such as data sources and compute contexts.

Functions by category

This section lists the functions by category to give you an idea of how each one is used. You can also use the table of contents to find functions in alphabetical order.

1-Data analysis functions


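Whenever you want to perform an analysis using RevoScaleR functions, you should specify three distinct pieces of information: the analysis to perform (the analysis function), where the computations take place (the compute context), and the data to use (the data source).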

Import and export functions

Function name Description
rxImport * Creates an .xdf file or data frame from a data source (e.g. text, SAS, SPSS data files, ODBC or Teradata connection, or data frame).
rxDataStep * Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame.
rxGetInfo * Retrieves summary information from a data source or data frame.
rxSetInfo * Sets a file description in an .xdf file or a description attribute in a data frame.
rxGetVarInfo Retrieves variable information from a data source or data frame.
rxSetVarInfo Modifies variable information in an .xdf file or data frame.
rxGetVarNames Retrieves variable names from a data source or data frame.
rxCreateColInfo Generates a colInfo list from a data source.
rxCompressXdf Compresses an existing .xdf file, or a directory of .xdf files.
RxXdfData Creates an efficient XDF data source object.
RxTextData Creates a comma-delimited text data source object.
RxSasData Creates a SAS data source object.
RxSpssData Creates an SPSS data source object.
RxOdbcData Creates an ODBC data source object.
RxTeradata Creates a Teradata data source object.
RxSqlServerData Creates a SQL Server data source object.
rxOpen Opens a data source for reading.
rxClose Closes a data source.
rxReadNext Read data from a source.
rxSetFileSystem Specify a file system type for data for import.
rxGetFileSystem Retrieve the current file system type.
rxHdfsFileSystem Creates an HDFS file system object.
rxNativeFileSystem Creates a native file system object.

* Signifies the most popular functions in this category.
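The import functions above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed workflow: the file names, the use of the built-in `mtcars` data set, and the `has_revo` guard (which skips the example where RevoScaleR is not installed) are all choices made for the example.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Write a small comma-delimited file to import (mtcars ships with base R).
  csv_file <- file.path(tempdir(), "mtcars.csv")
  xdf_file <- file.path(tempdir(), "mtcars.xdf")
  write.csv(mtcars, csv_file, row.names = FALSE)

  # Import the text file into an .xdf file, replacing any earlier copy.
  rxImport(inData = csv_file, outFile = xdf_file, overwrite = TRUE)

  # Summarize the new file: counts, variable information, and the first rows.
  print(rxGetInfo(xdf_file, getVarInfo = TRUE, numRows = 3))

  # List the variable names stored in the .xdf file.
  var_names <- rxGetVarNames(xdf_file)
}
```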

Data transformation functions

Function name Description
rxDataStep * Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame.
rxFactors * Recode a factor variable or convert non-factor variable into a factor in an .xdf file or data frame.
rxGetFuzzyDist Get fuzzy distances for a character vector.
rxGetFuzzyKeys Get fuzzy keys for a character vector.
rxSplit Splits an .xdf file or data frame into multiple .xdf files or data frames.
rxSort Multi-key sorting of the variables in an .xdf file or data frame.
rxMerge Merges two .xdf files or data frames using a variety of merge types.
rxExecuteSQLDDL SQL Server R Services only. Runs an arbitrary SQL DDL command.

* Signifies the most popular functions in this category.
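A short sketch of the two most common transformation steps, subsetting with rxDataStep and sorting with rxSort, might look like this. The data set (`mtcars`), the derived column `kpl`, and the `has_revo` guard are assumptions made for illustration.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Keep cars above 20 mpg and add a derived kilometers-per-liter column.
  # No outFile is given, so the result comes back as a data frame in memory.
  efficient <- rxDataStep(
    inData = mtcars,
    rowSelection = mpg > 20,
    transforms = list(kpl = mpg * 0.425)
  )

  # Sort the subset by two keys: cylinders ascending, then mpg descending.
  sorted <- rxSort(
    inData = efficient,
    sortByVars = c("cyl", "mpg"),
    decreasing = c(FALSE, TRUE)
  )
  print(head(sorted))
}
```

Passing an outFile to rxDataStep would write the result to an .xdf file instead, which is the safer choice when the output may not fit in memory.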

Basic graphing functions

Function name Description
rxHistogram Creates a histogram from data.
rxLinePlot Creates a line plot from data.
rxLorenz Computes a Lorenz curve which can be plotted.
rxRocCurve Computes and plots ROC curves from actual and predicted data.

Descriptive statistics and cross-tabulation

Function name Description
rxQuantile * Computes approximate quantiles for .xdf files and data frames without sorting.
rxSummary * Basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported.
rxCrossTabs * Formula-based cross-tabulation of data.
rxCube * Alternative formula-based cross-tabulation designed for efficient representation returning cube results. Writing output to an .xdf file is not supported.
rxMarginals Marginal summaries of cross-tabulations.
as.xtabs Converts cross tabulation results to an xtabs object.
rxChiSquaredTest Performs Chi-squared Test on xtabs object. Used with small data sets and does not chunk data.
rxFisherTest Performs Fisher's Exact Test on xtabs object. Used with small data sets and does not chunk data.
rxKendallCor Computes Kendall's Tau Rank Correlation Coefficient using xtabs object.
rxPairwiseCrossTab Apply a function to pairwise combinations of rows and columns of an xtabs object.
rxRiskRatio Calculate the relative risk on a two-by-two xtabs object.
rxOddsRatio Calculate the odds ratio on a two-by-two xtabs object.

* Signifies the most popular functions in this category.
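The following sketch shows how the most popular functions in this category could be called on a small in-memory data frame. The `mtcars` data set, the variable names, and the `has_revo` guard are assumptions for the example; the grouping columns are converted to factors first because the cross-tabulation formula here uses factor variables.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Convert the grouping columns to factors for by-group and cross-tab formulas.
  cars <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

  # Summary statistics for two numeric variables.
  print(rxSummary(~ mpg + hp, data = cars))

  # By-group summary: statistics of mpg within each level of cyl.
  print(rxSummary(mpg ~ cyl, data = cars))

  # Counts for every combination of cyl and gear.
  counts <- rxCrossTabs(~ cyl:gear, data = cars)
  print(counts)

  # Approximate quartiles of mpg, computed without sorting the data.
  quartiles <- rxQuantile("mpg", data = cars, probs = c(0.25, 0.5, 0.75))
  print(quartiles)
}
```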

Prediction functions for statistical modeling

Function name Description
rxLinMod * Fits a linear model to data.
rxLogit * Fits a logistic regression model to data.
rxGlm * Fits a generalized linear model to data.
rxCovCor * Calculate the covariance, correlation, or sum of squares / cross-product matrix for a set of variables.
rxDTree * Fits a classification or regression tree to data.
rxBTrees * Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm.
rxDForest * Fits a classification or regression decision forest to data.
rxPredict * Calculates predictions for fitted models. Output must be an XDF data source.
rxKmeans * Performs k-means clustering.
rxNaiveBayes * Performs Naive Bayes classification.
rxCov Calculate the covariance matrix for a set of variables.
rxCor Calculate the correlation matrix for a set of variables.
rxSSCP Calculate the sum of squares / cross-product matrix for a set of variables.
rxRoc Receiver Operating Characteristic (ROC) computations using actual and predicted values from binary classifier system.

* Signifies the most popular functions in this category.
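A fit-then-predict workflow could be sketched as below. The model formula, the file names, and the `has_revo` guard are illustrative assumptions. The data is first written to an .xdf file so that the predictions are also written to an XDF data source.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Store the data in an .xdf file so the model and predictions use XDF data.
  cars_xdf <- file.path(tempdir(), "cars.xdf")
  pred_xdf <- file.path(tempdir(), "cars_pred.xdf")
  rxDataStep(inData = mtcars, outFile = cars_xdf, overwrite = TRUE)

  # Fit mpg as a linear function of weight and horsepower.
  fit <- rxLinMod(mpg ~ wt + hp, data = cars_xdf)
  print(summary(fit))

  # Write the predictions to a separate .xdf file, then inspect it.
  rxPredict(modelObject = fit, data = cars_xdf, outData = pred_xdf, overwrite = TRUE)
  print(rxGetInfo(pred_xdf, getVarInfo = TRUE, numRows = 3))
}
```

The same pattern should carry over to rxLogit, rxGlm, rxDTree, and the other model-fitting functions, since each returns a model object that rxPredict can consume.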

2-Compute context functions

Function name Description
rxSetComputeContext Sets a compute context.
rxGetComputeContext Gets the current compute context.
RxHadoopMR Creates an in-data, file-based Hadoop compute context.
RxSpark Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark.
RxInTeradata Creates an in-database compute context for Teradata.
RxInSqlServer Creates an in-database compute context for SQL Server.
RxComputeContext Creates a compute context.
RxLocalSeq Creates a local compute context for rxExec using sequential computations.
RxLocalParallel Creates a local compute context for rxExec using the parallel package as backend.
RxForeachDoPar Creates a compute context for rxExec using the current foreach parallel backend.
rxInstalledPackages Returns the list of installed packages for a compute context.
rxFindPackage Returns the path to one or more packages for a compute context.
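Switching compute contexts can be sketched as follows. The example stays on the local machine, because remote contexts such as RxSpark or RxInSqlServer need cluster or connection details that are not given here. The variable name `original_context` and the `has_revo` guard are assumptions for illustration.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Remember the active compute context so it can be restored afterwards.
  original_context <- rxGetComputeContext()

  # Switch to a local sequential context, built with an Rx-prefixed constructor.
  rxSetComputeContext(RxLocalSeq())
  print(rxGetComputeContext())

  # Restore whatever context was active before.
  rxSetComputeContext(original_context)
}
```

Saving and restoring the previous context keeps a script from silently changing where later computations run.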

3-Data source functions

Function name Description
RxXdfData Creates an efficient XDF data source object.
RxTextData Creates a comma-delimited text data source object.
RxSasData Creates a SAS data source object.
RxSpssData Creates an SPSS data source object.
RxOdbcData Creates an ODBC data source object.
RxTeradata Creates a Teradata data source object.
RxSqlServerData Creates a SQL Server data source object.

4-HPC and distributed computing functions

These functions and many more can be used for high performance computing and distributed computing. Learn more about the entire set of functions in the Distributed Computing guide.

Function name Description
rxExec Run an arbitrary R function on nodes or cores of a cluster.
rxRngNewStream Support for Parallel Random Number Generation.
rxRngDelStream Support for Parallel Random Number Generation.
rxRngGetStream Support for Parallel Random Number Generation.
rxRngSetStream Support for Parallel Random Number Generation.
rxGetAvailableNodes Get all the available nodes on a distributed compute context.
rxGetNodeInfo Get information on nodes specified for a distributed compute context.
rxPingNodes Test round trip from user through computation node(s) in a cluster or cloud.
rxGetJobStatus Get the status of a non-waiting distributed computing job.
rxGetJobResults Get the return object(s) of a non-waiting distributed computing job.
rxGetJobOutput Get the console output from a non-waiting distributed computing job.
rxGetJobs Get the available distributed computing job information objects.
rxLocateFile Get the first occurrence of a specified input file in a set of specified paths.
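As a rough sketch, rxExec can run an ordinary R function several times, or once per element of an argument. The example uses whatever compute context is currently active, which is assumed here to be a local one. The anonymous functions, the argument values, and the `has_revo` guard are illustrative.

```r
# Skip the example on machines without Microsoft R.
has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  library(RevoScaleR)

  # Run the same function four times; each run returns one list element.
  means <- rxExec(function(n) mean(rnorm(n)), n = 1000, timesToRun = 4)

  # Run once per element: rxElemArg hands each task a different value of x.
  squares <- rxExec(function(x) x^2, x = rxElemArg(1:3))
  print(squares)
}
```

On a distributed compute context, the same calls would spread the tasks across nodes or cores instead of running them in sequence.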

5-Utility functions

Some of the utility functions are operational in local compute context only. Check the documentation of individual functions to confirm.

Function name Description
rxOptions Gets or sets a specific option.
rxGetOption Retrieves a specific RevoScaleR option.
rxGetEnableThreadPool Gets the current state of the thread pool, which on Linux can be either persistent or on-demand.
rxSetEnableThreadPool Sets the thread pool state.
rxStepControl Construct a variableSelection argument for rxLinMod.
rxIsOpen Indicates whether a data source can be accessed.
rxSqlServerDropTable Execute an SQL statement that drops a table.
rxSqlServerTableExists Execute an SQL statement that checks for a table's existence.
rxWriteNext Writes the next chunk when moving data between RevoScaleR data sources.

Next steps

Add R packages to your computer by running setup for R Server or R Client.

Next, follow the RevoScaleR tutorials for hands-on experience.

See also

R Package Reference