RevoScaleR package

The RevoScaleR library is a collection of portable, scalable, and distributable R functions for importing, transforming, and analyzing data at scale. You can use it for descriptive statistics, generalized linear models, k-means clustering, logistic regression, classification and regression trees, and decision forests.

Functions run on the RevoScaleR interpreter, which is built on open-source R and engineered to leverage the multithreaded and multinode architecture of the host platform.

Package details
Version: 9.2.1
Runs on:
  • Machine Learning Server 9.2.1
  • R Client (Windows and Linux)
  • R Server 9.1 and earlier
  • SQL Server 2016 and later (Windows only)
  • Azure HDInsight
  • Azure Data Science Virtual Machines
Built on: R 3.3.x (included when you install a product that provides this package).

How to use RevoScaleR

The RevoScaleR library ships with Machine Learning Server and Microsoft R products. You can write R script that calls RevoScaleR functions in any R IDE, but the script must run on a computer that has the interpreter and libraries installed.

RevoScaleR is often preloaded into tools that integrate with Machine Learning Server and R Client, which means you can call functions without having to load the library. If the library is not loaded, you can load RevoScaleR from the command line by typing library(RevoScaleR).

Run it locally

Local execution is the default. RevoScaleR runs locally on all platforms, including R Client. On a standalone Linux or Windows system, data and operations are local to the machine. On Hadoop, a local compute context means that data and operations are local to the current execution environment (typically, an edge node).
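As a minimal sketch, the local compute context can be selected and inspected as follows. The have_rx guard is an addition for illustration only; it skips the calls on a machine where RevoScaleR is not installed.

```r
# Guard (illustrative): only run the calls when RevoScaleR is available.
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # "local" selects the default local compute context.
  rxSetComputeContext("local")

  # Inspect the active compute context to confirm the setting.
  print(rxGetComputeContext())
}
```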

Run in a remote compute context

RevoScaleR runs remotely on computers that have a server installation. In a remote compute context, a script running on a local R Client or Machine Learning Server shifts execution to a remote Machine Learning Server. For example, a script running on Windows might shift execution to a Spark cluster to process data there.

On distributed platforms, such as Hadoop processing frameworks (Spark and MapReduce), set the compute context to RxSpark or RxHadoopMR and give the cluster name. In this context, if you call a function that can run in parallel, the task is distributed across data nodes in the cluster, where the operation is co-located with the data.
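The following is a hedged sketch of switching to a Spark compute context. The user name, host name, and the helper use_spark_context are hypothetical placeholders, and a real cluster usually needs further arguments that depend on its configuration. The calls sit inside a function so that nothing connects until you run it on a machine that has RevoScaleR and access to the cluster.

```r
# Hypothetical connection details; replace them with values for your cluster.
spark_user <- "myuser"
spark_host <- "my-cluster-ssh.example.com"

# Hypothetical helper: creates an RxSpark compute context and makes it active.
use_spark_context <- function(user = spark_user, host = spark_host) {
  cc <- RevoScaleR::RxSpark(
    sshUsername   = user,  # account used for the SSH connection
    sshHostname   = host,  # host name of the cluster's edge or head node
    consoleOutput = TRUE   # echo remote console output in the local session
  )
  RevoScaleR::rxSetComputeContext(cc)
  invisible(cc)
}
```

After calling use_spark_context(), RevoScaleR functions that can run in parallel are distributed across the cluster; calling rxSetComputeContext("local") returns execution to the local machine.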

On SQL Server, set the compute context to RxInSqlServer. There are two primary use cases for a remote compute context:

  • Call R functions in T-SQL script or stored procedures running on SQL Server.

  • Call RevoScaleR functions in R script executing in a SQL Server compute context. In your script, you can set a compute context to shift execution of RevoScaleR operations to a remote SQL Server instance that has the RevoScaleR interpreter.
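The second use case can be sketched as follows. The connection string and the helper use_sql_context are hypothetical; substitute the server, database, and authentication settings of your own instance. As before, the calls are wrapped in a function so that nothing connects until you invoke it.

```r
# Hypothetical connection string; adjust server, database, and credentials.
sql_conn <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=True"

# Hypothetical helper: creates an in-database compute context and makes it active.
use_sql_context <- function(conn_str = sql_conn) {
  cc <- RevoScaleR::RxInSqlServer(
    connectionString = conn_str,
    wait             = TRUE,   # block until each remote job finishes
    consoleOutput    = FALSE   # do not echo remote console output
  )
  RevoScaleR::rxSetComputeContext(cc)
  invisible(cc)
}
```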

Some functions in RevoScaleR are specific to particular compute contexts, such as Hadoop, SQL Server, or Teradata.

Typical workflow


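Whenever you want to perform an analysis using RevoScaleR functions, you should specify three distinct pieces of information: where the computations should take place (the compute context), the data to use (the data source), and what analysis to perform (the analysis function).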
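A minimal sketch of such a workflow, with a compute context, a data source, and an analysis function, might look like the following. It assumes the AirlineDemoSmall.xdf sample file is present in the package's sample data directory; the have_rx guard is illustrative and skips the calls when RevoScaleR is not installed.

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # 1. Compute context: where the computation runs.
  rxSetComputeContext("local")

  # 2. Data source: an XDF sample file (assumed to ship with the package).
  airline <- RxXdfData(
    file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")
  )

  # 3. Analysis function: summary statistics for two variables.
  airline_summary <- rxSummary(~ ArrDelay + DayOfWeek, data = airline)
  print(airline_summary)
}
```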

Functions by category

The library includes data transformation and manipulation, visualization, predictions, and statistical analysis functions. It also includes functions for controlling jobs, serializing data, and performing common utility tasks.

This section lists the functions by category to give you an idea of how each one is used. The table of contents lists functions in alphabetical order.

Note

Some function names begin with rx and others with Rx. The Rx function name prefix is used for class constructors for data sources and compute contexts.

1-Data source functions

Function name Description
RxXdfData Creates an efficient XDF data source object.
RxTextData Creates a comma-delimited text data source object.
RxSasData Creates a SAS data source object.
RxSpssData Creates an SPSS data source object.
RxOdbcData Creates an ODBC data source object.
RxTeradata Creates a Teradata data source object.
RxSqlServerData Creates a SQL Server data source object.

2-Import and save-as

Function name Description
rxImport * Creates an .xdf file or data frame from a data source (for example, text, SAS, SPSS data files, ODBC or Teradata connection, or data frame).
rxDataStep * Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame.
rxGetInfo * Retrieves summary information from a data source or data frame.
rxSetInfo * Sets a file description in an .xdf file or a description attribute in a data frame.
rxGetVarInfo Retrieves variable information from a data source or data frame.
rxSetVarInfo Modifies variable information in an .xdf file or data frame.
rxGetVarNames Retrieves variable names from a data source or data frame.
rxCreateColInfo Generates a colInfo list from a data source.
rxCompressXdf Compresses an existing .xdf file, or a directory of .xdf files.
RxXdfData Creates an efficient XDF data source object.
RxTextData Creates a comma-delimited text data source object.
RxSasData Creates a SAS data source object.
RxSpssData Creates an SPSS data source object.
RxOdbcData Creates an ODBC data source object.
RxTeradata Creates a Teradata data source object.
RxSqlServerData Creates a SQL Server data source object.
rxOpen Opens a data source for reading.
rxClose Closes a data source.
rxReadNext Read data from a source.
rxSetFileSystem Specify a file system type for data for import.
rxGetFileSystem Retrieve the current file system type.
rxHdfsFileSystem Creates an HDFS file system object.
rxNativeFileSystem Creates a native file system object.

* Signifies the most popular functions in this category.
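As an illustration of the import functions, the sketch below converts a text file to an .xdf file and then inspects it. It assumes the AirlineDemoSmall.csv sample file is available in the sample data directory; the output path under tempdir() is an arbitrary choice for the example.

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # Input: a sample CSV file (assumed to ship with the package).
  csv_path <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")

  # Output: an .xdf file in the session's temporary directory.
  xdf_path <- file.path(tempdir(), "airline_demo.xdf")

  # Import the text data; overwrite = TRUE replaces an existing output file.
  airline_xdf <- rxImport(inData = csv_path, outFile = xdf_path, overwrite = TRUE)

  # Show variable information and the first three rows.
  print(rxGetInfo(airline_xdf, getVarInfo = TRUE, numRows = 3))
}
```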

3-Data transformation

Function name Description
rxDataStep * Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame.
rxFactors * Recode a factor variable or convert non-factor variable into a factor in an .xdf file or data frame.
rxGetFuzzyDist Get fuzzy distances for a character vector.
rxGetFuzzyKeys Get fuzzy keys for a character vector.
rxSplit Splits an .xdf file or data frame into multiple .xdf files or data frames.
rxSort Multi-key sorting of the variables in an .xdf file or data frame.
rxMerge Merges two .xdf files or data frames using a variety of merge types.
rxExecuteSQLDDL SQL Server R Services only. Runs an arbitrary SQL DDL command.

* Signifies the most popular functions in this category.
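For example, rxDataStep can subset rows and derive a new variable in one pass. This sketch uses the built-in iris data frame so that it needs no external files; the derived column name Sepal.Ratio is made up for the example.

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # No outFile is given, so the result is returned as a data frame in memory.
  iris_subset <- rxDataStep(
    inData       = iris,
    rowSelection = Sepal.Length > 5,                             # keep matching rows
    transforms   = list(Sepal.Ratio = Sepal.Length / Sepal.Width) # add a variable
  )

  print(head(iris_subset))
}
```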

4-Graphing functions

Function name Description
rxHistogram Creates a histogram from data.
rxLinePlot Creates a line plot from data.
rxLorenz Computes a Lorenz curve, which can be plotted.
rxRocCurve Computes and plots ROC curves from actual and predicted data.

5-Descriptive statistics

Function name Description
rxQuantile * Computes approximate quantiles for .xdf files and data frames without sorting.
rxSummary * Basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported.
rxCrossTabs * Formula-based cross-tabulation of data.
rxCube * Alternative formula-based cross-tabulation designed for efficient representation, returning cube results. Writing output to an .xdf file is not supported.
rxMarginals Marginal summaries of cross-tabulations.
as.xtabs Converts cross tabulation results to an xtabs object.
rxChiSquaredTest Performs Chi-squared Test on xtabs object. Used with small data sets and does not chunk data.
rxFisherTest Performs Fisher's Exact Test on xtabs object. Used with small data sets and does not chunk data.
rxKendallCor Computes Kendall's Tau Rank Correlation Coefficient using xtabs object.
rxPairwiseCrossTab Apply a function to pairwise combinations of rows and columns of an xtabs object.
rxRiskRatio Calculate the relative risk on a two-by-two xtabs object.
rxOddsRatio Calculate the odds ratio on a two-by-two xtabs object.

* Signifies the most popular functions in this category.
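A short sketch of two of these functions, using data frames that ship with base R (iris and warpbreaks) so that no external files are required:

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # Summary statistics of Sepal.Length, computed by Species group.
  iris_stats <- rxSummary(Sepal.Length ~ Species, data = iris)
  print(iris_stats)

  # Counts for each combination of the two factors wool and tension.
  wb_tab <- rxCrossTabs(~ wool:tension, data = warpbreaks)
  print(wb_tab)
}
```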

6-Prediction functions

Function name Description
rxLinMod * Fits a linear model to data.
rxLogit * Fits a logistic regression model to data.
rxGlm * Fits a generalized linear model to data.
rxCovCor * Calculate the covariance, correlation, or sum of squares / cross-product matrix for a set of variables.
rxDTree * Fits a classification or regression tree to data.
rxBTrees * Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm.
rxDForest * Fits a classification or regression decision forest to data.
rxPredict * Calculates predictions for fitted models. Output must be an XDF data source.
rxKmeans * Performs k-means clustering.
rxNaiveBayes * Performs Naive Bayes classification.
rxCov Calculate the covariance matrix for a set of variables.
rxCor Calculate the correlation matrix for a set of variables.
rxSSCP Calculate the sum of squares / cross-product matrix for a set of variables.
rxRoc Receiver Operating Characteristic (ROC) computations using actual and predicted values from a binary classifier system.

* Signifies the most popular functions in this category.
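The sketch below fits a linear model and writes predictions to an .xdf file. It assumes the AirlineDemoSmall.xdf sample file is available; the prediction file path under tempdir() is an arbitrary choice for the example.

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)

  # Sample XDF data source (assumed to ship with the package).
  airline <- RxXdfData(
    file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")
  )

  # Fit a linear model of arrival delay on day of week.
  fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airline)
  print(summary(fit))

  # Write the predictions to a new XDF file and inspect its variables.
  pred_path <- file.path(tempdir(), "airline_pred.xdf")
  rxPredict(fit, data = airline, outData = pred_path, overwrite = TRUE)
  print(rxGetInfo(pred_path, getVarInfo = TRUE))
}
```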

7-Compute context functions

Function name Description
rxSetComputeContext Sets a compute context.
rxGetComputeContext Gets the current compute context.
RxHadoopMR Creates an in-data, file-based Hadoop compute context.
RxSpark Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark.
RxInTeradata Creates an in-database compute context for Teradata.
RxInSqlServer Creates an in-database compute context for SQL Server.
RxComputeContext Creates a compute context.
RxLocalSeq Creates a local compute context for rxExec using sequential computations.
RxLocalParallel Creates a local compute context for rxExec using the parallel package as the backend.
RxForeachDoPar Creates a compute context for rxExec using the current foreach parallel backend.
rxInstalledPackages Returns the list of installed packages for a compute context.
rxFindPackage Returns the path to one or more packages for a compute context.

8-Distributed computing

These functions and many more can be used for high-performance computing and distributed computing. Learn more about the entire set of functions in Distributed Computing.

Function name Description
rxExec Run an arbitrary R function on nodes or cores of a cluster.
rxRngNewStream Support for Parallel Random Number Generation.
rxRngDelStream Support for Parallel Random Number Generation.
rxRngGetStream Support for Parallel Random Number Generation.
rxRngSetStream Support for Parallel Random Number Generation.
rxGetAvailableNodes Get all the available nodes on a distributed compute context.
rxGetNodeInfo Get information on nodes specified for a distributed compute context.
rxPingNodes Test round trip from user through computation node(s) in a cluster or cloud.
rxGetJobStatus Get the status of a non-waiting distributed computing job.
rxGetJobResults Get the return object(s) of a non-waiting distributed computing job.
rxGetJobOutput Get the console output from a non-waiting distributed computing job.
rxGetJobs Get the available distributed computing job information objects.
rxLocateFile Get the first occurrence of a specified input file in a set of specified paths.
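As a small illustration of rxExec, the sketch below runs the same function several times in the current compute context and collects the results in a list. The function sample_mean is a made-up task for the example; in a local sequential context the runs execute one after another, while a parallel or distributed context would spread them across cores or nodes.

```r
have_rx <- requireNamespace("RevoScaleR", quietly = TRUE)

if (have_rx) {
  library(RevoScaleR)
  rxSetComputeContext("local")

  # Hypothetical task: the mean of n standard normal draws.
  sample_mean <- function(n) mean(rnorm(n))

  # Run the task four times; n is passed through to every run.
  results <- rxExec(sample_mean, n = 1000, timesToRun = 4)
  print(unlist(results))
}
```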

9-Utility functions

Some of the utility functions are operational in a local compute context only. Check the documentation of individual functions to confirm.

Function name Description
rxOptions Gets or sets a specific option.
rxGetOption Retrieves a specific RevoScaleR option.
rxGetEnableThreadPool Gets the current state of the thread pool, which on Linux can be either persistent or on-demand.
rxSetEnableThreadPool Sets the thread pool state.
rxStepControl Constructs a variableSelection argument for rxLinMod.
rxIsOpen Indicates whether a data source can be accessed.
rxSqlServerDropTable Execute an SQL statement that drops a table.
rxSqlServerTableExists Execute an SQL statement that checks for a table's existence.
rxWriteNext Writes the next chunk when moving data between RevoScaleR data sources.

Next steps

Add R packages to your computer by running setup:

Next, follow these tutorials for hands-on experience:

See also

R Package Reference