RevoScaleR Functions for Spark on Hadoop

The RevoScaleR package provides a set of portable, scalable, distributable data analysis functions. This page presents a curated list of functions that might be particularly interesting to Hadoop users. These functions can be called directly from the command line.

The RevoScaleR package supports two Hadoop compute contexts:

  • RxSpark (recommended), a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.

  • RxHadoopMR (deprecated), a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.

On Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.

## Data Analysis Functions

#### Import and Export Functions


| Function Name | Description |
|---|---|
| `rxDataStep` | Transforms and subsets data. Creates an .xdf file, a comma-delimited text file, or a data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame. |
| `RxXdfData` | Creates an efficient XDF data source object. |
| `RxTextData` | Creates a comma-delimited text data source object. |
| `rxGetInfo` | Retrieves summary information from a data source or data frame. |
| `rxGetVarInfo` | Retrieves variable information from a data source or data frame. |
| `rxGetVarNames` | Retrieves variable names from a data source or data frame. |
| `RxHdfsFileSystem` | Creates an HDFS file system object. |
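
As an illustration of how these functions fit together, the following is a minimal sketch rather than a definitive recipe. It assumes RevoScaleR is installed, that a composite XDF data set already exists in HDFS, and that the paths and the variable names `x` and `y` (all hypothetical) are replaced with real ones.

```r
library(RevoScaleR)

# HDFS file system object; called without arguments it uses the default
# name node settings configured for RevoScaleR.
hdfsFS <- RxHdfsFileSystem()

# Hypothetical HDFS locations; on HDFS an XDF "file" is a composite set of files.
xdfIn  <- RxXdfData(file = "/share/example/inputXdf",  fileSystem = hdfsFS)
xdfOut <- RxXdfData(file = "/share/example/subsetXdf", fileSystem = hdfsFS)

# Inspect the input before transforming it.
rxGetInfo(xdfIn, getVarInfo = TRUE)
rxGetVarNames(xdfIn)

# Write a subset of the variables to a new XDF data source.
# "x" and "y" are placeholder variable names.
rxDataStep(inData = xdfIn, outFile = xdfOut,
           varsToKeep = c("x", "y"), overwrite = TRUE)
```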

#### Manipulation, Cleansing, and Transformation Functions
| Function Name | Description |
|---|---|
| `rxDataStep` | Transforms and subsets data. Creates an .xdf file, a comma-delimited text file, or a data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame. |
| `rxFactors` | Creates or recodes factor variables in a composite XDF file in HDFS. A new file must be written out. |

#### Analysis Functions for Descriptive Statistics and Cross-Tabulations
| Function Name | Description |
|---|---|
| `rxQuantile` | Computes approximate quantiles for .xdf files and data frames without sorting. |
| `rxSummary` | Computes basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported. |
| `rxCrossTabs` | Formula-based cross-tabulation of data. |
| `rxCube` | Alternative formula-based cross-tabulation designed for efficient representation, returning ‘cube’ results. Writing output to an .xdf file is not supported. |
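
A short sketch of the descriptive functions, under the assumption that a composite XDF data set exists at the hypothetical HDFS path shown and contains a numeric variable `ArrDelay` and a factor variable `DayOfWeek` (both placeholder names):

```r
library(RevoScaleR)

# Hypothetical composite XDF data set in HDFS.
hdfsFS  <- RxHdfsFileSystem()
xdfData <- RxXdfData(file = "/share/example/airlineXdf", fileSystem = hdfsFS)

# Summary statistics for one variable, then by group.
rxSummary(~ ArrDelay, data = xdfData)
rxSummary(ArrDelay ~ DayOfWeek, data = xdfData)

# Approximate quantiles, computed without sorting the data.
rxQuantile("ArrDelay", data = xdfData, probs = c(0.25, 0.5, 0.75))

# Cross-tabulation of counts by a factor variable.
rxCrossTabs(~ DayOfWeek, data = xdfData)
```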


#### Analysis, Learning, and Prediction Functions for Statistical Modeling
| Function Name | Description |
|---|---|
| `rxLinMod` | Fits a linear model to data. |
| `rxLogit` | Fits a logistic regression model to data. |
| `rxGlm` | Fits a generalized linear model to data. |
| `rxCovCor` | Calculates the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
| `rxDTree` | Fits a classification or regression tree to data. |
| `rxBTrees` | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
| `rxDForest` | Fits a classification or regression decision forest to data. |
| `rxPredict` | Calculates predictions for fitted models. Output must be an XDF data source. |
| `rxKmeans` | Performs k-means clustering. |
| `rxNaiveBayes` | Fits naive Bayes classifiers on an .xdf file or data frame, for small or large data, using a parallel external memory algorithm. |
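
The requirement that `rxPredict` output be an XDF data source can be sketched as follows. This is an illustration under stated assumptions, not a reference workflow: the HDFS paths and the variables `IsLate`, `DayOfWeek`, and `CRSDepTime` are hypothetical.

```r
library(RevoScaleR)

hdfsFS <- RxHdfsFileSystem()

# Hypothetical input data and output location, both composite XDF in HDFS.
trainData <- RxXdfData(file = "/share/example/airlineXdf", fileSystem = hdfsFS)
predOut   <- RxXdfData(file = "/share/example/airlinePredXdf", fileSystem = hdfsFS)

# Fit a logistic regression; IsLate is assumed to be a 0/1 or logical variable.
model <- rxLogit(IsLate ~ DayOfWeek + CRSDepTime, data = trainData)
summary(model)

# Score the data; predictions are written to the XDF data source in outData.
rxPredict(modelObject = model, data = trainData,
          outData = predOut, overwrite = TRUE)
```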

## Compute Context Functions

| Function Name | Description |
|---|---|
| `RxHadoopMR` | Creates an in-data, file-based Hadoop compute context. |
| `RxSpark` | Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. |
| `rxSparkConnect` | Creates a persistent Spark compute context. |
| `rxSparkDisconnect` | Disconnects a Spark session and returns to a local compute context. |
| `rxInstalledPackages` | Returns the list of installed packages for a compute context. |
| `rxFindPackage` | Returns the path to one or more packages for a compute context. |
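
A minimal sketch of opening and closing a persistent Spark session, assuming the code runs on a cluster node where the default connection settings are valid:

```r
library(RevoScaleR)

# Start a persistent Spark application and make it the active compute context.
cc <- rxSparkConnect()

# Confirm which compute context is active.
rxGetComputeContext()

# ... run RevoScaleR analysis functions here ...

# End the Spark session and return to the local compute context.
rxSparkDisconnect(cc)
```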

## Data Source Functions

Not all data source types are available on all compute contexts. Both Hadoop compute contexts can use XDF and delimited text data sources; the Hive and Parquet data sources apply to the RxSpark compute context.

| Function Name | Description | Help |
|---|---|---|
| `RxXdfData` | Creates an efficient XDF data source object. | [**View**](rxxdfdata.md) |
| `RxTextData` | Creates a comma-delimited text data source object. | [**View**](rxtextdata.md) |
| `RxHiveData` | Generates a Hive data source object. | [**View**](rxsparkdata.md) |
| `RxParquetData` | Generates a Parquet data source object. | [**View**](rxsparkdata.md) |
| `rxSparkDataOps` | Lists cached `RxParquetData` or `RxHiveData` data source objects. | [**View**](rxsparkdataops.md) |
| `rxSparkRemoveData` | Removes cached `RxParquetData` or `RxHiveData` data source objects. | [**View**](rxsparkdataops.md) |

## High Performance Computing and Distributed Computing Functions

The Hadoop compute context has a number of helpful functions used for high performance computing and distributed computing. Learn more about the entire set of functions in the Distributed Computing guide.

| Function Name | Description |
|---|---|
| `rxExec` | Run an arbitrary R function on nodes or cores of a cluster. |
| `rxGetJobStatus` | Get the status of a non-waiting distributed computing job. |
| `rxGetJobResults` | Get the return object(s) of a non-waiting distributed computing job. |
| `rxGetJobOutput` | Get the console output from a non-waiting distributed computing job. |
| `rxGetJobs` | Get the available distributed computing job information objects. |
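
The non-waiting job functions can be sketched as below. This assumes a Spark compute context created with `wait = FALSE`, in which case a distributed call is expected to return a job information object instead of results; the exact behavior may vary by version and cluster setup.

```r
library(RevoScaleR)

# Non-blocking Spark compute context (default connection settings assumed).
cc <- RxSpark(wait = FALSE)
rxSetComputeContext(cc)

# Submit an arbitrary R function; with wait = FALSE this returns a job object.
job <- rxExec(function() Sys.info()[["nodename"]])

# Poll the job, then collect console output and return values when it is done.
rxGetJobStatus(job)
rxGetJobOutput(job)
results <- rxGetJobResults(job)

# Return to the local compute context.
rxSetComputeContext("local")
```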

## Hadoop Convenience Functions

RevoScaleR also provides some wrapper functions for accessing Hadoop/HDFS functionality via R. These functions require access to Hadoop, either locally or remotely via the RxHadoopMR or RxSpark compute contexts.

| Function Name | Description |
|---|---|
| `rxHadoopCommand` | Execute an arbitrary Hadoop command. Allows you to run basic Hadoop commands. |
| `rxHadoopVersion` | Return the current Hadoop version. |
| `rxHadoopCopyFromClient` | Copy a file from a remote client to the Hadoop cluster's local file system, and then to HDFS. |
| `rxHadoopCopyFromLocal` | Copy a file from the native file system to HDFS. Wraps the Hadoop `fs -copyFromLocal` command. |
| `rxHadoopCopy` | Copy a file in the Hadoop Distributed File System (HDFS). Wraps the Hadoop `fs -cp` command. |
| `rxHadoopRemove` | Remove a file in HDFS. Wraps the Hadoop `fs -rm` command. |
| `rxHadoopListFiles` | List files in an HDFS directory. Wraps the Hadoop `fs -ls` or `fs -lsr` command. |
| `rxHadoopMakeDir` | Make a directory in HDFS. Wraps the Hadoop `fs -mkdir` command. |
| `rxHadoopMove` | Move a file in HDFS. Wraps the Hadoop `fs -mv` command. |
| `rxHadoopRemoveDir` | Remove a directory in HDFS. Wraps the Hadoop `fs -rmr` command. |
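
As a rough sketch of a typical sequence with these wrappers, assuming access to Hadoop from the current session and using hypothetical local and HDFS paths:

```r
library(RevoScaleR)

# Report the Hadoop version visible to this session.
rxHadoopVersion()

# Create an HDFS directory and copy a local file into it.
# Both paths are placeholders.
rxHadoopMakeDir("/share/example")
rxHadoopCopyFromLocal("data/input.csv", "/share/example")

# List the directory contents, then clean up.
rxHadoopListFiles("/share/example")
rxHadoopRemoveDir("/share/example")
```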