Quickstart: Execute an R script on an ML Services cluster in Azure HDInsight using RStudio Server

Important

This content is retired and will not be updated in the future. Azure HDInsight 3.6 ML Services (Machine Learning Server) cluster type was retired as of Dec 31, 2020.

ML Services on Azure HDInsight allows R scripts to use Apache Spark and Apache Hadoop MapReduce to run distributed computations. ML Services controls how calls are executed by setting the compute context. The edge node of a cluster provides a convenient place to connect to the cluster and run your R scripts. From the edge node, you can run the parallelized distributed functions of RevoScaleR across the cores of the edge node server, or across the nodes of the cluster by using RevoScaleR's Hadoop MapReduce or Apache Spark compute contexts.
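As a rough illustration of the compute-context idea, the following sketch inspects the active context and then sets it explicitly. It assumes the RevoScaleR package is available, as it is on an ML Services edge node. The `requireNamespace` guard and the `hasRevoScaleR` variable are additions for illustration only, so that the snippet does nothing on a machine without the package.

```r
# hasRevoScaleR is an illustrative variable, not part of the quickstart.
hasRevoScaleR <- requireNamespace("RevoScaleR", quietly = TRUE)

if (hasRevoScaleR) {
  library(RevoScaleR)

  # Print the compute context that is currently active
  print(rxGetComputeContext())

  # Run subsequent RevoScaleR calls on the edge node itself
  rxSetComputeContext("local")
}
```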

In this quickstart, you learn how to run an R script with RStudio Server that demonstrates using Spark for distributed R computations. You define a compute context to run computations locally on the edge node, and then a second compute context to distribute the same computations across the nodes in the HDInsight cluster.

Prerequisite

An ML Services cluster on HDInsight. See Create Apache Hadoop clusters using the Azure portal and select ML Services for Cluster type.

Connect to RStudio Server

RStudio Server runs on the cluster's edge node. Go to the following URL where CLUSTERNAME is the name of the ML Services cluster you created:

https://CLUSTERNAME.azurehdinsight.net/rstudio/

The first time you sign in, you need to authenticate twice. At the first authentication prompt, provide the cluster admin login and password (the default login is admin). At the second authentication prompt, provide the SSH login and password (the default login is sshuser). Subsequent sign-ins require only the SSH credentials.

Once you are connected, your screen should resemble the following screenshot:

Screenshot: RStudio web console overview

Use a compute context

  1. From RStudio Server, use the following code to load example data into the default storage for HDInsight:

     # Set the HDFS (WASB) location of example data
     bigDataDirRoot <- "/example/data"
    
     # Create a local folder for storing data temporarily
     source <- "/tmp/AirOnTimeCSV2012"
     dir.create(source)
    
     # Download the 12 monthly files (airOT201201.csv through airOT201212.csv) to the tmp folder
     remoteDir <- "https://packages.revolutionanalytics.com/datasets/AirOnTimeCSV2012"
     for (month in 1:12) {
        fileName <- sprintf("airOT2012%02d.csv", month)
        download.file(file.path(remoteDir, fileName), file.path(source, fileName))
     }
    
     # Set directory in bigDataDirRoot to load the data into
     inputDir <- file.path(bigDataDirRoot,"AirOnTimeCSV2012")
    
     # Make the directory
     rxHadoopMakeDir(inputDir)
    
     # Copy the data from source to input
     rxHadoopCopyFromLocal(source, bigDataDirRoot)
    

    This step may take around 8 minutes to complete.

  2. Create some data info and define two data sources. Enter the following code in RStudio:

     # Define the HDFS (WASB) file system
     hdfsFS <- RxHdfsFileSystem()
    
     # Create info list for the airline data
     airlineColInfo <- list(
          DAY_OF_WEEK = list(type = "factor"),
          ORIGIN = list(type = "factor"),
          DEST = list(type = "factor"),
          DEP_TIME = list(type = "integer"),
          ARR_DEL15 = list(type = "logical"))
    
     # Get all the column names
     varNames <- names(airlineColInfo)
    
     # Define the text data source in HDFS
     airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)
    
     # Define the text data source on the local file system
     airOnTimeDataLocal <- RxTextData(source, colInfo = airlineColInfo, varsToKeep = varNames)
    
     # Formula to use
     formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"
    
  3. Run a logistic regression over the data using the local compute context. Enter the following code in RStudio:

     # Set a local compute context
     rxSetComputeContext("local")
    
     # Run a logistic regression
     system.time(
        modelLocal <- rxLogit(formula, data = airOnTimeDataLocal)
     )
    
     # Display a summary
     summary(modelLocal)
    

    The computations should complete in about 7 minutes. You should see output that ends with lines similar to the following snippet:

     Data: airOnTimeDataLocal (RxTextData Data Source)
     File name: /tmp/AirOnTimeCSV2012
     Dependent variable(s): ARR_DEL15
     Total independent variables: 634 (Including number dropped: 3)
     Number of valid observations: 6005381
     Number of missing observations: 91381
     -2*LogLikelihood: 5143814.1504 (Residual deviance on 6004750 degrees of freedom)
    
     Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
      (Intercept)   -3.370e+00  1.051e+00  -3.208  0.00134 **
      ORIGIN=JFK     4.549e-01  7.915e-01   0.575  0.56548
      ORIGIN=LAX     5.265e-01  7.915e-01   0.665  0.50590
      ......
      DEST=SHD       5.975e-01  9.371e-01   0.638  0.52377
      DEST=TTN       4.563e-01  9.520e-01   0.479  0.63172
      DEST=LAR      -1.270e+00  7.575e-01  -1.676  0.09364 .
      DEST=BPT         Dropped    Dropped Dropped  Dropped
    
      ---
    
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
      Condition number of final variance-covariance matrix: 11904202
      Number of iterations: 7
    
  4. Run the same logistic regression using the Spark compute context, which distributes the processing over all the worker nodes in the HDInsight cluster. Enter the following code in RStudio:

     # Define the Spark compute context
     mySparkCluster <- RxSpark()
    
     # Set the compute context
     rxSetComputeContext(mySparkCluster)
    
     # Run a logistic regression
     system.time(  
        modelSpark <- rxLogit(formula, data = airOnTimeData)
     )
    
     # Display a summary
     summary(modelSpark)
    

    The computations should complete in about 5 minutes.
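Once both runs finish, one way to check that the local and Spark fits agree is to compare their coefficient estimates by name. The sketch below is an illustration under stated assumptions, not part of the original quickstart: `maxCoefDiff` is a hypothetical helper, and the code assumes that `modelLocal` and `modelSpark` from steps 3 and 4 each expose a named `coefficients` vector. If the two runs happened to pick different reference levels for a factor, individual estimates can differ even though the fitted models are equivalent.

```r
# Hypothetical helper (not part of RevoScaleR): largest absolute difference
# between two named coefficient vectors, matched by coefficient name.
# Coefficients that are NA in either vector (for example, dropped terms)
# are ignored.
maxCoefDiff <- function(a, b) {
  common <- intersect(names(a), names(b))
  max(abs(a[common] - b[common]), na.rm = TRUE)
}

# Only runs when both models from steps 3 and 4 exist in the session.
if (exists("modelLocal") && exists("modelSpark")) {
  print(maxCoefDiff(modelLocal$coefficients, modelSpark$coefficients))

  # Return to the local compute context so later calls run on the edge node
  rxSetComputeContext("local")
}
```

A value close to zero suggests that both compute contexts produced the same fit. Matching by name rather than by position avoids comparing the wrong terms if the coefficients are ordered differently in the two models.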

Clean up resources

After you complete the quickstart, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it is not in use. You are also charged for an HDInsight cluster even when it is not in use. Because the charges for the cluster are many times higher than the charges for storage, it makes economic sense to delete clusters when they are not in use.

To delete a cluster, see Delete an HDInsight cluster using your browser, PowerShell, or the Azure CLI.

Next steps

In this quickstart, you learned how to run an R script with RStudio Server that demonstrated using Spark for distributed R computations. Advance to the next article to learn the options that are available to specify whether and how execution is parallelized across cores of the edge node or HDInsight cluster.

Note

This page describes features of RStudio software. Microsoft Azure HDInsight is not affiliated with RStudio, Inc.