Get started with R Server cluster on Azure HDInsight
Azure HDInsight includes an R Server option that you can integrate into your HDInsight cluster. This option allows R scripts to use Spark and MapReduce to run distributed computations. In this article, you learn how to create an R Server cluster on HDInsight. You then learn how to run an R script that demonstrates using Spark for distributed R computations.
Prerequisites
- An Azure subscription: Before you begin this tutorial, you must have an Azure subscription. For more information, see Get Microsoft Azure free trial.
- A Secure Shell (SSH) client: An SSH client is used to remotely connect to the HDInsight cluster and run commands directly on the cluster. For more information, see Use SSH with HDInsight.
Create the cluster using the Azure portal
Sign in to the Azure portal.
Click Create a resource > Data + Analytics > HDInsight.
From Basics, enter the following information:
- Cluster Name: The name of the HDInsight cluster.
- Subscription: Select the subscription to use.
- Cluster login username and Cluster login password: The login when accessing the cluster over HTTPS. You use these credentials to access services such as the Ambari Web UI or REST API.
- Secure Shell (SSH) username: The login used when accessing the cluster over SSH. By default, the SSH password is the same as the cluster login password.
- Resource Group: The resource group to create the cluster in.
- Location: The Azure region to create the cluster in.
Select Cluster type, and then set the following values in the Cluster configuration section:
- Cluster Type: R Server
- Operating system: Linux
- Version: R Server 9.1 (HDI 3.6). Release notes for the available versions of R Server are available on docs.microsoft.com.
- R Studio community edition for R Server: This browser-based IDE is installed by default on the edge node. Clear the check box if you prefer to not have it installed. If you choose to have it installed, the URL for accessing the RStudio Server login is available on the portal application blade for your cluster once it’s been created.
After you choose these values, use the Select button to set the cluster type. Then use the Next button to finish the basic configuration.
From the Storage section, select or create a Storage account. For the steps in this document, leave the other fields in this section at the default values. Use the Next button to save storage configuration.
From the Summary section, review the configuration for the cluster. Use the Edit links to change any settings that are incorrect. Finally, use the Create button to create the cluster.
It can take up to 20 minutes to create the cluster.
Connect to RStudio Server
If you chose to install RStudio Server Community Edition as part of your HDInsight cluster, then you can access the RStudio login using one of the following two methods:
Option 1 - Go to the following URL (where CLUSTERNAME is the name of the R Server cluster you created):
https://CLUSTERNAME.azurehdinsight.net/rstudio/
Option 2 - Open the R Server cluster in the Azure portal and, under Quick links, click R Server Dashboards.
From Cluster Dashboards, click R Studio Server.
Regardless of the method used, the first time you log in you need to authenticate twice. For the first authentication prompt, provide the cluster Admin user ID and password. For the second authentication prompt, provide the SSH user ID and password. Subsequent logins require only the SSH credentials.
Once you are connected, your screen should resemble the following screenshot:
Run a sample job
You can submit a job using ScaleR functions. Here is an example of the commands used to run a job:
# Set the HDFS (WASB) location of example data.
bigDataDirRoot <- "/example/data"

# Create a local folder for storing data temporarily.
source <- "/tmp/AirOnTimeCSV2012"
dir.create(source)

# Download data to the tmp folder.
remoteDir <- "https://packages.revolutionanalytics.com/datasets/AirOnTimeCSV2012"
download.file(file.path(remoteDir, "airOT201201.csv"), file.path(source, "airOT201201.csv"))
download.file(file.path(remoteDir, "airOT201202.csv"), file.path(source, "airOT201202.csv"))
download.file(file.path(remoteDir, "airOT201203.csv"), file.path(source, "airOT201203.csv"))
download.file(file.path(remoteDir, "airOT201204.csv"), file.path(source, "airOT201204.csv"))
download.file(file.path(remoteDir, "airOT201205.csv"), file.path(source, "airOT201205.csv"))
download.file(file.path(remoteDir, "airOT201206.csv"), file.path(source, "airOT201206.csv"))
download.file(file.path(remoteDir, "airOT201207.csv"), file.path(source, "airOT201207.csv"))
download.file(file.path(remoteDir, "airOT201208.csv"), file.path(source, "airOT201208.csv"))
download.file(file.path(remoteDir, "airOT201209.csv"), file.path(source, "airOT201209.csv"))
download.file(file.path(remoteDir, "airOT201210.csv"), file.path(source, "airOT201210.csv"))
download.file(file.path(remoteDir, "airOT201211.csv"), file.path(source, "airOT201211.csv"))
download.file(file.path(remoteDir, "airOT201212.csv"), file.path(source, "airOT201212.csv"))

# Set directory in bigDataDirRoot to load the data.
inputDir <- file.path(bigDataDirRoot, "AirOnTimeCSV2012")

# Create the directory.
rxHadoopMakeDir(inputDir)

# Copy the data from source to input.
rxHadoopCopyFromLocal(source, bigDataDirRoot)

# Define the HDFS (WASB) file system.
hdfsFS <- RxHdfsFileSystem()

# Create info list for the airline data.
airlineColInfo <- list(
  DAY_OF_WEEK = list(type = "factor"),
  ORIGIN = list(type = "factor"),
  DEST = list(type = "factor"),
  DEP_TIME = list(type = "integer"),
  ARR_DEL15 = list(type = "logical"))

# Get all the column names.
varNames <- names(airlineColInfo)

# Define the text data source in HDFS.
airOnTimeData <- RxTextData(inputDir, colInfo = airlineColInfo, varsToKeep = varNames, fileSystem = hdfsFS)

# Define the text data source in local system.
airOnTimeDataLocal <- RxTextData(source, colInfo = airlineColInfo, varsToKeep = varNames)

# Specify the formula to use.
formula <- "ARR_DEL15 ~ ORIGIN + DAY_OF_WEEK + DEP_TIME + DEST"

# Define the Spark compute context.
mySparkCluster <- RxSpark()

# Set the compute context.
rxSetComputeContext(mySparkCluster)

# Run a logistic regression.
system.time(
  modelSpark <- rxLogit(formula, data = airOnTimeData)
)

# Display a summary.
summary(modelSpark)
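The twelve download.file calls in the sample differ only in the month number, so they could be generated in a loop instead. The following is a minimal sketch in base R and is not part of the original sample. The helper names airlineFileNames and downloadAirlineData are hypothetical, and the sketch assumes the remote files keep the airOTYYYYMM.csv naming shown above.

```r
# Hypothetical helper: build the monthly file names,
# for example airOT201201.csv through airOT201212.csv.
airlineFileNames <- function(months = 1:12, year = 2012) {
  sprintf("airOT%d%02d.csv", year, months)
}

# Hypothetical helper: download each monthly file into localDir.
# Returns the local paths invisibly.
downloadAirlineData <- function(remoteDir, localDir, months = 1:12) {
  dir.create(localDir, showWarnings = FALSE, recursive = TRUE)
  fileNames <- airlineFileNames(months)
  for (f in fileNames) {
    download.file(file.path(remoteDir, f), file.path(localDir, f))
  }
  invisible(file.path(localDir, fileNames))
}

# Usage with the same locations as the sample (left commented out so that
# nothing is downloaded until you choose to run it):
# downloadAirlineData(
#   "https://packages.revolutionanalytics.com/datasets/AirOnTimeCSV2012",
#   "/tmp/AirOnTimeCSV2012")
```

Passing a shorter months vector, such as 1:3, would fetch only part of the year, which can be useful for a quick trial run before loading the full data set.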
Connect to the cluster edge node
In this section, you learn how to connect to the edge node of an R Server HDInsight cluster using SSH. For more information on using SSH, see Use SSH with HDInsight.
The SSH command to connect to the R Server cluster edge node is:
ssh USERNAME@CLUSTERNAME-ed-ssh.azurehdinsight.net
To find the SSH command for your cluster, from the Azure portal click the cluster name, click SSH + Cluster login, and then for Hostname, select the edge node. This displays the SSH Endpoint information for the edge node.
If you used a password to secure your SSH user account, you are prompted to enter it. If you used a public key, you may have to use the -i parameter to specify the matching private key. For example:
ssh -i ~/.ssh/id_rsa USERNAME@CLUSTERNAME-ed-ssh.azurehdinsight.net
Once connected, you arrive at a prompt similar to the following:
Use the R Server console
From the SSH session, use the following command to start the R console:
R
You should see output that shows the version of R Server, in addition to other information.
From the > prompt, you can enter R code. R Server on HDInsight includes packages that allow you to easily interact with Hadoop and run distributed computations. For example, use the following command to view the root of the default file system for the HDInsight cluster:
rxHadoopListFiles("/")
You can also use WASB-style addressing:
rxHadoopListFiles("wasb:///")
To quit the R console, use the following command:
quit()
Automated cluster creation
You can automate the creation of an R Server cluster for HDInsight by using Azure Resource Manager templates, the .NET SDK, or PowerShell.
- To create an R Server cluster using an Azure Resource Manager template, see Deploy an R Server for HDInsight cluster.
- To create an R Server cluster using the .NET SDK, see Create Linux-based clusters in HDInsight using the .NET SDK.
- To create an R Server cluster using PowerShell, see Create HDInsight clusters using Azure PowerShell.
Delete the cluster
Billing for HDInsight clusters is prorated per minute, whether you are using them or not. Be sure to delete your cluster after you have finished using it. For more information, see How to delete an HDInsight cluster.
If you run into issues with creating HDInsight clusters, see access control requirements.
In this article, you learned how to create a new R Server cluster in Azure HDInsight and the basics of using the R console from an SSH session. The following articles explain other ways of managing and working with R Server on HDInsight: