Azure Storage solutions for R Server on HDInsight

Microsoft R Server on HDInsight has a variety of storage solutions to persist data, code, or objects that contain results from analysis. These include the following options:

  • Azure Blob storage
  • Azure Data Lake storage
  • Azure File storage

You also have the option of accessing multiple Azure storage accounts or containers with your HDI cluster. Azure File storage is a convenient data storage option for use on the edge node that enables you to mount an Azure Storage file share to, for example, the Linux file system. Azure File shares can also be mounted and used by any system that has a supported OS, such as Windows or Linux.

When you create a Hadoop cluster in HDInsight, you specify either an Azure Storage account or a Data Lake store. A specific storage container from that account holds the file system for the cluster that you create (for example, the Hadoop Distributed File System). For more information and guidance, see:

  • For more information on the Azure storage solutions, see Introduction to Microsoft Azure Storage.
  • For guidance on selecting the most appropriate storage option to use for your scenario, see Deciding when to use Azure Blobs, Azure Files, or Azure Data Disks.

Use Azure Blob storage accounts with R Server

If you specified more than one storage account when creating your R Server cluster, the following instructions explain how to use a secondary account for data access and operations on R Server. Assume two storage accounts: storage1, which has a default container called container1, and storage2.

Warning

For performance purposes, the HDInsight cluster is created in the same data center as the primary storage account that you specify. Using a storage account in a different location than the HDInsight cluster is not supported.

  1. Using an SSH client, connect to the edge node of your cluster as remoteuser.

    • In Azure portal > HDI cluster service page > Overview, click Secure Shell (SSH).
    • In Hostname, select the edge node (it includes ed-ssh.azurehdinsight.net in the name).
    • Copy the host name.
    • Open an SSH client such as PuTTY or SmarTTY, and enter the host name.
    • Enter remoteuser for the user name, followed by the cluster password.
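
     For example, from a terminal with an SSH client available, the connection takes the following form (the host name shown is a placeholder; use the edge-node host name that you copied):

      ssh remoteuser@CLUSTERNAME-ed-ssh.azurehdinsight.net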
  2. Copy the mycsv.csv file to the /share directory.

     hadoop fs -mkdir /share
     hadoop fs -copyFromLocal mycsv.csv /share
    
  3. Switch to R Studio or another R console, and write R code that sets the name node to default and specifies the location of the file that you want to access.

     myNameNode <- "default"
     myPort <- 0
    
     #Location of the data:  
     bigDataDirRoot <- "/share"  
    
     #Define Spark compute context:
     mySparkCluster <- RxSpark(nameNode=myNameNode, port=myPort, consoleOutput=TRUE)
    
     #Set compute context:
     rxSetComputeContext(mySparkCluster)
    
     #Define the Hadoop Distributed File System (HDFS) file system:
     hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)
    
     #Specify the input file to analyze in HDFS:
     inputFile <- file.path(bigDataDirRoot, "mycsv.csv")
    

All the directory and file references point to the storage account wasb://container1@storage1.blob.core.windows.net. This is the default storage account that's associated with the HDInsight cluster.
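
As a quick check, you can confirm from R that the file landed in /share. The following is a minimal sketch using RevoScaleR's rxHadoopListFiles helper, which wraps hadoop fs -ls against the current name node:

# List the /share directory on the default storage account
rxHadoopListFiles("/share")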

Now, suppose you want to process a file called mySpecial.csv that's located in the /private directory of container2 in storage2.

In your R code, point the name node reference to the storage2 storage account.

myNameNode <- "wasb://container2@storage2.blob.core.windows.net"
myPort <- 0

#Location of the data:
bigDataDirRoot <- "/private"

#Define Spark compute context:
mySparkCluster <- RxSpark(consoleOutput=TRUE, nameNode=myNameNode, port=myPort)

#Set compute context:
rxSetComputeContext(mySparkCluster)

#Define HDFS file system:
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)

#Specify the input file to analyze in HDFS:
inputFile <- file.path(bigDataDirRoot, "mySpecial.csv")

All of the directory and file references now point to the storage account wasb://container2@storage2.blob.core.windows.net. This is the name node that you specified.
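
As with the default account, you can sanity-check the remote path from R. A minimal sketch (rxHadoopListFiles shells out to hadoop fs -ls, which also accepts fully qualified wasb:// paths):

# List the /private directory on the secondary account
rxHadoopListFiles("wasb://container2@storage2.blob.core.windows.net/private")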

You also have to configure the /user/RevoShare/<user> directory on storage2 as follows:

hadoop fs -mkdir wasb://container2@storage2.blob.core.windows.net/user
hadoop fs -mkdir wasb://container2@storage2.blob.core.windows.net/user/RevoShare
hadoop fs -mkdir wasb://container2@storage2.blob.core.windows.net/user/RevoShare/<RDP username>
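
R Server uses the /user/RevoShare/<user> path as per-user scratch space for distributed jobs, which is why the directory must exist on storage2. If you want to set that scratch directory explicitly, RxSpark accepts an hdfsShareDir argument; the following is a sketch under that assumption (verify the argument against your installed RevoScaleR version):

# Sketch: explicitly set the per-user scratch directory that RxSpark jobs use.
# Sys.info()[["user"]] resolves to the current Linux user on the edge node.
mySparkCluster <- RxSpark(nameNode = myNameNode, port = myPort,
                          hdfsShareDir = paste("/user/RevoShare",
                                               Sys.info()[["user"]], sep = "/"),
                          consoleOutput = TRUE)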

Use an Azure Data Lake store with R Server

To use Data Lake stores with your HDInsight account, you need to give your cluster access to each Azure Data Lake store that you want to use. For instructions on how to use the Azure portal to create an HDInsight cluster with an Azure Data Lake Store account as the default storage or as an additional store, see Create an HDInsight cluster with Data Lake Store using Azure portal.

You then use the store in your R script much as you would a secondary Azure storage account, as described in the previous procedure.

Add cluster access to your Azure Data Lake stores

You access a Data Lake store by using an Azure Active Directory (Azure AD) Service Principal that's associated with your HDInsight cluster.

To add an Azure AD Service Principal:

  1. When you create your HDInsight cluster, select Cluster AAD Identity on the Data Source tab.

  2. In the Cluster AAD Identity dialog box, under Select AD Service Principal, select Create new.

  3. After you give the Service Principal a name and create a password for it, click Manage ADLS Access to associate the Service Principal with your Data Lake stores.

It’s also possible to add cluster access to one or more Data Lake stores following cluster creation. Open the Azure portal entry for a Data Lake store and go to Data Explorer > Access > Add.

How to access the Data Lake store from R Server

Once you’ve given access to a Data Lake store, you can use the store in R Server on HDInsight the way you would a secondary Azure storage account. The only difference is that the prefix wasb:// changes to adl:// as follows:

# Point to the ADL store (e.g. ADLtest)
myNameNode <- "adl://rkadl1.azuredatalakestore.net"
myPort <- 0

# Location of the data (assumes a /share directory on the ADL account)
bigDataDirRoot <- "/share"  

# Define Spark compute context
mySparkCluster <- RxSpark(consoleOutput=TRUE, nameNode=myNameNode, port=myPort)

# Set compute context
rxSetComputeContext(mySparkCluster)

# Define HDFS file system
hdfsFS <- RxHdfsFileSystem(hostName=myNameNode, port=myPort)

# Specify the input file in HDFS to analyze
inputFile <- file.path(bigDataDirRoot, "AirlineDemoSmall.csv")

# Create factors for days of the week
colInfo <- list(DayOfWeek = list(type = "factor",
           levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday")))

# Define the data source
airDS <- RxTextData(file = inputFile, missingValueString = "M",
                colInfo  = colInfo, fileSystem = hdfsFS)

# Run a linear regression
model <- rxLinMod(ArrDelay~CRSDepTime+DayOfWeek, data = airDS)
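
After the model runs, a natural follow-on step (not part of the original walkthrough) is to inspect the fit; base R's summary works on the object that rxLinMod returns:

# Print coefficient estimates and fit statistics for the fitted model
summary(model)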

The following commands are used to configure the Data Lake storage account with the RevoShare directory and add the sample .csv file from the previous example:

hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user
hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user/RevoShare
hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/user/RevoShare/<user>

hadoop fs -mkdir adl://rkadl1.azuredatalakestore.net/share

hadoop fs -copyFromLocal "/usr/lib64/R Server-7.4.1/library/RevoScaleR/SampleData/AirlineDemoSmall.csv" adl://rkadl1.azuredatalakestore.net/share

hadoop fs -ls adl://rkadl1.azuredatalakestore.net/share

Use Azure File storage with R Server

There is also a convenient data storage option for use on the edge node called [Azure Files](https://azure.microsoft.com/services/storage/files/). It enables you to mount an Azure Storage file share to the Linux file system. This option can be handy for storing data files, R scripts, and result objects that might be needed later, especially when it makes sense to use the native file system on the edge node rather than HDFS.
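
As an illustrative sketch, mounting a file share on the edge node typically uses the SMB/CIFS protocol along the following lines (the storage account name, share name, and mount point are placeholders, and the cifs-utils package must be installed first):

# Create a mount point and mount the share over SMB 3.0
sudo mkdir -p /mnt/myfileshare
sudo mount -t cifs //mystorageaccount.file.core.windows.net/myshare /mnt/myfileshare -o vers=3.0,username=mystorageaccount,password=<storage-account-key>,dir_mode=0777,file_mode=0777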

A major benefit of Azure Files is that the file shares can be mounted and used by any system that has a supported OS, such as Windows or Linux. For example, a file share can be used by another HDInsight cluster that you or someone on your team has, by an Azure VM, or even by an on-premises system. For more information, see the Azure Files documentation.

Next steps

Now that you understand the Azure storage options, use the following links to discover ways of getting data science tasks done with R Server on HDInsight.