Introduction to R Server and open-source R capabilities on HDInsight

Microsoft R Server (also known as Microsoft Machine Learning Server) is available as a deployment option when you create HDInsight clusters in Azure. The cluster type that provides this option is called R Server. This capability provides data scientists, statisticians, and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight.

R Server on HDInsight provides the latest capabilities for R-based analytics on datasets of virtually any size, loaded to either Azure Blob storage or Azure Data Lake Store. Because the R Server cluster is built on open-source R, the R-based applications you build can use any of the 8000+ open-source R packages. The routines in ScaleR, Microsoft’s big data analytics package included with R Server, are also available.

The edge node of a cluster provides a convenient place to connect to the cluster and to run your R scripts. With an edge node, you have the option of running the parallelized distributed functions of ScaleR across the cores of the edge node server. You can also run them across the nodes of the cluster by using ScaleR’s Hadoop MapReduce or Spark compute contexts.

The models or predictions that result from analysis can be downloaded for on-premises use. They can also be operationalized elsewhere in Azure, in particular through an Azure Machine Learning Studio web service.

Get started with R Server on HDInsight

To create an R Server cluster in Azure HDInsight, select the R Server cluster type when creating an HDInsight cluster using the Azure portal. The R Server cluster type includes R Server on the data nodes of the cluster and on an edge node, which serves as a landing zone for R Server-based analytics. See Getting Started with R Server on HDInsight for a walkthrough on how to create the cluster.

Why choose R Server in HDInsight?

R Server in HDInsight lets you use R Server on Azure. R Server includes:

  • The best AI innovation from Microsoft and open source

    Microsoft strives to make AI accessible to every individual and organization. R Server includes a rich set of highly scalable, distributed algorithms in packages such as RevoScaleR, revoscalepy, and MicrosoftML. These algorithms can work on data sizes larger than the size of physical memory and run on a wide variety of platforms in a distributed manner. Learn more about the collection of Microsoft's custom R packages and Python packages included with the product.

    R Server bridges these Microsoft innovations and those coming from the open source community (R, Python and AI toolkits) all on top of a single enterprise-grade platform. Any R or Python open source machine learning package can work side-by-side with any proprietary innovation from Microsoft.

  • Simple, secure, and high-scale operationalization and administration

    Enterprises that rely on traditional operationalization paradigms and environments end up investing much time and effort in this area. Pain points that often result in inflated costs and delays include the translation time for models, the iterations needed to keep them valid and current, regulatory approval, and managing permissions through operationalization.

    R Server offers best-in-class operationalization: from the time a machine learning model is completed, it takes just a few clicks to generate web service APIs. These web services are hosted on a server grid on-premises or in the cloud and can be integrated with line-of-business applications. The ability to deploy to an elastic grid lets you scale seamlessly with the needs of your business, both for batch and real-time scoring. For instructions, see Operationalize R Server on HDInsight.

  • Deep ecosystem engagements to deliver customer success with optimal total cost of ownership

    Individuals who are starting to make their applications intelligent, or who simply want to learn about AI and machine learning, need the right resources to help them get started. In addition to this documentation, Microsoft provides several learning resources and has engaged several training partners to help you ramp up and become productive quickly.

Key features of R Server on HDInsight

The following features are included in R Server on HDInsight.

Feature category      Description
R-enabled             R packages for solutions written in R, with an open source distribution of R and run-time infrastructure for script execution.
Python-enabled        Python modules for solutions written in Python, with an open source distribution of Python and run-time infrastructure for script execution.
Pre-trained models    For visual analysis and text sentiment analysis, ready to score data you provide.
Deploy and consume    Operationalize your server and deploy solutions as a web service.
Remote execution      Start remote sessions on R Server on your network from your client workstation.

Data storage options for R Server on HDInsight

Default storage for the HDFS file system of HDInsight clusters can be associated with either an Azure Storage account or an Azure Data Lake Store. This association ensures that whatever data is uploaded to the cluster storage during analysis is made persistent and the data is available even after the cluster is deleted. There are various tools for handling the data transfer to the storage option that you select, including the portal-based upload facility of the storage account and the AzCopy utility.

You have the option of adding access to additional Blob and Data Lake stores during the cluster provisioning process, regardless of the primary storage option in use. See Getting started with R Server on HDInsight for information on adding access to additional accounts. See the supplementary Azure Storage options for R Server on HDInsight article to learn more about using multiple storage accounts.

You can also use Azure Files as a storage option on the edge node. Azure Files enables you to mount a file share that was created in Azure Storage to the Linux file system. For more information about these data storage options for R Server on an HDInsight cluster, see Azure Storage options for R Server on HDInsight.

Access Machine Learning Server on the cluster

You can connect to Microsoft Machine Learning Server on the edge node using a browser. It is installed by default during cluster creation. For more information, see Get started with R Server on HDInsight. You can also connect to the ML Server from the command line by using SSH/PuTTY to access the R console.

Develop and run R scripts

The R scripts you create and run can use any of the 8000+ open-source R packages in addition to the parallelized and distributed routines available in the ScaleR library. In general, a script that is run with R Server on the edge node runs within the R interpreter on that node. The exceptions are those steps that call a ScaleR function with a compute context that is set to Hadoop MapReduce (RxHadoopMR) or Spark (RxSpark). In these cases, the function runs in a distributed fashion across those data (task) nodes of the cluster that are associated with the data referenced. For more information about the different compute context options, see Compute context options for R Server on HDInsight.
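As a rough illustration of the behavior described above, the following sketch switches to a Spark compute context for one ScaleR call and then returns to local execution. It assumes the RevoScaleR package that ships with R Server is installed; the HDFS path is a hypothetical placeholder, and the guard only exists so the script can be sourced on a machine without RevoScaleR.

```r
# Hypothetical HDFS path; replace it with a file in your cluster storage.
input_path <- "/example/data/sample.csv"

# RevoScaleR ships with R Server. The guard skips the example elsewhere.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Describe the input file as a ScaleR data source stored in HDFS.
  hdfs_fs <- RxHdfsFileSystem()
  input_data <- RxTextData(file = input_path, fileSystem = hdfs_fs)

  # ScaleR calls made after this point are distributed across the cluster.
  rxSetComputeContext(RxSpark())
  print(rxSummary(~ ., data = input_data))

  # Return to the local context so later steps run only on the edge node.
  rxSetComputeContext("local")
} else {
  message("RevoScaleR is not installed; skipping the compute context example.")
}
```

Ordinary R code in the same script, such as data frame manipulation or plotting, continues to run in the edge node's R interpreter regardless of the compute context.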

Operationalize a model

When your data modeling is complete, you can operationalize the model to make predictions for new data, either from Azure or on-premises. This process is known as scoring. Scoring can be done in HDInsight, Azure Machine Learning, or on-premises.

Score in HDInsight

To score in HDInsight, write an R function that calls your model to make predictions for a new data file that you've loaded to your storage account. Then, save the predictions back to the storage account. You can run this routine on-demand on the edge node of your cluster or by using a scheduled job.
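A minimal sketch of such a scoring function is shown below, using only base R. The function name, file names, and the stand-in `lm` model are illustrative assumptions; on a cluster the paths would point to your storage account, and a ScaleR model would typically be scored with `rxPredict` instead of `predict`.

```r
# Score a new data file with a saved model and write the predictions out.
# The function name and all paths are hypothetical.
score_new_data <- function(model_path, input_csv, output_csv) {
  model <- readRDS(model_path)        # load the trained model
  new_data <- read.csv(input_csv)     # read the data to score
  new_data$prediction <- predict(model, newdata = new_data)
  write.csv(new_data, output_csv, row.names = FALSE)
  invisible(new_data)
}

# Usage with a small stand-in model built from the built-in mtcars data.
work_dir <- tempdir()
model_file <- file.path(work_dir, "model.rds")
input_file <- file.path(work_dir, "new_data.csv")
output_file <- file.path(work_dir, "predictions.csv")

saveRDS(lm(mpg ~ wt + hp, data = mtcars), model_file)
write.csv(mtcars[1:5, c("wt", "hp")], input_file, row.names = FALSE)
scored <- score_new_data(model_file, input_file, output_file)
```

Keeping the routine in a single function makes it straightforward to call either interactively on the edge node or from a scheduled job.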

Score in Azure Machine Learning (AML)

To score using Azure Machine Learning, use the open-source Azure Machine Learning R package known as AzureML to publish your model as an Azure web service. For convenience, this package is pre-installed on the edge node. Next, use the facilities in Azure Machine Learning to create a user interface for the web service, and then call the web service as needed for scoring.

If you choose this option, you must convert any ScaleR model objects to equivalent open-source model objects for use with the web service. Use ScaleR coercion functions, such as as.randomForest() for ensemble-based models, for this conversion.

Score on-premises

To score on-premises after creating your model, you can serialize the model in R, download it, de-serialize it, and then use it for scoring new data. You can score new data by using the approach described earlier in Score in HDInsight or by using DeployR.
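The serialize and de-serialize steps can be sketched with base R's `saveRDS` and `readRDS`. The `lm` model and the file name here are stand-ins chosen for illustration; the download step between the two halves is not shown.

```r
# On the cluster: serialize the trained model to a file you can download.
model <- lm(mpg ~ wt, data = mtcars)   # stand-in for your trained model
model_file <- file.path(tempdir(), "trained_model.rds")
saveRDS(model, model_file)

# On-premises: de-serialize the downloaded file and score new data.
restored <- readRDS(model_file)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))
predictions <- predict(restored, newdata = new_cars)
print(predictions)
```

The on-premises R installation needs the same modeling packages that were used to train the model, because `predict` dispatches on the class of the restored object.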

Maintain the cluster

Install and maintain R packages

Most of the R packages that you use are required on the edge node since most steps of your R scripts run there. To install additional R packages on the edge node, you can use the usual install.packages() method in R.
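One way to keep edge node installs repeatable is to install only the packages that are missing. This is a small sketch in base R; the helper name `missing_packages` and the package list are hypothetical.

```r
# Return the packages from `pkgs` that are not yet installed on this node.
# The helper name is illustrative, not part of R Server.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical list of packages that a script needs.
needed <- c("stats", "utils")
to_install <- missing_packages(needed)

# Install only what is missing; nothing happens if everything is present.
if (length(to_install) > 0) {
  install.packages(to_install)
}
```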

If you are just using routines from the ScaleR library across the cluster, you do not usually need to install additional R packages on the data nodes. However, you might need additional packages to support the use of rxExec or rxDataStep execution on the data nodes.

In such cases, the additional packages can be installed with a script action after you create the cluster. For more information, see Manage R Server in HDInsight cluster.

Change Hadoop MapReduce memory settings

A cluster can be modified to change the amount of memory that is available to R Server when it is running a MapReduce job. To modify a cluster, use the Apache Ambari UI that's available through the Azure portal blade for your cluster. For instructions about how to access the Ambari UI for your cluster, see Manage HDInsight clusters using the Ambari Web UI.

It is also possible to change the amount of memory that is available to R Server by using Hadoop switches in the call to RxHadoopMR as follows:

hadoopSwitches = "-libjars /etc/hadoop/conf -Dmapred.job.map.memory.mb=6656"  
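For context, the sketch below shows where that argument could sit in a script. It builds the same switch string and passes it to `RxHadoopMR` when RevoScaleR is available; the variable names are illustrative, and the guard only exists so the snippet can be sourced on a machine without R Server.

```r
# Build the switch string; 6656 MB matches the value shown above.
map_memory_mb <- 6656L
hadoop_switches <- sprintf(
  "-libjars /etc/hadoop/conf -Dmapred.job.map.memory.mb=%d",
  map_memory_mb
)

# RevoScaleR ships with R Server. The guard skips the call elsewhere.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  mr_context <- RevoScaleR::RxHadoopMR(hadoopSwitches = hadoop_switches)
  RevoScaleR::rxSetComputeContext(mr_context)
}
```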

Scale your cluster

An existing R Server cluster on HDInsight can be scaled up or down through the portal. By scaling up, you can gain the additional capacity that you might need for larger processing tasks, or you can scale back a cluster when it is idle. For instructions about how to scale a cluster, see Manage HDInsight clusters.

Maintain the system

Maintenance to apply OS patches and other updates is performed on the underlying Linux VMs in an HDInsight cluster during off-hours. Typically, maintenance is done at 3:30 AM (based on the local time for the VM) every Monday and Thursday. Updates are performed in such a way that they don't impact more than a quarter of the cluster at a time.

Because the head nodes are redundant and not all data nodes are impacted at once, jobs that are running during this time might slow down but should still run to completion. Any custom software or local data that you have is preserved across these maintenance events unless a catastrophic failure occurs that requires a cluster rebuild.

IDE options for R Server on an HDInsight cluster

The Linux edge node of an HDInsight cluster is the landing zone for R-based analysis. Recent versions of HDInsight provide a default installation of RStudio Server on the edge node as a browser-based IDE. Use of RStudio Server as an IDE for the development and execution of R scripts can be considerably more productive than just using the R console.

Additionally, you can install a desktop IDE and use it to access the cluster through use of a remote MapReduce or Spark compute context. Options include Microsoft’s R Tools for Visual Studio (RTVS), RStudio, and Walware’s Eclipse-based StatET.

Lastly, you can access the R console on the edge node by typing R at the Linux command prompt after connecting via SSH or PuTTY. When using the console interface, it is convenient to run a text editor for R script development in another window, and cut and paste sections of your script into the R console as needed.

Pricing

The prices that are associated with an HDInsight cluster with R Server are structured similarly to the prices for the standard HDInsight clusters. They are based on the sizing of the underlying VMs across the name, data, and edge nodes, with the addition of a core-hour uplift. For more information about HDInsight pricing, see HDInsight pricing.

Next steps

To learn more about how to use R Server on HDInsight clusters, see the following topics: