Use sparklyr in SQL Server big data cluster

Article
11/18/2022

Applies to: SQL Server 2019 (15.x)

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

Sparklyr provides an R interface for Apache Spark. Sparklyr is a popular way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 Big Data Clusters using RStudio.

Prerequisites

Deploy a SQL Server 2019 big data cluster.

Install R and RStudio Desktop

Install and configure RStudio Desktop with the following steps:

If you are running on a Windows client, download and install R 3.6.3. Also, download and install RTools 3.5. Make sure to configure RTools binary folder on your PATH environment variable.

Warning

R version 4.x and sparklyr versions other that the one specified below are verified not to work as of SQL Server Big Data Clusters CU13.
Download and install RStudio Desktop. Optionally, all samples work on the R shell.
After installation completes, run the following commands inside of RStudio Desktop or R shell to install the required packages. When asked, confirm to compile packages from source.

install.packages("devtools")
devtools::install_github('rstudio/sparklyr', ref = 'v1.7.0', upgrade = 'always', repos = 'https://cran.microsoft.com/snapshot/2021-06-11/')

Connect to Spark in a big data cluster

You can use sparklyr to connect from a client to the big data cluster using Livy and the HDFS/Spark gateway.

In RStudio, create an R script and connect to Spark as in the following example:

Tip

For the <AZDATA_USERNAME> and <AZDATA_PASSWORD> values, use the username and password you set during the big data cluster deployment.

Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU 5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.

For the <IP> and <PORT> values, see the documentation on connecting to a big data cluster.

library(sparklyr)
library(dplyr)
library(DBI)

#Specify the Knox username and password
config <- livy_config(user = "<AZDATA_USERNAME>", password = "<AZDATA_PASSWORD>")

httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L))

sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)

Run sparklyr queries

After connecting to Spark, you can run sparklyr. The following example performs a query on the iris dataset using sparklyr:

iris_tbl <- copy_to(sc, iris)

iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")

iris_count

Distributed R computations

One feature of sparklyr is the ability to distribute R computations with spark_apply.

Because big data clusters use Livy connections, you must set packages = FALSE in the call to spark_apply. For more information, see the Livy section of the sparklyr documentation on distributed R computations. With this setting, you can only use the R packages that are already installed on your Spark cluster in the R code passed to spark_apply. The following example demonstrates this functionality:

iris_tbl %>% spark_apply(function(e) nrow(e), names = "nrow", group_by = "Species", packages = FALSE)

Next steps

For more information about big data clusters, see What are SQL Server 2019 Big Data Clusters.