Use sparklyr in SQL Server big data cluster
Sparklyr provides an R interface for Apache Spark. Sparklyr is a popular way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 big data cluster (preview) using RStudio.
Install RStudio Desktop
Install and configure RStudio Desktop with the following steps:
If you are running on a Windows client, download and install R 3.4.4.
Download and install RStudio Desktop.
After installation completes, run the following commands inside of RStudio Desktop to install the required packages:
install.packages("DBI", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
install.packages("dplyr", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
install.packages("sparklyr", repos = "https://cran.microsoft.com/snapshot/2019-01-01")
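Before moving on, it can help to confirm that the three packages actually installed. The following is a minimal sketch using only base R; the variable names `required`, `installed`, and `missing_pkgs` are illustrative and not part of the original steps.

```r
# Packages that the rest of this article relies on
required <- c("DBI", "dplyr", "sparklyr")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If any package is reported as missing, rerunning the corresponding install.packages command is a reasonable first step.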
Connect to Spark in a big data cluster
You can use sparklyr to connect from a client to the big data cluster using Livy and the HDFS/Spark gateway.
In RStudio, create an R script and connect to Spark as in the following example:
For the <username> and <password> values, use the username (such as root) and password you set during the big data cluster deployment. For the <IP> and <PORT> values, see the documentation on connecting to a big data cluster.
library(sparklyr)
library(dplyr)
library(DBI)

# Specify the Knox username and password
config <- livy_config(user = "<username>", password = "<password>")

# Turn off SSL certificate and host verification for the gateway connection
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L))

# Connect to Spark through the Livy endpoint of the HDFS/Spark gateway
sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)
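One way to keep credentials and endpoint details out of the script itself is to read them from environment variables and build the gateway URL with a small helper. This is a sketch under stated assumptions: the helper `livy_master_url`, the environment variable names `BDC_USERNAME` and `BDC_PASSWORD`, and the address and port shown are all hypothetical placeholders, not values defined by the big data cluster.

```r
# Hypothetical helper: builds the Livy endpoint URL in the form used for `master`
livy_master_url <- function(ip, port) {
  sprintf("https://%s:%s/gateway/default/livy/v1", ip, port)
}

# Hypothetical environment variable names; Sys.getenv() returns "" when unset
knox_user <- Sys.getenv("BDC_USERNAME")
knox_password <- Sys.getenv("BDC_PASSWORD")

# Placeholder address and port, for illustration only
master <- livy_master_url("192.0.2.10", 30443)
master
```

The resulting `master`, `knox_user`, and `knox_password` values could then be passed to spark_connect and livy_config in place of the literal placeholders. When the session is no longer needed, spark_disconnect(sc) closes the connection.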
Run sparklyr queries
After connecting to Spark, you can run sparklyr queries. The following example copies the iris dataset to Spark and counts its rows with a SQL query:
iris_tbl <- copy_to(sc, iris)
iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")
iris_count
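The same row count can also be expressed with dplyr verbs rather than SQL. The sketch below defines a hypothetical helper, `count_rows`, and runs it against the local iris data frame so that it works without a cluster; it is an illustration, not part of the original example.

```r
library(dplyr)

# Hypothetical helper: counts rows with dplyr verbs instead of a SQL string
count_rows <- function(tbl) {
  tbl %>%
    summarise(n = n()) %>%
    collect()  # brings the result into R; leaves a local data frame unchanged
}

# Local run, no Spark connection required
local_count <- count_rows(iris)
local_count$n
```

With an open connection, count_rows(iris_tbl) should behave the same way, because sparklyr translates dplyr verbs into Spark SQL before running them on the cluster.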
Distributed R computations
Because big data clusters use Livy connections, you must set packages = FALSE in the call to spark_apply. For more information, see the Livy section of the sparklyr documentation on distributed R computations. With this setting, the R code passed to spark_apply can use only the R packages that are already installed on your Spark cluster. The following example demonstrates this functionality:
iris_tbl %>%
  spark_apply(function(e) nrow(e), names = "nrow", group_by = "Species", packages = FALSE)
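To preview what the function passed to spark_apply computes for each group, the same logic can be applied to the local iris data frame with base R. This is a local approximation for illustration only; the names `rows_per_group`, `groups`, and `local_result` are hypothetical, and nothing here runs on the cluster.

```r
# The same per-group function that the spark_apply call uses
rows_per_group <- function(e) nrow(e)

# Split the local iris data frame by Species and apply the function to each part
groups <- split(iris, iris$Species)
local_result <- vapply(groups, rows_per_group, integer(1))
local_result
```

Because the function uses only base R, it does not depend on any package being installed on the Spark cluster, which fits the packages = FALSE restriction.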
For more information about big data clusters, see What are SQL Server 2019 big data clusters.