Use sparklyr in a SQL Server 2019 big data cluster

Sparklyr provides an R interface for Apache Spark and is the preferred way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 big data cluster (preview) using RStudio.

Prerequisites

Connect to Spark in a SQL Server 2019 big data cluster

In RStudio, create an R script and connect to Spark as follows. In a big data cluster, sparklyr connects to Spark through Livy, which is reached through the HDFS/Spark gateway. For authentication, use the username and password that you set during deployment.

library(sparklyr)
library(dplyr)
library(DBI)

# Specify the Knox username and password set during deployment
config <- livy_config(username = "<USERNAME>", password = "<PASSWORD>")

# Skip TLS certificate verification for the gateway endpoint
httr::set_config(httr::config(ssl_verifypeer = 0L))

sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)
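
Hard-coding credentials in a script is easy to get wrong, so one option is to read them from environment variables and build the gateway URL from its parts. The following is a minimal sketch under stated assumptions: the helper names `livy_url` and `require_env` and the variable names `BDC_USER`, `BDC_PASSWORD`, `BDC_HOST`, and `BDC_PORT` are hypothetical and are not defined by sparklyr or by the big data cluster.

```r
# Hypothetical helper: builds the Livy endpoint URL behind the HDFS/Spark
# gateway, using the same path as the connection example above.
livy_url <- function(host, port) {
  sprintf("https://%s:%s/gateway/default/livy/v1", host, port)
}

# Hypothetical helper: returns the value of a required environment
# variable and stops with an error if it is unset or empty.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Usage sketch (needs a reachable cluster, so it is left commented out).
# The BDC_* variable names are assumptions chosen for this example.
# library(sparklyr)
# config <- livy_config(username = require_env("BDC_USER"),
#                       password = require_env("BDC_PASSWORD"))
# sc <- spark_connect(master = livy_url(require_env("BDC_HOST"),
#                                       require_env("BDC_PORT")),
#                     method = "livy",
#                     config = config)
```

Failing early on a missing variable tends to give a clearer error than a rejected login at the gateway.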

Run sparklyr queries

After connecting to Spark, you can run sparklyr queries. The following example copies the iris dataset to Spark and counts its rows with a SQL query:

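# Copy the local iris data frame into Spark as a table named "iris"
copy_to(sc, iris)

# Count the rows with a SQL query through the DBI interface
iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")

iris_count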
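
The same kind of query can also be written with dplyr verbs, which sparklyr translates to Spark SQL. The sketch below is an illustration rather than part of the original walkthrough: the function name `count_by_species` is hypothetical, and the commented usage assumes that `sc` is the open connection and that the `iris` table was copied as shown above.

```r
library(dplyr)

# Hypothetical helper: counts rows per species with dplyr verbs.
# On a Spark table reference the pipeline runs in Spark, and collect()
# brings the small result back into R. On a local data frame it runs in R.
count_by_species <- function(tbl_ref) {
  tbl_ref %>%
    group_by(Species) %>%
    summarise(n = n()) %>%
    collect()
}

# Usage against the cluster (assumes `sc` and the copied "iris" table):
# iris_tbl <- tbl(sc, "iris")
# count_by_species(iris_tbl)
# spark_disconnect(sc)  # close the Livy session when finished
```

Because the function accepts any dplyr table, it can be tried on the local `iris` data frame before running it against Spark.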

Next steps

For more information about big data clusters, see "What are SQL Server 2019 big data clusters?"