Use sparklyr in a SQL Server 2019 big data cluster
Sparklyr provides an R interface for Apache Spark and is the preferred way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 big data cluster (preview) from RStudio.
Connect to Spark in a SQL Server 2019 big data cluster
In RStudio, create an R script and connect to Spark as follows. In a big data cluster, connections to Spark go through Livy, which is reached through the HDFS/Spark gateway. For authentication, use the username and password that you set during deployment.
library(sparklyr)
library(dplyr)
library(DBI)

# Specify the Knox username and password
config <- livy_config(user = "***root***", password = "****")

# Skip SSL certificate verification for requests to the gateway
httr::set_config(httr::config(ssl_verifypeer = 0L))

# Connect to Spark through the Livy endpoint of the HDFS/Spark gateway
sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)
Run sparklyr queries
After connecting to Spark, you can run sparklyr queries. The following example copies the iris dataset to Spark and then counts its rows with a SQL query:
copy_to(sc, iris)
iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")
iris_count
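The same connection can also be queried with dplyr verbs instead of SQL. The following is a minimal sketch, not part of the original walkthrough: it assumes `sc` is the Livy connection created above, and the names `iris_tbl` and `species_summary` are illustrative. Note that sparklyr replaces the `.` in column names with `_` when it copies a data frame, so `Sepal.Length` is referenced as `Sepal_Length`.

```r
library(sparklyr)
library(dplyr)

# Assumes `sc` is the Livy connection created earlier with spark_connect().
# copy_to() returns a reference to the Spark table; overwrite = TRUE replaces
# the "iris" table if it was already copied in a previous step.
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# Group and aggregate in Spark. Column names use "_" instead of "."
# because sparklyr renames them during the copy.
species_summary <- iris_tbl %>%
  group_by(Species) %>%
  summarise(
    flowers = n(),
    mean_sepal_length = mean(Sepal_Length, na.rm = TRUE)
  ) %>%
  collect()  # run the query in Spark and return a local tibble

species_summary

# Close the connection when you are finished
spark_disconnect(sc)
```

The dplyr pipeline is translated to Spark SQL and runs on the cluster; only the aggregated result is brought back to the R session by `collect()`.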
For more information about big data clusters, see What are SQL Server 2019 big data clusters?