Data science deep dive: Using the RevoScaleR packages with SQL Server
This tutorial demonstrates how to use the enhanced R packages provided in R Services (In-Database) to work with SQL Server data and create scalable R solutions, by using the server as a compute context for high-performance big data analytics.
You learn how to create a remote compute context, move data between local and remote compute contexts, and execute R code on a remote SQL Server. You also learn how to analyze and plot data both locally and on the remote server, and how to create and deploy models.
This tutorial works specifically with SQL Server data on Windows, and uses in-database compute contexts. If you want to use R in other contexts, such as Teradata, Linux, or Hadoop, see these Microsoft R Server tutorials:
To experience the flexibility and processing power of the RevoScaleR package, in this tutorial you move data and swap compute contexts frequently. To illustrate, here are some of the tasks in this tutorial:
- Data is initially obtained from CSV files or XDF files. You import the data into SQL Server using the functions in the RevoScaleR package.
- Model training and scoring is performed using the SQL Server compute context.
- Use RevoScaleR functions to create new SQL Server tables to save your scoring results.
- Create plots both on the server and in the local compute context.
- Train a model on data in SQL Server database, running R in the SQL Server instance.
- Extract a subset of data and save it as an XDF file for re-use in analysis on your local workstation.
- Get new data for scoring, by opening an ODBC connection to the SQL Server database. Scoring is done on the local workstation.
- Create a custom R function and run it in the server compute context to perform a simulation.
Article list and time required
This tutorial takes about 75 minutes to complete, not including setup.
- Work with SQL Server data using R
- Create SQL Server data objects using RxSqlServerData
- Query and modify the SQL Server data
- Define and use compute contexts
- Create and run R Scripts
- Visualize SQL Server Data using R
- Create R models
- Score new data
- Transform data using R
- Load Data into Memory using rxImport
- Create new SQL Server tables using rxDataStep
- Perform chunking analysis using rxDataStep
- Analyze data in local compute context
- Move data from SQL Server using XDF files
- Create a simple simulation
This tutorial is intended for data scientists or for people who are already somewhat familiar with R, and with data science tasks such as summaries and model creation. However, all the code is provided, so even if you are new to R, you can run the code and follow along, assuming you have the required server and client environments.
You should also be comfortable with Transact-SQL syntax and know how to access a SQL Server database using tools such as these:
- SQL Server Management Studio
- Database tools in Visual Studio
- The free mssql extension for Visual Studio Code.
Save your R workspace between lessons, so that you can easily pick up where you left off.
SQL Server with support for R
Install SQL Server 2016 and enable R Services (in-Database). Or, install SQL Server 2017 and enable Machine Learning Services and choose the R language.
To run the queries used to train the model, you must have db_datareader privileges on the database where the data is stored. To run R, your user must have the permission, EXECUTE ANY EXTERNAL SCRIPT.
Data science development computer
You must install the RevoScaleR packages and related providers on the computer used as the R development environment. The easiest way to do this is to install Microsoft R Client, Microsoft R Server (Standalone), or Machine Learning Server (Standalone).
Other versions of Revolution R Enterprise or Revolution R Open are not supported.
An open source distribution of R cannot be used in this tutorial, because only the RevoScaleR functions can use remote compute contexts.
Additional R Packages
In this tutorial, you install the following packages: dplyr, ggplot2, ggthemes, reshape2, and e1071. Instructions are provided as part of the tutorial.
All packages must be installed in two places: on the computer you use for R solution development, and on the SQL Server computer where R scripts are executed. If you do not have permission to install packages on the server computer, ask an administrator.
Do not install the packages to a user library. The packages must be installed in the R package library that is used by the SQL Server instance.
For more information, see Prerequisites for Data Science Walkthroughs.