Data Science Deep Dive: Using the RevoScaleR Packages
This tutorial demonstrates how to use the enhanced R packages provided in R Services (In-Database) to work with SQL Server data and create scalable R solutions, by using the server as a compute context for high-performance big data analytics.
You will learn how to create a remote compute context, move data between local and remote compute contexts, and execute R code on a remote SQL Server. You will also learn how to analyze and plot data both locally and on the remote server, and how to create and deploy models.
This tutorial works specifically with SQL Server data on Windows, and uses in-database compute contexts. If you want to use R in other contexts, such as Teradata, Linux, or Hadoop, see these Microsoft R Server tutorials:
To illustrate the flexibility and processing power of the ScaleR packages, in this tutorial you'll move data and swap compute contexts frequently.
- Data is initially obtained from CSV files or XDF files. You'll import the data into SQL Server using the functions in the RevoScaleR package.
- Model training and scoring will be performed in the SQL Server compute context. You'll create new SQL Server tables, using the rx functions, to save your scoring results.
- You'll create plots both on the server and in the local compute context.
- To train the model, you will use data already stored in a SQL Server database. All computations will be performed on the SQL Server instance.
- You'll extract a subset of data and save it as an XDF file for re-use in analysis on your local workstation.
- New data used during the scoring process is extracted from the SQL Server database using an ODBC connection. All computations are performed on the local workstation.
- Finally, you'll perform a simulation based on a custom R function, using the server compute context.
Get Started Now
This tutorial takes about 75 minutes to complete, not including setup.
- Work with SQL Server Data using R
- Create SQL Server Data Objects using RxSqlServerData
- Query and Modify the SQL Server Data
- Define and Use Compute Contexts
- Create and Run R Scripts
- Visualize SQL Server Data using R
- Create R Models
- Score New Data
- Transform Data Using R
- Load Data into Memory using rxImport
- Create New SQL Server Table using rxDataStep
- Perform Chunking Analysis using rxDataStep
- Analyze Data in Local Compute Context;
- Move Data between SQL Server and XDF File
- Create a Simple Simulation
This tutorial is intended for data scientists or for people who are already somewhat familiar with R and data science tasks including exploration, statistical analysis, and model tuning. However, all the code is provided, so even if you are new to R, you can easily run the code and follow along, assuming you have the required server and client environments.
You should also be comfortable with Transact-SQL syntax and know how to access a SQL Server database using SQL Server Management Studio or other database tools, such as Visual Studio.
Save your R workspace between lessons, so that you can easily pick up where you left off.
SQL Server with support for R
Install SQL Server 2016 and enable R Services (in-Database). Or, install SQL Server 2017 and enable Machine Learning Services and choose the R language. The setup process is described in SQL Server 2016 Books Online.
To run the queries used to train the model, you must have db_datareader privileges on the database where the data is stored. To run R, your user must have the permission, EXECUTE ANY EXTERNAL SCRIPT.
Data science development computer
You must also install the RevoScaleR packages and related providers in your R development environment. The easiest way to do this is to install Microsoft R Client or Microsoft R Server (Standalone). For more information, see Set Up a Data Science Client
Other versions of Revolution R Enterprise or Revolution R Open are not supported.
An open source distribution of R, such as R 3.2.2, will not work in this tutorial, because only the RevoScaleR functions can use remote compute contexts.
Additional R Packages
For this tutorial, you will need to install the following packages: dplyr, ggplot2, ggthemes, reshape2, and e1071. Instructions are provided as part of the tutorial.
All packages must be installed in two places: on the computer you use for R solution development, and on the SQL Server computer where R scripts will be run. If you do not have permission to install packages on the server computer, ask an administrator. Do not install the packages to a user library. It is important that the packages be installed in the R package library that is used by the SQL Server instance.
For more information, see Prerequisites for Data Science Walkthroughs.