Tutorial: SQL development for R data scientists

APPLIES TO: yesSQL Server noAzure SQL Database noAzure SQL Data Warehouse noParallel Data Warehouse

In this tutorial for data scientists, learn how to build end-to-end solution for predictive modeling based on R feature support in either SQL Server 2016 or SQL Server 2017. This tutorial uses a NYCTaxi_sample database on SQL Server.

You use a combination of R code, SQL Server data, and custom SQL functions to build a classification model that indicates the probability that the driver might get a tip on a particular taxi trip. You also deploy your R model to SQL Server and use server data to generate scores based on the model.

This example can be extended to all kinds of real-life problems, such as predicting customer responses to sales campaigns, or predicting spending or attendance at events. Because the model can be invoked from a stored procedure, you can easily embed it in an application.

Because the walkthrough is designed to introduce R developers to R Services (In-Database), R is used wherever possible. However, this does not mean that R is necessarily the best tool for each task. In many cases, SQL Server might provide better performance, particularly for tasks such as data aggregation and feature engineering. Such tasks can particularly benefit from new features in SQL Server 2017, such as memory optimized columnstore indexes. We try to point out possible optimizations along the way.

Prerequisites

We recommend that you do this walkthrough on a client workstation. You must be able to connect, on the same network, to a SQL Server computer with SQL Server and the R language enabled. For instructions on workstation configuration, see Set up a data science client for R development.

Alternatively, you can run the walkthrough on a computer that has both SQL Server and an R development environment, but we don't recommend this configuration for a production environment. If you need to put client and server on the same computer, be sure to install a second set of Microsoft R libraries for sending R script from a "remote" client. Do not use the R libraries that are installed in the program files of the SQL Server instance. Specifically, if you are using one computer, you need the RevoScaleR library in both of these locations to support client and server operations.

  • C:\Program Files\Microsoft\R Client\R_SERVER\library\RevoScaleR
  • C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\library\RevoScaleR

Note

If you are using Machine Learning Server or the Data Science Virtual Machine, instead of R Client, the path to RevoScaleR is C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR

Additional R packages

This walkthrough requires several R libraries that are not installed by default as part of R Services (In-Database). You must install the packages both on the client where you develop the solution, and on the SQL Server computer where you deploy the solution.

On a client workstation

In your R environment, copy the following lines and execute the code in a Console window (Rgui or an IDE). Some packages also install required packages. In all, about 32 packages are installed. You must have an internet connection to complete this step.

# Install required R libraries, if they are not already installed.
if (!('ggmap' %in% rownames(installed.packages()))){install.packages('ggmap')}
if (!('mapproj' %in% rownames(installed.packages()))){install.packages('mapproj')}
if (!('ROCR' %in% rownames(installed.packages()))){install.packages('ROCR')}
if (!('RODBC' %in% rownames(installed.packages()))){install.packages('RODBC')}

On the server

You have several options for installing packages on SQL Server. For example, SQL Server provides R package management feature that lets database administrators create a package repository and assign user the rights to install their own packages. However, if you are an administrator on the computer, you can install new packages using R, as long as you install to the correct library.

Note

On the server, do not install to a user library even if prompted. If you install to a user library, the SQL Server instance cannot find or run the packages. For more information, see Installing new R Packages on SQL Server.

  1. On the SQL Server computer, open RGui.exe as an administrator. If you have installed SQL Server R Services using the defaults, Rgui.exe can be found in C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64).

  2. At an R prompt, run the following R commands:

install.packages("ggmap", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("mapproj", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("ROCR", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("RODBC", lib=grep("Program Files", .libPaths(), value=TRUE)[1])

This example uses the R grep function to search the vector of available paths and find the path that includes "Program Files". For more information, see https://www.rdocumentation.org/packages/base/functions/grep.

If you think the packages are already installed, check the list of installed packages by running installed.packages().

Next steps