教程:面向 R 数据科学家的 SQL 开发Tutorial: SQL development for R data scientists

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later

本教程主要面向数据科学家,介绍如何基于 SQL Server 2016 或 SQL Server 2017 中的 R 功能支持,生成用于预测建模的端到端解决方案。In this tutorial for data scientists, learn how to build end-to-end solution for predictive modeling based on R feature support in either SQL Server 2016 or SQL Server 2017. 本教程使用 SQL Server 上的 NYCTaxi_sample 数据库。This tutorial uses a NYCTaxi_sample database on SQL Server.

你将结合使用 R 代码、SQL ServerSQL Server 数据和自定义 SQL 函数生成一个分类模型,该模型指示司机在特定出租车行程中可能获得小费的概率。You use a combination of R code, SQL ServerSQL Server data, and custom SQL functions to build a classification model that indicates the probability that the driver might get a tip on a particular taxi trip. 你还会将 R 模型部署到 SQL ServerSQL Server,并使用服务器数据基于该模型生成分数。You also deploy your R model to SQL ServerSQL Server and use server data to generate scores based on the model.

此示例可以扩展到各类现实问题,例如预测客户对销售活动的反应,或预测各个活动的消费或出席率。This example can be extended to all kinds of real-life problems, such as predicting customer responses to sales campaigns, or predicting spending or attendance at events. 因为该模型可以从存储过程进行调用,所以可以轻松地将它嵌入到应用程序中。Because the model can be invoked from a stored procedure, you can easily embed it in an application.

本演练旨在向 R 开发人员介绍 R Services(数据库内)R Services (In-Database),因此会尽可能使用 R。Because the walkthrough is designed to introduce R developers to R Services(数据库内)R Services (In-Database), R is used wherever possible. 但这并不意味着 R 一定是每个任务的最佳工具。However, this does not mean that R is necessarily the best tool for each task. 在许多情况下, SQL ServerSQL Server 可能会提供更好的性能,特别是对于诸如数据聚合和特征工程这类任务。In many cases, SQL ServerSQL Server might provide better performance, particularly for tasks such as data aggregation and feature engineering. 这类任务可以特别受益于 SQL ServerSQL Server中的新功能,如内存优化的列存储索引。Such tasks can particularly benefit from new features in SQL ServerSQL Server, such as memory optimized columnstore indexes. 我们会尽量在此过程中指出可能的优化。We try to point out possible optimizations along the way.

先决条件Prerequisites

建议在客户端工作站上执行此演练。We recommend that you do this walkthrough on a client workstation. 你必须能够在同一网络上连接到启用了 SQL Server 和 R 语言的 SQL ServerSQL Server 计算机。You must be able to connect, on the same network, to a SQL ServerSQL Server computer with SQL Server and the R language enabled. 有关工作站配置的说明,请参阅设置用于 R 开发的数据科学客户端For instructions on workstation configuration, see Set up a data science client for R development.

或者,可以在同时具有 SQL ServerSQL Server 和 R 开发环境的计算机上运行此演练,但我们不建议在生产环境中使用此配置。Alternatively, you can run the walkthrough on a computer that has both SQL ServerSQL Server and an R development environment, but we don't recommend this configuration for a production environment. 如果需要将客户端和服务器放在同一台计算机上,请确保安装另一组 Microsoft R 库,以便从“远程”客户端发送 R 脚本。If you need to put client and server on the same computer, be sure to install a second set of Microsoft R libraries for sending R script from a "remote" client. 请勿使用安装在 SQL Server 实例的程序文件中的 R 库。Do not use the R libraries that are installed in the program files of the SQL Server instance. 具体而言,如果使用一台计算机,则需要在这两个位置都使用 RevoScaleR 库,以支持客户端和服务器操作。Specifically, if you are using one computer, you need the RevoScaleR library in both of these locations to support client and server operations.

  • C:\Program Files\Microsoft\R Client\R_SERVER\library\RevoScaleRC:\Program Files\Microsoft\R Client\R_SERVER\library\RevoScaleR
  • C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\library\RevoScaleRC:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\library\RevoScaleR

备注

如果使用的是 Machine Learning ServerData Science Virtual Machine,而不是 R 客户端,则 RevoScaleR 的路径为 C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleRIf you are using Machine Learning Server or the Data Science Virtual Machine, instead of R Client, the path to RevoScaleR is C:\Program Files\Microsoft\ML Server\R_SERVER\library\RevoScaleR

其他 R 包Additional R packages

本演练需要一些未默认作为 R Services(数据库内)R Services (In-Database)的一部分安装的 R 库。This walkthrough requires several R libraries that are not installed by default as part of R Services(数据库内)R Services (In-Database). 必须将这些包安装到用于开发解决方案的客户端上以及用于部署解决方案的 SQL ServerSQL Server 计算机上。You must install the packages both on the client where you develop the solution, and on the SQL ServerSQL Server computer where you deploy the solution.

在客户端工作站上On a client workstation

在 R 环境中,复制以下行并在控制台窗口(Rgui 或 IDE)中执行代码。In your R environment, copy the following lines and execute the code in a Console window (Rgui or an IDE). 某些包还会安装所需的包。Some packages also install required packages. 总共安装大约 32 个包。In all, about 32 packages are installed. 必须连接 Internet 才能完成此步骤。You must have an internet connection to complete this step.

# Install required R libraries, if they are not already installed.
if (!('ggmap' %in% rownames(installed.packages()))){install.packages('ggmap')}
if (!('mapproj' %in% rownames(installed.packages()))){install.packages('mapproj')}
if (!('ROCR' %in% rownames(installed.packages()))){install.packages('ROCR')}
if (!('RODBC' %in% rownames(installed.packages()))){install.packages('RODBC')}

在服务器上On the server

在 SQL Server 上安装包时有多个选项可选。You have several options for installing packages on SQL Server. 例如,SQL Server 提供 R 包管理功能,让数据库管理员可以创建包存储库,并为用户分配相应权限来安装自己的包。For example, SQL Server provides R package management feature that lets database administrators create a package repository and assign user the rights to install their own packages. 但是,如果你是计算机上的管理员,则可以使用 R 安装新包,只要安装到正确的库中即可。However, if you are an administrator on the computer, you can install new packages using R, as long as you install to the correct library.

备注

在服务器上,即使出现提示,也 不要 安装到用户库。On the server, do not install to a user library even if prompted. 如果安装到用户库,SQL Server 实例将找不到或无法运行包。If you install to a user library, the SQL Server instance cannot find or run the packages. 有关详细信息,请参阅 在 SQL Server 上安装新的 R 包For more information, see Installing new R Packages on SQL Server.

  1. SQL ServerSQL Server 计算机上,以管理员身份 打开 RGui.exe。On the SQL ServerSQL Server computer, open RGui.exe as an administrator. 如果已使用默认设置安装 SQL Server R Services,则可以在 C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64 中找到 Rgui.exe。If you have installed SQL Server R Services using the defaults, Rgui.exe can be found in C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64).

  2. 在 R 提示符处,运行下面的 R 命令:At an R prompt, run the following R commands:

install.packages("ggmap", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("mapproj", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("ROCR", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("RODBC", lib=grep("Program Files", .libPaths(), value=TRUE)[1])

此示例使用 R grep 函数搜索可用路径的矢量,并查找包含“Program Files”的路径。This example uses the R grep function to search the vector of available paths and find the path that includes "Program Files". 有关详细信息,请参阅 https://www.rdocumentation.org/packages/base/functions/grepFor more information, see https://www.rdocumentation.org/packages/base/functions/grep.

如果认为包已安装,请通过运行 installed.packages() 来查看已安装包的列表。If you think the packages are already installed, check the list of installed packages by running installed.packages().

后续步骤Next steps