在 SQL Server 中通过 R 进行数据浏览和预测性建模Data exploration and predictive modeling with R in SQL Server

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later

本文介绍通过与 SQL Server 集成可以实现的数据科学过程的改进。This article describes improvements to the data science process that are possible through integration with SQL Server.

数据科学过程The Data Science Process

数据科学家经常使用 R 来浏览数据并构建预测模型。Data scientists often use R to explore data and build predictive models. 此过程通常需要反复经历各种尝试和错误才能获得良好的预测模型。This is typically an iterative process of trial and error until a good predictive model is reached. 有经验的数据科学家会使用 RODBC 包连接到 SQL ServerSQL Server 数据库并将数据提取到本地工作站,然后对数据进行浏览,最后再使用标准 R 包构建预测模型。As an experienced data scientist, you might connect to the SQL ServerSQL Server database and fetch the data to your local workstation using the RODBC package, explore your data, and build a predictive model using standard R packages.

但是,这种方法有很多缺点,这影响了 R 在企业中的广泛采用。However, this approach has many drawbacks, that hae hindered the wider adoption of R in the enterprise.

  • 数据移动可能很慢、效率低下或不安全Data movement can be slow, inefficient, or insecure
  • R 本身存在性能和缩放限制R itself has performance and scale limitations

当你需要移动和分析大量数据,或者使用的数据集不适合计算机上可供使用的内存时,这些缺点就会变得更明显。These drawbacks become more apparent when you need to move and analyze large amounts of data, or use data sets that don't fit into the memory available on your computer.

你可以使用 R Services(数据库内)R Services (In-Database) 随附的新的可缩放包和 R 函数来克服这些困难。The new, scalable packages and R functions included with R Services(数据库内)R Services (In-Database) help you overcome many of these challenges.

RevoScaleR 有什么不同?What's Different about RevoScaleR?

RevoScaleR 包包含部分最常用 R 函数的实现,这些实现在经过重新设计后可提供并行度和缩放功能。The RevoScaleR package contains implementations of some of the most popular R functions, which have been redesigned to provide parallelism and scale. 有关详细信息,请参阅使用 RevoScaleR 进行分布式计算For more information, see Distributed Computing using RevoScaleR.

RevoScaleR 包还允许更改 执行上下文The RevoScaleR package also provides support for changing execution context. 这意味着,不管是完整解决方案还是单个函数,都应使用托管 SQL ServerSQL Server 实例的计算机的资源进行计算,而不应使用本地工作站。What this means is that, for an entire solution or for just one function, you can indicate that computations should be performed using the resources of the computer that hosts the SQL ServerSQL Server instance, rather than your local workstation. 这样做有很多好处:避免不必要的数据移动,并可利用服务器计算机上更强大的计算资源。There are multiple advantages to doing this: you avoid unnecessary data movement, and you can leverage greater computation resources on the server computer.

R 环境和包R Environment and Packages

R Services(数据库内)R Services (In-Database) 中支持的 R 环境包含运行时、开源语言,以及多个包所支持和扩展的图形引擎。The R environment supported in R Services(数据库内)R Services (In-Database) consists of a runtime, the open source language, and a graphical engine supported and extended by multiple packages. 该语言允许使用通过包实现的多种扩展。The language allows a variety of extensions that are implemented using packages.

使用其他 R 包Using Other R Packages

除 Microsoft 机器学习附带的专有 R 库之外,你几乎可以在解决方案中使用任何 R 包,包括:In addition to the proprietary R libraries included with Microsoft Machine Learning, you can use almost any R packages in your solution, including:

  • 公共存储库提供的通用 R 包。General purpose R packages from public repositories. 你可以从公共存储库(例如 CRAN,托管 6000 多个可供数据科学家使用的包)获取最常用的开源 R 包。You can obtain the most popular open source R packages from public repositories such as CRAN, which hosts has over 6000 packages that can be used by data scientists.

    对于 Windows 平台,R 包以 zip 文件形式提供,可以在获得 GPL 许可的情况下下载和安装。For the Windows platform, R packages are provided as zip files and can be downloaded and installed under the GPL license.

    有关如何安装适用于 R Services(数据库内)R Services (In-Database)的第三方包的信息,请参阅 Install Additional R Packages on SQL ServerFor information about how to install third-party packages for use with R Services(数据库内)R Services (In-Database), see Install Additional R Packages on SQL Server

  • R Services(数据库内)R Services (In-Database)提供的其他包和库。Additional packages and libraries provided by R Services(数据库内)R Services (In-Database).

    RevoScaleR 包包括高性能大数据分析、支持常用数据科学任务的改进版函数、适用于 Naive Bayes 的优化学习模型、线性回归、时序模型、神经网络以及高级数学库。The RevoScaleR package includes high performance big data analytics, improved versions of functions that support common data science tasks, optimized learners for Naive Bayes, linear regression, time series models, and neural networks, and advanced math libraries.

    RevoPemaR 包允许你在 R 中开发自己的并行外部存储器算法。The RevoPemaR package lets you develop your own parallel external memory algorithms in R.

    有关这些包及其使用方法的详细信息,请参阅什么是 RevoScaleR RevoPemaR 入门For more information about these packages and how to use them, see What is RevoScaleR and Get started with RevoPemaR.

  • MicrosoftML 包含 Microsoft 数据科学团队提供的高度优化的机器学习算法和数据转换的集合 。MicrosoftML contains a collection of highly optimized machine learning algorithms and data transformations from the Microsoft Data Science team. Azure 机器学习还使用了许多算法。Many of the algorithms are also used in Azure Machine Learning. 有关详细信息,请参阅 SQL Server 中的 MicrosoftMLFor more information, see MicrosoftML in SQL Server.

R 开发工具R Development Tools

开发 R 解决方案时,请确保下载 Microsoft R Client。When developing your R solution, be sure to download Microsoft R Client. 此免费下载包括支持远程计算上下文和可缩放算法所需的库:This free download includes the libraries needed to support remote compute contexts and scalable alorithms:

  • Microsoft R OpenMicrosoft R Open R 运行时的一个分发版以及一组包(例如 Intel 数学内核库),用于提升标准 R 操作的性能。Microsoft R OpenMicrosoft R Open: A distribution of the R runtime and a set of packages, such as the Intel math kernel library, that boost the performance of standard R operations.

  • RevoScaleR: 一个可用来将计算推送到 SQL ServerSQL Server 实例的 R 包。RevoScaleR: An R package that lets you push computations to an instance of SQL ServerSQL Server. Microsoft R EnterpriseMicrosoft R Enterprise 列中的一个值匹配。. 它还包括一组常用 R 函数,这些函数在重新设计后具有更好的性能和可伸缩性。It also includes a set of common R functions that have been redesigned to provide better performance and scalability. 你可以通过 rx 前缀来标识这些性能已改善的函数。You can identify these improved functions by the rx prefix. 它还包括了针对各种源的增强数据提供程序;这些函数具有前缀 RxIt also includes enhanced data providers for a variety of sources; these functions are prefixed with Rx.

你可以使用支持 R 的任何基于 Windows 的代码编辑器,例如 用于 Visual Studio 的 R 工具R Tools for Visual Studio 或 RStudio。You can use any Windows-based code editor that supports R, such as 用于 Visual Studio 的 R 工具R Tools for Visual Studio or RStudio. Microsoft R OpenMicrosoft R Open 的下载包还包括 R 的常用命令行工具,例如 RGui.exe。The download of Microsoft R OpenMicrosoft R Open also includes common command-line tools for R such as RGui.exe.

使用新数据源和计算上下文Use New Data Sources and Compute Contexts

通过 RevoScaleR 包连接到 SQL ServerSQL Server 时,请查看在 R 代码中使用的这三个函数:When using the RevoScaleR package to connect to SQL ServerSQL Server, look for these functions to use in your R code:

  • RxSqlServerData 是 RevoScaleR 包中提供的函数,支持通过改进型数据连接来连接到 SQL ServerSQL Server随附的新的可伸缩包和 R 函数来克服这些困难。RxSqlServerData is a function provided in the RevoScaleR package to support improved data connectivity to SQL ServerSQL Server.

    可以在 R 代码中使用此函数来定义 数据源You use this function in your R code to define the data source. 数据源对象指定数据所在的服务器和表,并管理在 SQL ServerSQL Server中读写数据的任务。The data source object specifies the server and tables where the data resides and manages the task of reading data from and writing data to SQL ServerSQL Server.

  • RxInSqlServer 函数可用来指定计算上下文。 The RxInSqlServer function can be used to specify the compute context. 换言之,你可以指定执行 R 代码的位置:本地工作站或托管 SQL ServerSQL Server 实例的计算机。In other words, you can indicate where the R code should be executed: on your local workstation, or on the computer that hosts the SQL ServerSQL Server instance. 有关详细信息,请参阅 RevoScaleR 函数For more information, see RevoScaleR Functions.

    当你设置计算上下文时,该上下文仅影响支持远程执行上下文的计算,即 RevoScaleR 包及相关函数提供的 R 操作。When you set the compute context, it affects only computations that support remote execution context, which means R operations provided by the RevoScaleR package and related functions. 通常,基于标准 CRAN 包的 R 解决方案不能在远程计算上下文中运行,但如果是由 T-SQL 启动的,则它们可以在 SQL ServerSQL Server 计算机上运行。Typically, R solutions based on standard CRAN packages cannot run in a remote compute context, though they can be run on the SQL ServerSQL Server computer if started by T-SQL. 不过,你可以使用 rxExec 函数调用各个 R 函数并在 SQL ServerSQL Server 中远程运行它们。However, you can use the rxExec function to call individual R functions and run them remotely in SQL ServerSQL Server.

有关如何创建和使用数据源和执行上下文的示例,请参阅以下教程:For examples of how to create and work with data sources and execution contexts, see these tutorials:

将 R 代码部署到生产环境Deploy R Code to Production

数据科学的一个重要部分是将你的分析提供给他人,或者使用预测模型来改善业务成果或流程。An important part of data science is providing your analyses to others, or using predictive models to improve business results or processes. R Services(数据库内)R Services (In-Database)中,当 R 脚本或模型就绪以后,可以轻松地转移到生产。In R Services(数据库内)R Services (In-Database), it is easy to move to production when your R script or model is ready.

有关如何将代码转移到 SQL ServerSQL Server的第三方包的信息,请参阅 Operationalizing Your R Code随附的新的可伸缩包和 R 函数来克服这些困难。For more information about how you can move your code to run in SQL ServerSQL Server, see Operationalizing Your R Code.

通常情况下,在开始部署时会对脚本进行清除,去掉生产过程中不需要的脚本。Typically the deployment process begins with cleaning up your script to eliminate code that is not needed in production. 将计算移近数据时,可能会找到比 R 中的所有操作更有效地移动、汇总或呈现数据的方法。建议数据科学家就改善性能的方法咨询数据库开发人员,尤其是在解决方案进行可能在 SQL 中更有效的数据清理或功能设计时。As you move computations closer to the data, you might find ways to more efficiently move, summarize, or present data than doing everything in R. We recommend that the data scientist consult with a database developer about ways to improve performance, especially if the solution does data cleansing or feature engineering that might be more effective in SQL. 可能需要更改 ETL 流程来确保模型的构建或评分工作流不会失败,并可能需要以正确格式提供输入数据。Changes to ETL processes might be needed to ensure that workflows for building or scoring a model don't fail, and that input data is available in the right format.

另请参阅See Also

Base R 函数和 RevoScaleR 函数之对比Comparison of Base R and RevoScaleR Functions

SQL Server 中的 RevoScaleR 库RevoScaleR library in SQL Server