转换 R 代码以在 SQL Server(数据库内)实例中执行Convert R code for execution in SQL Server (In-Database) instances

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance

本文提供了有关如何修改 R 代码以使其可在 SQL Server 中运行的高级指南。This article provides high-level guidance on how to modify R code to work in SQL Server.

将 R 代码从 R Studio 或其他环境移到 SQL Server 时,大多数情况下代码不需要进一步修改即可运行,比如说代码很简单时(例如要求某些输入并返回值的函数)。When you move R code from R Studio or another environment to SQL Server, most often the code works without further modification: for example, if the code is simple, such as a function that takes some inputs and returns a value. 此外,移植使用 RevoScaleR 或 MicrosoftML 包的解决方案也更容易,这些包支持在不同的执行上下文中运行(只需稍加修改) 。It is also easier to port solutions that use the RevoScaleR or MicrosoftML packages, which support execution in different execution contexts with minimal changes.

但是,如果以下任一条件成立,则你可能需要对代码进行大量更改:However, your code might require substantial changes if any of the following apply:

  • 使用访问网络或无法在 SQL Server 上安装的 R 库。You use R libraries that access the network or that cannot be installed on SQL Server.
  • 代码对 SQL Server 之外的数据源进行单独调用,如 Excel 工作表、共享位置上的文件和其他数据库。The code makes separate calls to data sources outside SQL Server, such as Excel worksheets, files on shares, and other databases.
  • 要运行 sp_execute_external_script 的 @script 参数中的代码,并同时参数化存储过程 。You want to run the code in the @script parameter of sp_execute_external_script and also parameterize the stored procedure.
  • 最初的解决方案包含多个在生产环境单独执行时效率可能更高的步骤,如数据准备、特征工程与模型定型、评分或报告。Your original solution includes multiple steps that might be more efficient in a production environment if executed independently, such as data preparation or feature engineering vs. model training, scoring, or reporting.
  • 想要通过更改库、使用并行执行或将某些处理任务卸载到 SQL Server 来优化性能。You want to optimize performance by changing libraries, using parallel execution, or offloading some processing to SQL Server.

步骤 1。Step 1. 规划要求和资源Plan requirements and resources

Packages

  • 确定所需的包,并确保它们可用于 SQL Server 上。Determine which packages are needed and ensure that they work on SQL Server.

  • 预先在机器学习服务使用的默认包库中安装包。Install packages in advance, in the default package library used by Machine Learning Services. 用户库不受支持。User libraries are not supported.

数据源Data sources

  • 如果要在 sp_execute_external_script 中嵌入 R 代码,则标识主数据源和辅助数据源。If you intend to embed your R code in sp_execute_external_script, identify primary and secondary data sources.

    • 主数据源是大型数据集,例如模型定型数据或用于预测的输入数据 。Primary data sources are large datasets, such as model training data, or input data for predictions. 计划将最大的数据集映射到 sp_execute_external_script 的输入参数。Plan to map your largest dataset to the input parameter of sp_execute_external_script.

    • 辅助数据源通常是较小的数据集,例如系数列表或其他分组变量 。Secondary data sources are typically smaller data sets, such as lists of factors, or additional grouping variables.

    目前,sp_execute_external_script 仅支持单个数据集作为存储过程的输入。Currently, sp_execute_external_script supports only a single dataset as input to the stored procedure. 但是,可以添加多个标量或二进制输入。However, you can add multiple scalar or binary inputs.

    前面带有 EXECUTE 的存储过程调用不能用作 sp_execute_external_script 的输入。Stored procedure calls preceded by EXECUTE cannot be used as an input to sp_execute_external_script. 你可以使用查询、视图或任何其他有效的 SELECT 语句。You can use queries, views, or any other valid SELECT statement.

  • 确定所需的输出。Determine the outputs you need. 如果使用 sp_execute_external_script 运行 R 代码,存储过程只能输出一个数据帧作为结果。If you run R code using sp_execute_external_script, the stored procedure can output just one data frame as a result. 但是,也可以输出多个标量输出,包括二进制格式的绘图和模型,以及从 R 代码或 SQL 参数得到的其他标量值。However, you can also output multiple scalar outputs, including plots and models in binary format, as well as other scalar values derived from R code or SQL parameters.

数据类型Data types

  • 针对可能的数据类型问题创建一份查检表。Make a checklist of possible data type issues.

    SQL Server 机器学习服务支持所有 R 数据类型。All R data types are supported by SQL Server machine Learning Services. 但是,SQL ServerSQL Server 支持的数据类型比 R 更广泛。所以,在将 SQL ServerSQL Server 数据发送到 R 时,会执行某些隐式数据类型转换,反之亦然。However, SQL ServerSQL Server supports a greater variety of data types than does R. Therefore, some implicit data type conversions are performed when sending SQL ServerSQL Server data to R, and vice versa. 可能需要显式强制转换或转换某些数据。You might need to explicitly cast or convert some data.

    支持 NULL 值。NULL values are supported. 但是,R 使用 na 数据构造来表示缺失值(类似于 null 值)。However, R uses the na data construct to represent a missing value, which is similar to a null.

  • 考虑消除对 R 不能使用的数据的依赖:例如,SQL Server 的 rowid 和 GUID 数据类型不能由 R 使用,会生成错误。Consider eliminating dependency on data that cannot be used by R: for example, rowid and GUID data types from SQL Server cannot be consumed by R and generate errors.

    有关详细信息,请参阅 R 库和数据类型For more information, see R Libraries and Data Types.

步骤 2.Step 2. 转换或重新打包代码Convert or repackage code

更改代码的程度取决于你是要从远程客户端提交 R 代码以在 SQL Server 计算上下文中运行,还是要将代码部署为存储过程的一部分以获得更好的性能和数据安全性。How much you change your code depends on whether you intend to submit the R code from a remote client to run in the SQL Server compute context, or intend to deploy the code as part of a stored procedure, which can provide better performance and data security. 将代码包装在存储过程中会有一些额外的要求。Wrapping your code in a stored procedure imposes some additional requirements.

  • 为避免数据移动,应尽可能将主输入数据定义为 SQL 查询。Define your primary input data as a SQL query wherever possible, to avoid data movement.

  • 在存储过程中运行 R 时,可以传递多个标量输入 。When running R in a stored procedure, you can pass through multiple scalar inputs. 对于要在输出中使用的任何参数,请添加 OUTPUT 关键字 。For any parameters that you want to use in the output, add the OUTPUT keyword.

    例如,下面的标量输入 @model_name 包含模型名称,在结果中该模型名称也是该列自身中的输出:For example, the following scalar input @model_name contains the model name, which is also output in its own column in the results:

    EXEC sp_execute_external_script @model_name="DefaultModel" OUTPUT, @language=N'R', @script=N'R code here'
    
  • 作为 sp_execute_external_script 存储过程的参数传入的任何变量必须映射到 R 代码中的变量。Any variables that you pass in as parameters of the stored procedure sp_execute_external_script must be mapped to variables in the R code. 默认情况下,变量按名称映射。By default, variables are mapped by name.

    输入数据集中的所有列必须也映射到 R 脚本中的变量。All columns in the input dataset must also be mapped to variables in the R script. 例如,假设 R 脚本包含如下所示的公式:For example, assume your R script contains a formula like this one:

    formula <- ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
    

    如果输入数据集不包含具有匹配名称 ArrDelay、CRSDepTime、DayOfWeek、CRSDepHour 和 DayOfWeek 的列,则会引发错误。An error is raised if the input dataset does not contain columns with the matching names ArrDelay, CRSDepTime, DayOfWeek, CRSDepHour, and DayOfWeek.

  • 在某些情况下,必须提前为结果定义输出架构。In some cases, an output schema must be defined in advance for the results.

    例如,若要将数据插入到表中,则必须使用 WITH RESULT SET 子句来指定架构 。For example, to insert the data into a table, you must use the WITH RESULT SET clause to specify the schema.

    如果 R 脚本使用 @parallel=1 参数,则还需要输出架构。The output schema is also required if the R script uses the argument @parallel=1. 原因是 SQL Server 可能会创建多个进程来并行运行查询,最后收集结果。The reason is that multiple processes might be created by SQL Server to run the query in parallel, with the results collected at the end. 因此,必须先准备输出架构,然后才能创建并行进程。Therefore, the output schema must be prepared before the parallel processes can be created.

    在其他情况下,你可以通过使用“WITH RESULT SETS UNDEFINED”选项来省略结果架构 。In other cases, you can omit the result schema by using the option WITH RESULT SETS UNDEFINED. 此语句从 R 脚本返回数据集,不需命名列或指定 SQL 数据类型。This statement returns the dataset from the R script without naming the columns or specifying the SQL data types.

  • 考虑使用 T-SQL(而不使用 R)生成计时或跟踪数据。Consider generating timing or tracking data using T-SQL rather than R.

    例如,你可以通过添加传递给结果的 T-SQL 调用来传递用于审核和存储的系统时间或其他信息,而不是在 R 脚本中生成类似的数据。For example, you could pass the system time or other information used for auditing and storage by adding a T-SQL call that is passed through to the results, rather than generating similar data in the R script.

提高性能和安全性Improve performance and security

  • 避免将预测或中间结果写入文件。Avoid writing predictions or intermediate results to file. 改为将预测写入表,以避免数据移动。Write predictions to a table instead, to avoid data movement.
  • 提前运行所有查询,并检查 SQL Server 查询计划以确定可以并行执行的任务。Run all queries in advance, and review the SQL Server query plans to identify tasks that can be performed in parallel.

    如果输入查询可并行化,请将 @parallel=1 作为参数的一部分设置为 sp_execute_external_scriptIf the input query can be parallelized, set @parallel=1 as part of your arguments to sp_execute_external_script.

    只要 SQL Server 可以处理分区表或者在多个进程之间分布查询并最终聚合结果,就通常可以使用此标志进行并行处理。Parallel processing with this flag is typically possible any time that SQL Server can work with partitioned tables or distribute a query among multiple processes and aggregate the results at the end. 如果用于训练模型的算法需要读取所有数据,或者你需要创建聚合,则通常无法使用此标志进行并行处理。Parallel processing with this flag is typically not possible if you are training models using algorithms that require all data to be read, or if you need to create aggregates.

  • 检查 R 代码,确定是否可以独立执行某些步骤,或者可以使用单独的存储过程调用来更有效地执行某些步骤。Review your R code to determine if there are steps that can be performed independently, or performed more efficiently, by using a separate stored procedure call. 例如,通过分别执行特征工程或特征提取,并将值保存到表中,可以获得更好的性能。For example, you might get better performance by doing feature engineering or feature extraction separately, and saving the values to a table.

  • 尝试使用 T-SQL 而不是使用 R 代码进行基于集的计算的方法。Look for ways to use T-SQL rather than R code for set-based computations.

    例如,该 R 解决方案显示用户定义的 T-SQL 函数和 R 如何执行相同的特征工程任务:数据科学端到端演练For example, this R solution shows how user-defined T-SQL functions and R can perform the same feature engineering task: Data Science End-to-End Walkthrough.

  • 如有可能,请将传统的 R 函数替换为支持分布式执行的 ScaleR 函数 。If possible, replace conventional R functions with ScaleR functions that support distributed execution. 有关详细信息,请参阅 Comparison of Base R and Scale R Functions(比较 Base R 函数和 Scale R 函数)。For more information, see Comparison of Base R and Scale R Functions.

  • 咨询数据库开发人员,确定通过 SQL Server 功能提高性能的方式,例如内存优化表Resource Governor(如果你拥有 Enterprise Edition 的话)。Consult with a database developer to determine ways to improve performance by using SQL Server features such as memory-optimized tables, or, if you have Enterprise Edition, Resource Governor).

步骤 3.Step 3. 准备部署Prepare for deployment

  • 通知管理员,以便在部署代码之前可以安装和测试包。Notify the administrator so that packages can be installed and tested in advance of deploying your code.

    在开发环境中,也许可以将包安装为代码的一部分,但在生产环境中,这是一种不好的做法。In a development environment, it might be okay to install packages as part of your code, but this is a bad practice in a production environment.

    无论是使用存储过程,还是在 SQL Server 计算上下文中运行 R 代码,都不支持用户库。User libraries are not supported, regardless of whether you are using a stored procedure or running R code in the SQL Server compute context.

在存储过程中打包 R 代码Package your R code in a stored procedure

  • 如果代码相对简单,可以将它嵌入用户定义的 T-SQL 函数中,而无需修改,如此示例中所述:If your code is relatively simple, you can embed it in a T-SQL user-defined function without modification, as described in this samples:

  • 如果代码相对复杂,则使用 R 包 sqlrutils 来转换代码 。If the code is more complex, use the R package sqlrutils to convert your code. 该包旨在帮助有经验的 R 用户编写优秀的存储过程代码。This package is designed to help experienced R users write good stored procedure code.

    第一步是将 R 代码重写为具有明确定义的输入和输出的单个函数。The first step is to rewrite your R code as a single function with clearly defined inputs and outputs.

    然后,使用 sqlrutils 包生成格式正确的输入和输出 。Then, use the sqlrutils package to generate the input and outputs in the correct format. sqlrutils 包会生成完整的存储过程代码,还可以在数据库中注册该存储过程 。The sqlrutils package generates the complete stored procedure code for you, and can also register the stored procedure in the database.

    有关详细信息和示例,请参阅 sqlrutils (SQL)For more information and examples, see sqlrutils (SQL).

与其他工作流集成Integrate with other workflows

  • 利用 T-SQL 工具和 ETL 进程。Leverage T-SQL tools and ETL processes. 作为数据工作流的一部分,请预先执行特征工程、特征提取和数据清理。Perform feature engineering, feature extraction, and data cleansing in advance as part of data workflows.

    使用专用的 R 开发环境(如 用于 Visual Studio 的 R 工具R Tools for Visual Studio 或 RStudio)时,你可能会将数据提取到计算机,以迭代方式分析数据,然后写出或显示结果。When you are working in a dedicated R development environment such as 用于 Visual Studio 的 R 工具R Tools for Visual Studio or RStudio, you might pull data to your computer, analyze the data iteratively, and then write out or display the results.

    但是,如果将独立的 R 代码迁移到了 SQL Server,则此过程的大部分步骤可以简化或委派给其他 SQL Server 工具。However, when standalone R code is migrated to SQL Server, much of this process can be simplified or delegated to other SQL Server tools.

  • 使用安全的异步可视化策略。Use secure, asynchronous visualization strategies.

    SQL Server 用户通常无法访问服务器上的文件,且 SQL 客户端工具通常不支持 R 图形设备。Users of SQL Server often cannot access files on the server, and SQL client tools typically do not support the R graphics device. 如果解决方案的一部分是生成绘图或其他图形,则考虑将绘图导出为二进制数据并将其保存到表或写入。If you generate plots or other graphics as part of the solution, consider exporting the plots as binary data and saving to a table, or writing.

  • 在存储过程中包装预测和评分函数,供应用程序直接访问。Wrap prediction and scoring functions in stored procedures for direct access by applications.

后续步骤Next steps

若要查看如何在 SQL Server 中部署 R 解决方案的示例,请参阅以下示例:To view examples of how an R solution can be deployed in SQL Server, see these samples: