SQL Server 机器学习服务中的扩展性体系结构Extensibility architecture in SQL Server Machine Learning Services

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later

本文介绍用于在 SQL Server 机器学习服务上运行外部 Python 或 R 脚本的扩展性框架的体系结构。This article describes the architecture of the extensibility framework for running an external Python or R script on SQL server Machine Learning Services. 该脚本在语言运行时环境中作为核心数据库引擎的扩展执行。The script executes in a language runtime environment as an extension to the core database engine.

背景Background

在 SQL Server 2016 中引入了扩展性框架,用于支持 R Services 的 R 运行时。The extensibility framework was introduced in SQL Server 2016 to support the R runtime with R Services. SQL Server 2017 及更高版本支持机器学习服务的 Python。SQL Server 2017 and later has support for Python with Machine Learning Services.

扩展性框架用于在 SQL Server 与数据科学语言(例如 R 和 Python)之间提供一个接口。The purpose of the extensibility framework is to provide an interface between SQL Server and data science languages such as R and Python. 目的是在将数据科学解决方案迁移到生产环境时减少摩擦,以及保护开发过程中公开的数据。The goal is to reduce friction when moving data science solutions into production, and protecting data exposed during the development process. 通过在 SQL Server 管理的安全框架内执行受信任的脚本语言,数据库管理员可以在维持安全性的同时允许数据科学家访问企业数据。By executing a trusted scripting language within a secure framework managed by SQL Server, database administrators can maintain security while allowing data scientists access to enterprise data.

下图直观地说明了可扩展体系结构的机会和优势。The following diagram visually describes opportunities and benefits of the extensible architecture.

与 SQL Server 集成的目标Goals of integration with SQL Server

可以通过调用存储过程来运行外部脚本,然后将结果以表格结果的形式直接返回到 SQL Server。An external script can be run by calling a stored procedure, and the results are returned as tabular results directly to SQL Server. 这样就可以在任何可以发送 SQL 查询并处理结果的应用程序中轻松地生成或使用机器学习。This makes it easy to generate or consume machine learning from any application that can send a SQL query and handle the results.

  • 外部脚本执行受 SQL Server 数据安全性的约束。External script execution is subject to SQL Server data security. 运行外部脚本的用户只能访问在 SQL 查询中同样可用的数据。A user running an external script can only access data that is equally available in a SQL query. 如果由于权限不足而导致查询失败,则同一用户运行的脚本也会因为同一原因而失败。If a query fails due to insufficient permission, a script run by the same user would also fail for the same reason. 系统在表、数据库和实例级别实施 SQL Server 安全性。SQL Server security is enforced at the table, database, and instance level. 数据库管理员可以管理用户访问权限、外部脚本使用的资源以及添加到服务器的外部代码库。Database administrators can manage user access, resources used by external scripts, and external code libraries added to the server.

  • 可通过两种方式获得扩展和优化机会:受益于数据库平台(列存储索引、资源治理);以及特定于扩展插件的受益,例如,将适用于 R 和 Python 的 Microsoft 库用于数据科学模型时。Scale and optimization opportunities have a dual basis: gains through the database platform (ColumnStore indexes, resource governance); and extension-specific gains, for example when Microsoft libraries for R and Python are used for data science models. R 是单线程的,而 RevoScaleR 函数是多线程的,后者能够在多个内核之间分配工作负荷。Whereas R is single-threaded, RevoScaleR functions are multi-threaded, capable of distributing a workload over multiple cores.

  • 部署使用 SQL Server 方法。Deployment uses SQL Server methodologies. 这些方法可以是包装了外部脚本的存储过程、嵌入式 SQL 或调用 PREDICT(用于从服务器上保留的预测模型返回结果)等函数的 T-SQL 查询。These can be stored procedures wrapping an external script, embedded SQL, or T-SQL queries calling functions like PREDICT to return results from forecasting models persisted on the server.

  • 熟练使用特定工具和 IDE 的开发人员可以在这些工具中编写代码,然后将代码移植到 SQL Server。Developers with established skills in specific tools and IDEs can write code in those tools and then port the code to SQL Server.

体系结构关系图Architecture diagram

该体系结构的设计理念是让外部脚本在与 SQL Server 不同的进程中运行,但具有在内部管理 SQL Server 上的数据和操作请求链的组件。The architecture is designed such that external scripts run in a separate process from SQL Server, but with components that internally manage the chain of requests for data and operations on SQL Server. 根据 SQL Server 版本的不同,支持的语言扩展包括 RPython 以及第三方语言(例如 Java 和 .NET)。Depending on the version of SQL Server, supported language extensions include R, Python, and third-party languages such as Java and .NET.

Windows 中的组件体系结构:_**Component architecture in Windows: _

Windows 组件体系结构Windows component architecture

Linux 中的组件体系结构:Component architecture in Linux:

Linux 组件体系结构Linux component architecture

组件包括用于调用外部运行时的 _Launchpad* 服务以及用于加载解释器和库的特定于库的逻辑。Components include a _ launchpad* service used to invoke external runtimes and library-specific logic for loading interpreters and libraries. 启动器用于加载语言运行时和所有专有模块。The launcher loads a language runtime, plus any proprietary modules. 例如,如果代码包含 RevoScaleR 函数,则会加载 RevoScaleR 解释器。For example, if your code includes RevoScaleR functions, a RevoScaleR interpreter is loaded. BxlServerSQL Satellite 管理与 SQL Server 的通信和数据传输。BxlServer and SQL Satellite manage communication and data transfer with SQL Server.

在 Linux 中,SQL 使用 Launchpad 服务与每个用户的独立 Launchpad 进程进行通信。In Linux, SQL uses a launchpadd service to communicate with a separate launchpad process for each user.

启动板Launchpad

SQL Server LaunchpadSQL Server Launchpad 是用来管理和执行外部脚本的服务,其工作方式类似于全文索引和查询服务启动单独的主机来处理全文查询。The SQL Server LaunchpadSQL Server Launchpad is a service that manages and executes external scripts, similar to the way that the full-text indexing and query service launches a separate host for processing full-text queries. 该 Launchpad 服务只能启动 Microsoft 发布的受信任启动器,或者经 Microsoft 认证满足性能和资源管理要求的启动器。The launchpad service can start only trusted launchers that are published by Microsoft, or that have been certified by Microsoft as meeting requirements for performance and resource management.

受信任的启动器Trusted launchers 分机Extension SQL Server 版本SQL Server versions
适用于 Windows 的 R 语言的 RLauncher.dllRLauncher.dll for the R language for Windows R 扩展R extension SQL Server 2016 及更高版本SQL Server 2016 and later
适用于 Windows 的 Python 3.5 的 Pythonlauncher.dllPythonlauncher.dll for Python 3.5 for Windows Python 扩展Python extension SQL Server 2017 及更高版本SQL Server 2017 and later
适用于 Linux 的 R 语言的 RLauncher.soRLauncher.so for the R language for Linux R 扩展R extension SQL Server 2019 及更高版本SQL Server 2019 and later
适用于 Linux 的 Python 3.5 的 Pythonlauncher.soPythonlauncher.so for Python 3.5 for Linux Python 扩展Python extension SQL Server 2019 及更高版本SQL Server 2019 and later

SQL Server LaunchpadSQL Server Launchpad 服务在其自身的用户帐户下运行。The SQL Server LaunchpadSQL Server Launchpad service runs under its own user account. 如果更改运行 Launchpad 的帐户,请务必使用 SQL Server 配置管理器来执行此操作,以确保将更改写入相关文件。If you change the account that runs launchpad, be sure to do so using SQL Server Configuration Manager, to ensure that changes are written to related files.

在 Windows 中,系统会为每个已添加 SQL Server 机器学习服务的数据库引擎实例创建一个单独的 SQL Server LaunchpadSQL Server Launchpad 服务。In Windows, a separate SQL Server LaunchpadSQL Server Launchpad service is created for each database engine instance to which you have added SQL Server Machine Learning Services. 每个数据库引擎实例都有一个 Launchpad 服务,因此如果有多个具有外部脚本支持的实例,则每个实例都有一个 Launchpad 服务。There is one launchpad service for each database engine instance, so if you have multiple instances with external script support, you will have a launchpad service for each one. 数据库引擎实例会绑定到为它创建的 Launchpad 服务。A database engine instance is bound to the launchpad service created for it. 存储过程或 T-SQL 中对外部脚本的所有调用都会导致 SQL Server 服务调用为同一实例创建的 Launchpad 服务。All invocations of external script in a stored procedure or T-SQL result in the SQL Server service calling the launchpad service created for the same instance.

若要以支持的特定语言执行任务,Launchpad 会从池中获取安全工作线程帐户,并启动附属进程以管理外部运行时。To execute tasks in a specific supported language, the launchpad gets a secured worker account from the pool, and starts a satellite process to manage the external runtime. 每个附属进程都会继承 Launchpad 的用户帐户,并在脚本执行期间使用该工作线程帐户。Each satellite process inherits the user account of the launchpad and uses that worker account for the duration of script execution. 如果脚本使用并行进程,则会在同一个工作线程帐户下创建这些进程。If script uses parallel processes, they are created under the same, single worker account.

在 Linux 中,仅支持一个数据库引擎实例,并且该实例绑定一个 Launchpad 服务。In Linux, only one database engine instance is supported and there is one launchpadd service bound to the instance. 执行脚本后,该 Launchpad 服务使用低特权用户帐户 mssql_satellite 启动一个单独的 Launchpad 进程。When a script is executed, the launchpadd service starts a separate launchpad process with the low-privileged user account mssql_satellite. 每个附属进程都会继承 Launchpad 的 mssql_satellite 用户帐户,并在脚本执行期间使用该帐户。Each satellite process inherits the mssql_satellite user account of launchpad and uses that for the duration of script execution.

BxlServer 和 SQL SatelliteBxlServer and SQL Satellite

BxlServer 是由 Microsoft 提供的一个可执行文件,用于管理 SQL Server 和语言运行时之间的通信。BxlServer is an executable provided by Microsoft that manages communication between SQL Server and the language runtime. 它为 Windows 创建 Windows 作业对象,或为 Linux 创建命名空间,用来包含外部脚本会话。It creates the Windows job objects for Windows, or the namespaces for Linux, that are used to contain external script sessions. 它还为每个外部脚本作业预配安全的工作文件夹,并使用 SQL Satellite 管理外部运行时与 SQL Server 之间的数据传输。It also provisions secure working folders for each external script job and uses SQL Satellite to manage data transfer between the external runtime and SQL Server. 如果在某个作业运行时运行进程资源管理器,则可能会看到一个或多个 BxlServer 实例。If you run Process Explorer while a job is running, you might see one or multiple instances of BxlServer.

实际上,BxlServer 是语言运行时环境的配套项,该运行时环境与 SQL Server 一起传输数据和管理任务。In effect, BxlServer is a companion to a language runtime environment that works with SQL Server to transfer data and manage tasks. BXL 是二进制交换语言的缩写,是指在 SQL Server 与外部进程之间有效移动数据时所用的数据格式。BXL stands for Binary Exchange language and refers to the data format used to move data efficiently between SQL Server and external processes. BxlServer 也是 Microsoft R Client 和 Microsoft R Server 等相关产品的重要组成部分。BxlServer is also an important part of related products such as Microsoft R Client and Microsoft R Server.

SQL Satellite 是数据库引擎中包含的扩展性 API,它支持使用 C 或 C++ 实现的外部代码或外部运行时。SQL Satellite is an extensibility API, included in the database engine, that supports external code or external runtimes implemented using C or C++.

BxlServer 使用 SQL Satellite 执行以下任务:BxlServer uses SQL Satellite for these tasks:

  • 读取输入数据Reading input data
  • 写入输出数据Writing output data
  • 获取输入参数Getting input arguments
  • 写入输出参数Writing output arguments
  • 错误处理。Error handling
  • 将 STDOUT 和 STDERR 写回客户端Writing STDOUT and STDERR back to client

SQL Satellite 使用自定义数据格式,这种格式已针对 SQL Server 与外部脚本语言之间的快速数据传输进行优化。SQL Satellite uses a custom data format that is optimized for fast data transfer between SQL Server and external script languages. 在 SQL Server 与外部脚本运行时通信期间,SQL Satellite 会执行类型转换并定义输入和输出数据集的架构。It performs type conversions and defines the schemas of the input and output datasets during communications between SQL Server and the external script runtime.

可以使用 Windows 扩展事件 (xEvent) 来监视 SQL Satellite。The SQL Satellite can be monitored by using windows extended events (xEvents). 有关详细信息,请参阅 SQL Server 机器学习服务的扩展事件For more information, see Extended Events for SQL Server Machine Learning Services.

各组件之间的通信通道Communication channels between components

此部分介绍组件和数据平台之间的通信协议。Communication protocols among components and data platforms are described in this section.

  • TCP/IPTCP/IP

    默认情况下,SQL Server 与 SQL Satellite 之间的内部通信使用 TCP/IP。By default, internal communications between SQL Server and the SQL Satellite use TCP/IP.

  • 命名管道Named Pipes

    BxlServer 与 SQL Server 之间通过 SQL Satellite 进行的内部数据传输使用专有的压缩数据格式来增强性能。Internal data transport between the BxlServer and SQL Server through SQL Satellite uses a proprietary, compressed data format to enhance performance. 可使用命名管道以 BXL 格式在语言运行时和 BxlServer 之间交换数据。Data is exchanged between language run times and BxlServer in BXL format, using Named Pipes.

  • ODBCODBC

    外部数据科学客户端与远程 SQL Server 实例之间的通信使用 ODBC。Communications between external data science clients and a remote SQL Server instance use ODBC. 将脚本作业发送到 SQL Server 的帐户必须有权连接到实例,同时有权运行外部脚本。The account that sends the script jobs to SQL Server must have both permissions to connect to the instance and to run external scripts.

    此外,根据任务的不同,该帐户可能需要以下权限:Additionally, depending on the task, the account might need these permissions:

    • 读取作业使用的数据Read data used by the job
    • 将数据写入表:例如,将结果保存到表中时Write data to tables: for example, when saving results to a table
    • 创建数据库对象:例如,将外部脚本保存为新存储过程的一部分时。Create database objects: for example, if saving external script as part of a new stored procedure.

    如果将 SQL Server 用作从远程客户端执行的脚本的计算上下文,并且可执行文件必须从外部源检索数据,则 ODBC 将用于写回。When SQL Server is used as the compute context for script executed from a remote client, and the executable must retrieve data from an external source, ODBC is used for writeback. SQL Server 将发出远程命令的用户标识映射到当前实例上的用户标识,并使用该用户的凭据运行 ODBC 命令。SQL Server maps the identity of the user issuing the remote command to the identity of the user on the current instance, and runs the ODBC command using that user's credentials. 执行此 ODBC 调用所需的连接字符串从客户端代码中获取。The connection string needed to perform this ODBC call is obtained from the client code.

  • RODBC(仅限 R)RODBC (R only)

    可以使用 RODBC 在脚本内部发出其他 ODBC 调用。Additional ODBC calls can be made inside the script by using RODBC. RODBC 是用于访问关系数据库中数据的常用 R 包;但是,与 SQL Server 使用的同类提供程序相比,其性能通常较慢。RODBC is a popular R package used to access data in relational databases; however, its performance is generally slower than comparable providers used by SQL Server. 许多 R 脚本使用 RODBC 的嵌入式调用,类似于检索用于分析的“辅助”数据集。Many R scripts use embedded calls to RODBC as a way of retrieving "secondary" datasets for use in analysis. 例如,用于训练模型的存储过程可以定义一个 SQL 查询来获取用于训练模型的数据,但使用嵌入式 RODBC 调用来获取其他因子,以执行查找,或者从文本文件或 Excel 等外部源中获取新数据。For example, the stored procedure that trains a model might define a SQL query to get the data for training a model, but use an embedded RODBC call to get additional factors, to perform lookups, or to get new data from external sources such as text files or Excel.

    以下代码演示了 R 脚本中嵌入的 RODBC 调用:The following code illustrates an RODBC call embedded in an R script:

    library(RODBC);
    connStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=", database_name, ";Trusted_Connection=true;", sep="");
    dbhandle <- odbcDriverConnect(connStr)
    OutputDataSet <- sqlQuery(dbhandle, "select * from table_name");
    
  • 其他协议Other protocols

    可能需要在“区块”中工作或将数据传输回远程客户端的进程还可以使用 XDF 文件格式Processes that might need to work in "chunks" or transfer data back to a remote client can also use the XDF file format. 实际数据传输通过编码的 blob 进行。Actual data transfer is via encoded blobs.

另请参阅See Also