R 教程:训练并保存模型R tutorial: Train and save model

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance

在此系列教程的第四部分(共五部分),你将学习如何使用 R 训练机器学习模型。使用在上一部分中创建的数据特征来训练该模型,然后将训练后的模型保存到一个 SQL ServerSQL Server 表中。In part four of this five-part tutorial series, you'll learn how to train a machine learning model by using R. You'll train the model using the data features you created in the previous part, and then save the trained model in a SQL ServerSQL Server table. 在本例中,R 包已随 R Services(数据库内)R Services (In-Database) 安装,因此可通过 SQL 完成所有操作。In this case, the R packages are already installed with R Services(数据库内)R Services (In-Database), so everything can be done from SQL.

在本文中,你将:In this article, you'll:

  • 使用 SQL 存储过程创建并训练模型Create and train a model using a SQL stored procedure
  • 将训练后的模型保存到 SQL 表中Save the trained model to a SQL table

第一部分中,你安装了必备条件并还原了示例数据库。In part one, you installed the prerequisites and restored the sample database.

第二部分中,你查看了示例数据,并生成一些绘图。In part two, you reviewed the sample data and generate some plots.

第三部分中,你学习了如何使用 Transact-SQL 函数根据原始数据创建特征。In part three, you learned how to create features from raw data by using a Transact-SQL function. 然后从存储过程调用了该函数,创建了包含该功能值的表。You then called that function from a stored procedure to create a table that contains the feature values.

第五部分中,你将了解如何操作在第四部分中训练和保存的模型。In part five, you'll learn how to operationalize the models that you trained and saved in part four.

创建存储过程Create the stored procedure

从 T-SQL 调用 R 时,可以使用系统存储过程 sp_execute_external_scriptWhen calling R from T-SQL, you use the system stored procedure, sp_execute_external_script. 但是,对于经常重复执行的过程(如重新训练模型),在其他存储过程中封装对 sp_execute_external_script 的调用则会更加方便。However, for processes that you repeat often, such as retraining a model, it is easier to encapsulate the call to sp_execute_external_script in another stored procedure.

  1. Management StudioManagement Studio 中,打开一个新的“查询”窗口。In Management StudioManagement Studio, open a new Query window.

  2. 运行以下语句,创建存储过程 RTrainLogitModel。Run the following statement to create the stored procedure RTrainLogitModel. 此存储过程定义输入数据,并使用 glm 来创建逻辑回归模型。This stored procedure defines the input data and uses glm to create a logistic regression model.

    CREATE PROCEDURE [dbo].[RTrainLogitModel] (@trained_model varbinary(max) OUTPUT)
    
    AS
    BEGIN
      DECLARE @inquery nvarchar(max) = N'
        select tipped, fare_amount, passenger_count,trip_time_in_secs,trip_distance,
        pickup_datetime, dropoff_datetime,
        dbo.fnCalculateDistance(pickup_latitude, pickup_longitude,  dropoff_latitude, dropoff_longitude) as direct_distance
        from nyctaxi_sample
        tablesample (70 percent) repeatable (98052)
    '
    
      EXEC sp_execute_external_script @language = N'R',
                                      @script = N'
    ## Create model
    logitObj <- glm(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, data = InputDataSet, family = binomial)
    summary(logitObj)
    
    ## Serialize model 
    trained_model <- as.raw(serialize(logitObj, NULL));
    ',
      @input_data_1 = @inquery,
      @params = N'@trained_model varbinary(max) OUTPUT',
      @trained_model = @trained_model OUTPUT; 
    END
    GO
    
    • 为了确保留下一些数据以测试模型,70% 的数据是出于定型目的从出租车数据表中随机选择的。To ensure that some data is left over to test the model, 70% of the data are randomly selected from the taxi data table for training purposes.

    • SELECT 查询使用自定义标量函数 fnCalculateDistance 计算上车与下车位置之间的直接距离。The SELECT query uses the custom scalar function fnCalculateDistance to calculate the direct distance between the pick-up and drop-off locations. 查询的结果存储在默认 R 输入变量 InputDataset 中。The results of the query are stored in the default R input variable, InputDataset.

    • R 脚本调用 R 函数 glm 来创建逻辑回归模型。The R script calls the R function glm to create the logistic regression model.

      二进制变量 tipped 用作标签或结果列,模型使用以下这些特征列进行调整:passenger_counttrip_distancetrip_time_in_secsdirect_distanceThe binary variable tipped is used as the label or outcome column, and the model is fit using these feature columns: passenger_count, trip_distance, trip_time_in_secs, and direct_distance.

    • 已定型模型(保存在 R 变量 logitObj 中)会进行序列化,并作为输出参数返回。The trained model, saved in the R variable logitObj, is serialized and returned as an output parameter.

使用存储过程定型并部署 R 模型Train and deploy the R model using the stored procedure

因为存储过程已包含输入数据的定义,所以无需提供输入查询。Because the stored procedure already includes a definition of the input data, you don't need to provide an input query.

  1. 要定型和部署 R 模型,请调用存储过程,并将其插入到数据库表 nyc_taxi_models 中,以便可以将其用于未来的预测:To train and deploy the R model, call the stored procedure and insert it into the database table nyc_taxi_models, so that you can use it for future predictions:

    DECLARE @model VARBINARY(MAX);
    EXEC RTrainLogitModel @model OUTPUT;
    INSERT INTO nyc_taxi_models (name, model) VALUES('RTrainLogit_model', @model);
    
  2. 查看 Management StudioManagement Studio 的“消息”窗口,以查看通过管道传递到 R 的 stdout 流的消息,如下所示:Watch the Messages window of Management StudioManagement Studio for messages that would be piped to R's stdout stream, like this message:

    “来自外部脚本的 STDOUT 消息:读取的行数:1193025,已处理的总行数:1193025,总区块时间:0.093 秒”"STDOUT message(s) from external script: Rows Read: 1193025, Total Rows Processed: 1193025, Total Chunk Time: 0.093 seconds"

  3. 语句完成后,打开表 nyc_taxi_models。When the statement has completed, open the table nyc_taxi_models. 数据处理和模型调整可能需要一些时间。Processing of the data and fitting the model might take a while.

    可以看到已添加了一个新行,该行在列 model 中包含序列化模型,在列 name 中包含模型名称 RTrainLogit_model。You can see that one new row has been added, which contains the serialized model in the column model and the model name RTrainLogit_model in the column name.

    model                        name
    ---------------------------- ------------------
    0x580A00000002000302020....  RTrainLogit_model
    

在此教程的下一部分中,你将使用已训练模型来生成预测。In the next part of this tutorial you'll use the trained model to generate predictions.

后续步骤Next steps

本文内容:In this article, you:

  • 使用 SQL 存储过程创建并训练了模型Created and trained a model using a SQL stored procedure
  • 将训练后的模型保存到了 SQL 表中Saved the trained model to a SQL table