Step 5: Train and save a model using T-SQL

In this step, you learn how to train a machine learning model using the Python packages scikit-learn and revoscalepy. These Python libraries are already installed with SQL Server Machine Learning Services, so you can load the modules and call the necessary functions from within a stored procedure. You'll train the model using the data features you just created, and then save the trained model in a SQL Server table.

Split the sample data into training and testing sets

  1. Run the following T-SQL commands to create a stored procedure that divides the data in the nyctaxi_sample table into two parts: nyctaxi_sample_training and nyctaxi_sample_testing.

    CREATE PROCEDURE [dbo].[TrainTestSplit] (@pct int)
    AS
    
    DROP TABLE IF EXISTS dbo.nyctaxi_sample_training
    SELECT * into nyctaxi_sample_training FROM nyctaxi_sample WHERE (ABS(CAST(BINARY_CHECKSUM(medallion,hack_license)  as int)) % 100) < @pct
    
    DROP TABLE IF EXISTS dbo.nyctaxi_sample_testing
    SELECT * into nyctaxi_sample_testing FROM nyctaxi_sample
    WHERE (ABS(CAST(BINARY_CHECKSUM(medallion,hack_license)  as int)) % 100) > @pct
    GO
    
  2. Run the stored procedure, and type an integer that represents the percentage of data allocated to the training set. For example, the following statement would allocate 60% of data to the training set. Training and testing data are stored in two separate tables.

    EXEC TrainTestSplit 60
    GO
    

Build a logistic regression model using scikit-learn

In this section, you create a stored procedure that can be used to train a model using the training data you just prepared. This stored procedure defines the input data and uses a scikit-learn function to train a logistic regression model. You call the Python runtime that is installed with SQL Server by using the system stored procedure, sp_execute_external_script.

To make it easier to retrain the model, you can wrap the call to sp_execute_exernal_script in another stored procedure, and pass in the new training data as a parameter. This section will walk you through that process.

  1. In Management Studio, open a new Query window and run the following statement to create the stored procedure TrainTipPredictionModelSciKitPy. Note that the stored procedure contains a definition of the input data, so you don't need to provide an input query.

    DROP PROCEDURE IF EXISTS TrainTipPredictionModelSciKitPy;
    GO
    
    CREATE PROCEDURE [dbo].[TrainTipPredictionModelSciKitPy] (@trained_model varbinary(max) OUTPUT)
    AS
    BEGIN
      EXEC sp_execute_external_script
      @language = N'Python',
      @script = N'
      import numpy
      import pickle
      import pandas
      from sklearn.linear_model import LogisticRegression
    
      ##Create SciKit-Learn logistic regression model
      X = InputDataSet[["passenger_count", "trip_distance", "trip_time_in_secs", "direct_distance"]]
      y = numpy.ravel(InputDataSet[["tipped"]])
    
      SKLalgo = LogisticRegression()
      logitObj = SKLalgo.fit(X, y)
    
      ##Serialize model
      trained_model = pickle.dumps(logitObj)
      ',
      @input_data_1 = N'
      select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, 
      dbo.fnCalculateDistance(pickup_latitude, pickup_longitude,  dropoff_latitude, dropoff_longitude) as direct_distance
      from nyctaxi_sample_training
      ',
      @input_data_1_name = N'InputDataSet',
      @params = N'@trained_model varbinary(max) OUTPUT',
      @trained_model = @trained_model OUTPUT;
      ;
    END;
    GO
    
  2. Run the following SQL statements to insert the trained model into table nyc_taxi_models.

    DECLARE @model VARBINARY(MAX);
    EXEC TrainTipPredictionModelSciKitPy @model OUTPUT;
    INSERT INTO nyc_taxi_models (name, model) VALUES('SciKit_model', @model);
    

    Processing of the data and fitting the model might take a couple of mins. Messages that would be piped to Python's stdout stream are displayed in the Messages window of Management Studio. For example:

    STDOUT message(s) from external script: C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\revoscalepy

  3. Open the table nyc_taxi_models. You can see that one new row has been added, which contains the serialized model in the column model.

    linear_model 0x800363736B6C6561726E2E6C696E6561....

Build a logistic model using the revoscalepy package

Now, create a different stored procedure that uses the new revoscalepy package to train a logistic regression model. The revoscalepy package for Python contains objects, transformation, and algorithms similar to those provided for the R language's RevoScaleR package. With this library, you can create a compute context, move data between compute contexts, transform data, and train predictive models using popular algorithms such as logistic and linear regression, decision trees, and more. For more information, see what is revoscalepy?

  1. In Management Studio, open a new Query window and run the following statement to create the stored procedure TrainTipPredictionModelRxPy. This model also uses the training data you just prepared. Because the stored procedure already includes a definition of the input data, you don't need to provide an input query.

    DROP PROCEDURE IF EXISTS TrainTipPredictionModelRxPy;
    GO
    
    CREATE PROCEDURE [dbo].[TrainTipPredictionModelRxPy] (@trained_model varbinary(max) OUTPUT)
    AS
    BEGIN
    EXEC sp_execute_external_script 
      @language = N'Python',
      @script = N'
    import numpy
    import pickle
    import pandas
    from revoscalepy.functions.RxLogit import rx_logit_ex
    
    ## Create a logistic regression model using rx_logit_ex function from revoscalepy package
    logitObj = rx_logit_ex("tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance", data = InputDataSet);
    
    ## Serialize model
    trained_model = pickle.dumps(logitObj)
    ',
    @input_data_1 = N'
    select tipped, fare_amount, passenger_count, trip_time_in_secs, trip_distance, 
    dbo.fnCalculateDistance(pickup_latitude, pickup_longitude,  dropoff_latitude, dropoff_longitude) as direct_distance
    from nyctaxi_sample_training
    ',
    @input_data_1_name = N'InputDataSet',
    @params = N'@trained_model varbinary(max) OUTPUT',
    @trained_model = @trained_model OUTPUT;
    ;
    END;
    GO
    

    This stored procedure performs the following steps as part of model training:

    • Train a logistic regression model using revoscalepy package on nyctaxi_sample_training data.
    • The SELECT query uses the custom scalar function fnCalculateDistance to calculate the direct distance between the pick-up and drop-off locations. The results of the query are stored in the default Python input variable, InputDataset.
    • The Python script calls the revoscalepy's LogisticRegression function, which is included with R Services (In-Database), to create the logistic regression model.
    • The binary variable tipped is used as the label or outcome column, and the model is fit using these feature columns: passenger_count, trip_distance, trip_time_in_secs, and direct_distance.
    • The trained model, contained in the Python variable logitObj, is serialized and put as an output parameter SQL Server. That output is inserted into the database table nyc_taxi_models, along with its name, as a new row, so that you can retrieve and use it for future predictions.
  2. Run the stored procedure as follows to insert the trained revoscalepy model into the table _nyc_taxi_models.

    DECLARE @model VARBINARY(MAX);
    EXEC TrainTipPredictionModelRxPy @model OUTPUT;
    
    INSERT INTO nyc_taxi_models (name, model) VALUES('revoscalepy_model', @model);
    

    Processing of the data and fitting the model might take a while. Messages that would be piped to Python's stdout stream are displayed in the Messages window of Management Studio. For example:

    STDOUT message(s) from external script: C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\site-packages\revoscalepy

  3. Open the table nyc_taxi_models. You can see that one new row has been added, which contains the serialized model in the column model.

    rx_model 0x8003637265766F7363616c....

In the next step, you use the trained models to create predictions.

Next step

Step 6: Operationalize the model

Previous step

Step 4: Create data features using T-SQL