Quickstart: Create a predictive model using R in SQL Server

APPLIES TO: yesSQL Server noAzure SQL Database noAzure SQL Data Warehouse noParallel Data Warehouse

In this quickstart, you'll learn how to train a model using R, and then save the model to a table in SQL Server. The model is a simple generalized linear model (GLM) that predicts probability that a vehicle has been fitted with a manual transmission. You'll use the mtcars dataset included with R.

Prerequisites

A previous quickstart, Verify R exists in SQL Server, provides information and links for setting up the R environment required for this quickstart.

Create the source data

First, create a table to save the training data.

CREATE TABLE dbo.MTCars(
    mpg decimal(10, 1) NOT NULL,
    cyl int NOT NULL,
    disp decimal(10, 1) NOT NULL,
    hp int NOT NULL,
    drat decimal(10, 2) NOT NULL,
    wt decimal(10, 3) NOT NULL,
    qsec decimal(10, 2) NOT NULL,
    vs int NOT NULL,
    am int NOT NULL,
    gear int NOT NULL,
    carb int NOT NULL
);

Next, insert the data from the build in dataset mtcars.

INSERT INTO dbo.MTCars
EXEC sp_execute_external_script
        @language = N'R'
        , @script = N'MTCars <- mtcars;'
        , @input_data_1 = N''
        , @output_data_1_name = N'MTCars';
  • Some people like to use temporary tables, but be aware that some R clients disconnect sessions between batches.

  • Many datasets, small and large, are included with the R runtime. To get a list of datasets installed with R, type library(help="datasets") from an R command prompt.

Create a model

The car speed data contains two columns, both numeric, horsepower (hp) and weight (wt). From this data, you will create a generalized linear model (GLM) that estimates the probability that a vehicle has been fitted with a manual transmission.

To build the model, you define the formula inside your R code, and pass the data as an input parameter.

DROP PROCEDURE IF EXISTS generate_GLM;
GO
CREATE PROCEDURE generate_GLM
AS
BEGIN
    EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'carsModel <- glm(formula = am ~ hp + wt, data = MTCarsData, family = binomial);
        trained_model <- data.frame(payload = as.raw(serialize(carsModel, connection=NULL)));'
    , @input_data_1 = N'SELECT hp, wt, am FROM MTCars'
    , @input_data_1_name = N'MTCarsData'
    , @output_data_1_name = N'trained_model'
    WITH RESULT SETS ((model VARBINARY(max)));
END;
GO
  • The first argument to glm is the formula parameter, which defines am as dependent on hp + wt.
  • The input data is stored in the variable MTCarsData, which is populated by the SQL query. If you don't assign a specific name to your input data, the default variable name would be InputDataSet.

Create a table for the model

Next, store the model so you can retrain or use it for prediction. The output of an R package that creates a model is usually a binary object. Therefore, the table where you store the model must provide a column of varbinary(max) type.

CREATE TABLE GLM_models (
    model_name varchar(30) not null default('default model') primary key,
    model varbinary(max) not null
);

Save the model

To save the model, run the following Transact-SQL statement to call the stored procedure, generate the model, and save it to a table.

INSERT INTO GLM_models(model)
EXEC generate_GLM;

Note that if you run this code a second time, you get this error:

Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models

One option for avoiding this error is to update the name for each new model. For example, you could change the name to something more descriptive, and include the model type, the day you created it, and so forth.

UPDATE GLM_models
SET model_name = 'GLM_' + format(getdate(), 'yyyy.MM.HH.mm', 'en-gb')
WHERE model_name = 'default model'

Next steps

Now that you have a model, in the final quickstart, you'll learn how to generate predictions from it and plot the results.