Create a predictive model (R in SQL quickstart)
In this step, you'll learn how to train a model using R, and then save the model to a table in SQL Server. The model is a simple regression model that predicts the stopping distance of a car based on speed. You'll use the
cars dataset included with R, because it is small and easy to understand.
Create the source data
First, create a table to save the training data.
CREATE TABLE CarSpeed ([speed] int not null, [distance] int not null) INSERT INTO CarSpeed EXEC sp_execute_external_script @language = N'R' , @script = N'car_speed <- cars;' , @input_data_1 = N'' , @output_data_1_name = N'car_speed'
Some people like to use temporary tables, but be aware that some R clients disconnect sessions between batches.
Many datasets, small and large, are included with the R runtime. To get a list of datasets installed with R, type
library(help="datasets")from an R command prompt.
Create a regression model
The car speed data contains two columns, both numeric,
speed. There are multiple observations of some speeds. From this data, you will create a linear regression model that describes some relationship between car speed and the distance required to stop a car.
The requirements of a linear model are simple:
Define a formula that describes the relationship between the dependent variable
speedand the independent variable
Provide input data to use in training the model
If you need a refresher on linear models, we recommend this tutorial, which describes the process of fitting a model using rxLinMod: Fitting Linear Models
To actually build the model, you define the formula inside your R code, and pass the data as an input parameter.
DROP PROCEDURE IF EXISTS generate_linear_model; GO CREATE PROCEDURE generate_linear_model AS BEGIN EXEC sp_execute_external_script @language = N'R' , @script = N'lrmodel <- rxLinMod(formula = distance ~ speed, data = CarsData); trained_model <- data.frame(payload = as.raw(serialize(lrmodel, connection=NULL)));' , @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed' , @input_data_1_name = N'CarsData' , @output_data_1_name = N'trained_model' WITH RESULT SETS ((model varbinary(max))); END; GO
- The first argument to rxLinMod is the formula parameter, which defines distance as dependent on speed.
- The input data is stored in the variable
CarsData, which is populated by the SQL query. If you don't assign a specific name to your input data, the default variable name would be InputDataSet.
Create a table for storing the model
Next, store the model so you can retrain or use it for prediction. The output of an R package that creates a model is usually a binary object. Therefore, the table where you store the model must provide a column of varbinary type.
CREATE TABLE stopping_distance_models ( model_name varchar(30) not null default('default model') primary key, model varbinary(max) not null);
Save the model
To save the model, run the following Transact-SQL statement to call the stored procedure, generate the model, and save it to a table.
INSERT INTO stopping_distance_models (model) EXEC generate_linear_model;
Note that if you run this code a second time, you get this error:
Violation of PRIMARY KEY constraint...Cannot insert duplicate key in object dbo.stopping_distance_models
One option for avoiding this error is to update the name for each new model. For example, you could change the name to something more descriptive, and include the model type, the day you created it, and so forth.
UPDATE stopping_distance_models SET model_name = 'rxLinMod ' + format(getdate(), 'yyyy.MM.HH.mm', 'en-gb') WHERE model_name = 'default model'
Output additional variables
Generally, the output of R from the stored procedure sp_execute_external_script is limited to a single data frame. (This limitation might be removed in future.)
However, you can return outputs of other types, such as scalars, in addition to the data frame.
For example, suppose you want to train a model but immediately view a table of coefficients from the model. You could create the table of coefficients as the main result set, and output the trained model in a SQL variable. You could immediately re-use the model by calling the variable, or you could save the model to a table as shown here.
DECLARE @model varbinary(max), @modelname varchar(30) EXEC sp_execute_external_script @language = N'R' , @script = N' speedmodel <- rxLinMod(distance ~ speed, CarsData) modelbin <- serialize(speedmodel, NULL) OutputDataSet <- data.frame(coefficients(speedmodel));' , @input_data_1 = N'SELECT [speed], [distance] FROM CarSpeed' , @input_data_1_name = N'CarsData' , @params = N'@modelbin varbinary(max) OUTPUT' , @modelbin = @model OUTPUT WITH RESULT SETS (([Coefficient] float not null)); -- Save the generated model INSERT INTO [dbo].[stopping_distance_models] (model_name, model) VALUES (' latest model', @model)
Remember these rules for working with SQL parameters and R variables in
- All SQL parameters mapped to R script must be listed by name in the @params argument.
- To output one of these parameters, add the OUTPUT keyword in the @params list.
- After listing the mapped parameters, provide the mapping, line by line, of SQL parameters to R variables, immediately after the @params list.
Now that you have a model, in the final step, you'll learn how to generate predictions from it and plot the results.