Quickstart: R functions with SQL machine learning

Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance

In this quickstart, you'll learn how to use R mathematical and utility functions with SQL Server Machine Learning Services or on Big Data Clusters. Statistical functions are often complicated to implement in T-SQL, but can be done in R with only a few lines of code.

In this quickstart, you'll learn how to use R mathematical and utility functions with SQL Server Machine Learning Services. Statistical functions are often complicated to implement in T-SQL, but can be done in R with only a few lines of code.

In this quickstart, you'll learn how to use R mathematical and utility functions with SQL Server R Services. Statistical functions are often complicated to implement in T-SQL, but can be done in R with only a few lines of code.

In this quickstart, you'll learn how to use data structures and data types when using R in Azure SQL Managed Instance Machine Learning Services. You'll learn about moving data between R and SQL Managed Instance, and the common issues that might occur.

Prerequisites

You need the following prerequisites to run this quickstart.

  • A tool for running SQL queries that contain R scripts. This quickstart uses Azure Data Studio.

Create a stored procedure to generate random numbers

For simplicity, let's use the R stats package, that's installed and loaded by default. The package contains hundreds of functions for common statistical tasks, among them the rnorm function, which generates a specified number of random numbers using the normal distribution, given a standard deviation and mean.

For example, the following R code returns 100 numbers on a mean of 50, given a standard deviation of 3.

as.data.frame(rnorm(100, mean = 50, sd = 3));

To call this line of R from T-SQL, add the R function in the R script parameter of sp_execute_external_script, like this:

EXECUTE sp_execute_external_script
      @language = N'R'
    , @script = N'
         OutputDataSet <- as.data.frame(rnorm(100, mean = 50, sd =3));'
    , @input_data_1 = N'   ;'
      WITH RESULT SETS (([Density] float NOT NULL));

What if you'd like to make it easier to generate a different set of random numbers?

That's easy when combined with T-SQL. You define a stored procedure that gets the arguments from the user, then pass those arguments into the R script as variables.

CREATE PROCEDURE MyRNorm (
    @param1 INT
    , @param2 INT
    , @param3 INT
    )
AS
EXECUTE sp_execute_external_script @language = N'R'
    , @script = N'
	     OutputDataSet <- as.data.frame(rnorm(mynumbers, mymean, mysd));'
    , @input_data_1 = N'   ;'
    , @params = N' @mynumbers int, @mymean int, @mysd int'
    , @mynumbers = @param1
    , @mymean = @param2
    , @mysd = @param3
WITH RESULT SETS(([Density] FLOAT NOT NULL));
  • The first line defines each of the SQL input parameters that are required when the stored procedure is executed.

  • The line beginning with @params defines all variables used by the R code, and the corresponding SQL data types.

  • The lines that immediately follow map the SQL parameter names to the corresponding R variable names.

Now that you've wrapped the R function in a stored procedure, you can easily call the function and pass in different values, like this:

EXECUTE MyRNorm @param1 = 100,@param2 = 50, @param3 = 3

Use R utility functions for troubleshooting

The utils package, installed by default, provides a variety of utility functions for investigating the current R environment. These functions can be useful if you're finding discrepancies in the way your R code performs in SQL Server and in outside environments.

For example, you might use the system timing functions in R, such as system.time and proc.time, to capture the time used by R processes and analyze performance issues. For an example, see the tutorial Create Data Features where R timing functions are embedded in the solution.

EXECUTE sp_execute_external_script
      @language = N'R'
    , @script = N'
        library(utils);
        start.time <- proc.time();
        
        # Run R processes
        
        elapsed_time <- proc.time() - start.time;'

For other useful functions, see Use R code profiling functions to improve performance.

Next steps

To create a machine learning model using R with SQL machine learning, follow this quickstart: