Tutorial: Prepare data to train a predictive model in R with Azure SQL Database Machine Learning Services (preview)

In part one of this three-part tutorial series, you'll import and prepare data from an Azure SQL database using R. Later in this series, you'll use this data to train and deploy a predictive machine learning model in R with Azure SQL Database Machine Learning Services (preview).

For this tutorial series, imagine you own a ski rental business and you want to predict the number of rentals that you'll have on a future date. This information will help you get your stock, staff, and facilities ready.

In parts one and two of this series, you'll develop some R scripts in RStudio to prepare your data and train a machine learning model. Then, in part three, you'll run those R scripts inside a SQL database using stored procedures.

In this article, you'll learn how to:

  • Import a sample database into an Azure SQL database using R
  • Load the data from the Azure SQL database into an R data frame
  • Prepare the data in R by identifying some columns as categorical

In part two, you'll learn how to create and train multiple machine learning models in R, and then choose the most accurate one.

In part three, you'll learn how to store the model in a database, and then create stored procedures from the R scripts you developed in parts one and two. The stored procedures will run in a SQL database to make predictions based on new data.

Important

Azure SQL Database Machine Learning Services is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Sign up for the preview

Prerequisites

  • Azure subscription - If you don't have an Azure subscription, create an account before you begin.

  • Azure SQL Database Server with Machine Learning Services enabled - During the public preview, Microsoft will onboard you and enable machine learning for your existing or new databases. Follow the steps in Sign up for the preview.

  • RevoScaleR package - See RevoScaleR for options to install this package locally.

  • R IDE - This tutorial uses RStudio Desktop.

  • SQL query tool - This tutorial assumes you're using Azure Data Studio or SQL Server Management Studio (SSMS).

Sign in to the Azure portal

Sign in to the Azure portal.

Import the sample database

The sample dataset used in this tutorial has been saved to a .bacpac database backup file for you to download and use.

  1. Download the file TutorialDB.bacpac.

  2. Follow the directions in Import a BACPAC file to create an Azure SQL database, using these details:

    • Import from the TutorialDB.bacpac file you downloaded
    • During the public preview, choose the Gen5/vCore configuration for the new database
    • Name the new database "TutorialDB"

Load the data into a data frame

To use the data in R, you'll load the data from the Azure SQL database into a data frame (rentaldata).

Create a new RScript file in RStudio and run the following script. Replace Server, UID, and PWD with your own connection information.

#Define the connection string to connect to the TutorialDB database
connStr <- paste("Driver=SQL Server",
               "; Server=", "<Azure SQL Database Server>",
               "; Database=TutorialDB",
               "; UID=", "<user>",
               "; PWD=", "<password>",
                  sep = "");

#Get the data from the table
SQL_rentaldata <- RxSqlServerData(table = "dbo.rental_data", connectionString = connStr, returnDataFrame = TRUE);

#Import the data into a data frame
rentaldata <- rxImport(SQL_rentaldata);

#Take a look at the structure of the data and the top rows
head(rentaldata);
str(rentaldata);

You should see results similar to the following.

   Year  Month  Day  RentalCount  WeekDay  Holiday  Snow
1  2014    1     20      445         2        1      0
2  2014    2     13       40         5        0      0
3  2013    3     10      456         1        0      0
4  2014    3     31       38         2        0      0
5  2014    4     24       23         5        0      0
6  2015    2     11       42         4        0      0
'data.frame':       453 obs. of  7 variables:
$ Year       : int  2014 2014 2013 2014 2014 2015 2013 2014 2013 2015 ...
$ Month      : num  1 2 3 3 4 2 4 3 4 3 ...
$ Day        : num  20 13 10 31 24 11 28 8 5 29 ...
$ RentalCount: num  445 40 456 38 23 42 310 240 22 360 ...
$ WeekDay    : num  2 5 1 2 5 4 1 7 6 1 ...
$ Holiday    : int  1 0 0 0 0 0 0 0 0 0 ...
$ Snow       : num  0 0 0 0 0 0 0 0 0 0 ...

Prepare the data

In this sample database, most of the preparation has already been done, but you'll do one more preparation here. Use the following R script to identify three columns as categories by changing the data types to factor.

#Changing the three factor columns to factor types
rentaldata$Holiday <- factor(rentaldata$Holiday);
rentaldata$Snow    <- factor(rentaldata$Snow);
rentaldata$WeekDay <- factor(rentaldata$WeekDay);

#Visualize the dataset after the change
str(rentaldata);

You should see results similar to the following.

data.frame':      453 obs. of  7 variables:
$ Year       : int  2014 2014 2013 2014 2014 2015 2013 2014 2013 2015 ...
$ Month      : num  1 2 3 3 4 2 4 3 4 3 ...
$ Day        : num  20 13 10 31 24 11 28 8 5 29 ...
$ RentalCount: num  445 40 456 38 23 42 310 240 22 360 ...
$ WeekDay    : Factor w/ 7 levels "1","2","3","4",..: 2 5 1 2 5 4 1 7 6 1 ...
$ Holiday    : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
$ Snow       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

The data is now prepared for training.

Clean up resources

If you're not going to continue with this tutorial, delete the TutorialDB database from your Azure SQL Database server.

From the Azure portal, follow these steps:

  1. From the left-hand menu in the Azure portal, select All resources or SQL databases.
  2. In the Filter by name... field, enter TutorialDB, and select your subscription.
  3. Select your TutorialDB database.
  4. On the Overview page, select Delete.

Next steps

In part one of this tutorial series, you completed these steps:

  • Import a sample database into an Azure SQL database using R
  • Load the data from the Azure SQL database into an R data frame
  • Prepare the data in R by identifying some columns as categorical

To create a machine learning model that uses data from the TutorialDB database, follow part two of this tutorial series: