Tutorial: Develop a clustering model in R with SQL machine learning

Article
05/29/2024

Applies to: SQL Server 2016 (13.x) and later Azure SQL Managed Instance

In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services or on Big Data Clusters to categorize customer data.

In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data.

In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server R Services to cluster customer data.

In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in Azure SQL Managed Instance Machine Learning Services to cluster customer data.

In part one of this series, you'll set up the prerequisites for the tutorial and then restore a sample dataset to a database. In parts two and three, you'll develop some R scripts in an Azure Data Studio notebook to analyze and prepare this sample data and train a machine learning model. Then, in part four, you'll run those R scripts inside a database using stored procedures.

Clustering can be explained as organizing data into groups where members of a group are similar in some way. For this tutorial series, imagine you own a retail business. You'll use the K-Means algorithm to perform the clustering of customers in a dataset of product purchases and returns. By clustering customers, you can focus your marketing efforts more effectively by targeting specific groups. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.

In this article, you'll learn how to:

Restore a sample database

In part two, you'll learn how to prepare the data from a database to perform clustering.

In part three, you'll learn how to create and train a K-Means clustering model in R.

In part four, you'll learn how to create a stored procedure in a database that can perform clustering in R based on new data.

Prerequisites

SQL Server Machine Learning Services with the Python language option - Follow the installation instructions in the Windows installation guide or the Linux installation guide. You can also enable Machine Learning Services on SQL Server Big Data Clusters.

SQL Server Machine Learning Services with the R language option - Follow the installation instructions in the Windows installation guide.

Azure SQL Managed Instance Machine Learning Services. For information, see the Azure SQL Managed Instance Machine Learning Services overview.
SQL Server Management Studio (SSMS) - Use SSMS to restore the sample database to Azure SQL Managed Instance. To download, see SQL Server Management Studio.

Azure Data Studio. You'll use a notebook in Azure Data Studio for SQL. For more information about notebooks, see How to use notebooks in Azure Data Studio.
R IDE - This tutorial uses RStudio Desktop.
RODBC - This driver is used in the R scripts you'll develop in this tutorial. If it's not already installed, install it using the R command install.packages("RODBC"). For more information on RODBC, see CRAN - Package RODBC.

Restore the sample database

The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).

Note

If you are using Machine Learning Services on Big Data Clusters, see how to Restore a database into the SQL Server big data cluster master instance.

Download the file tpcxbb_1gb.bak.
Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:
- Import from the tpcxbb_1gb.bak file you downloaded.
- Name the target database tpcxbb_1gb.
You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:
```
USE tpcxbb_1gb;
SELECT * FROM [dbo].[customer];
```

Download the file tpcxbb_1gb.bak.
Follow the directions in Restore a database to a Managed Instance in SQL Server Management Studio, using these details:
- Import from the tpcxbb_1gb.bak file you downloaded.
- Name the target database tpcxbb_1gb.
You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:
```
USE tpcxbb_1gb;
SELECT * FROM [dbo].[customer];
```

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database.

Next step

In part one of this tutorial series, you completed these steps:

Installed the prerequisites
Restored a sample database

To prepare the data for the machine learning model, follow part two of this tutorial series:

Prepare data to perform clustering

Share via