Loan Classification using SQL Server 2016 R Services
R Services (In-database) provides a platform for developing and deploying intelligent applications that uncover new insights. You can use the rich and powerful R language and the many packages from the community to create models and generate predictions using your SQL Server data. Because R Services (In-database) integrates the R language with SQL Server, you can keep analytics close to the data and eliminate the costs and security risks associated with data movement.
In this post, we will see how we can leverage SQL Server 2016 as a Scoring Engine to predict "bad" loans. Loans that indicate good repayment behavior are considered "good" and loans that indicate less than perfect repayment behavior are considered "bad".
The Azure Data Science VM comes pre-installed with SQL Server 2016 Developer edition and can be used readily for our scoring experiment. We will use the publicly available dataset of Lending Club loan performance.
This experiment consists of 6 steps. We shall define stored procedures in SQL for each of these steps and execute them in order.
Create a DB, say ‘lendingclub’ in SQL Server
CREATE DATABASE [lendingclub]
Download the loan data into a folder (say C:\lendingclub)
There are 6 csv files: LoanStats3a.csv, LoanStats3b.csv, LoanStats3c.csv, LoanStats3d.csv, LoanStats_2016Q1.csv, LoanStats_2016Q2.csv. Use the following clean-up steps for the csv files :
1. Remove this line at the beginning of file : “Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)”
2. Remove these lines at the end of file : “Total amount funded in policy code 1: Total amount funded in policy code 2: ”
3. Remove the line “Loans that do not meet the credit policy” in LoanStats3a.csv at line number 39788, 39789, 39790.
Now, let us import the data from these 6 csv files to a single table [dbo].[LoanStats] in lendingclub DB. There are many ways to import csv data into SQL (using bcp, powershell etc) but we are going to use something quick and easy – sp_execute_external_script !
Executing the above stored procedure creates a table [dbo].[LoanStats] with loan data :
Once the data is loaded, let us do some cleaning using SQL and also identify good/bad loans based on the value of the column [loan_status] :
Before we run feature-selection, let’s install the required R packages by following this guide :
Right Click C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\bin\x64\Rgui.exe and Run as Administrator. Run the following R code to install the packages:
install.packages(c("reshape2","ggplot2","ROCR","plyr","Rcpp","stringr","stringi","magrittr","digest","gtable", "proto","scales","munsell","colorspace","labeling","gplots","gtools","gdata","caTools","bitops"), lib = "C:\\Program Files\\Microsoft SQL Server\\MSSQL13.MSSQLSERVER\\R_SERVICES\\library")
There are many ways to perform feature selection/variable selection. A quick and easy way is to use ggplot to plot the distribution of is_bad variable for each numeric variable. From this plot, we can identify variables which have significant differences in the is_bad=0(good) and is_bad=1(bad) population.
Stored procedure to plot distribution.
Executing the above stored procedure creates a JPEG image in C:\temp\plots
Good predictors have significant differences in the distribution lines factor(is_bad)=0 (Red) and factor(is_bad)=1 (Blue), for example : int_rate. ID variables like id, member_id will also be plotted in the above image (since they are numeric as well), but they cannot be used as predictor variables - hence ignore such variables. Let us choose some predictors variables by quickly analyzing the above image and use them in our Random Forest Model. (NOTE: This is not the final/best list of predictors, you may choose other methods to find predictors as well)
revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint, total_rec_prncp, all_util
We shall use SQL to split the Loan Data into Training Data 75% ([dbo].[LoanStatsTrain]) and Test Data 25% ([dbo].[LoanStatsTest]). Training data will be used to build the model and the built model shall be applied on test data for evaluation.
Stored procedure to split the data
Executing the above stored procedure creates 2 tables [dbo].[LoanStatsTrain] and [dbo].[LoanStatsTest]
Build a Random Forest Model using the variables we identified in Feature Selection and store the model in a table [dbo].[models] to be used later for scoring.
Executing the above stored procedure builds a random forest model and stores it in [dbo].[models]
SELECT * FROM [dbo].[models]
Stored Procedure to use the model in [dbo].[models] and score test data [dbo].[LoanStatsTest]
Executing the above stored procedure creates a table [dbo].[LoanStatsPredictions] which contains the predictions in is_bad_Pred column
SELECT TOP 10 * FROM [dbo].[LoanStatsPredictions]
Executing the above stored procedure will save the ROC curve in C:\temp\plots as well as output the Area Under The Curve measure (AUC)
NOTE : We can improve scoring performance (number of predictions per second) using In-Memory Tables, Columnstore Indexes and Resource Governor Configurations. These steps are detailed with github code in this blog post : A walkthrough of Loan Classification using SQL Server 2016 R Services
Lending Club Statistics Machine Learning for Predicting Bad Loans Variable Importance Plot and Variable Selection Machine Learning Templates with SQL Server 2016 R Services SQL Server R Services Tutorials Provision the Microsoft Data Science Virtual Machine sp_execute_external_script (Transact-SQL) Explore and Visualize the Data (In-Database Advanced Analytics Tutorial) Selecting Rows Randomly from a Large Table Receiver operating characteristic Area under the curve The Area Under an ROC Curve