Volume 32 Number 7
Doing Data Science and AI with SQL Server
By Wee Hyong | July 2017
Data is an important asset for every business. Whether you’re in the retail, health care, telecommunications, utilities or financial businesses, you’re familiar with the following two use cases:
- In an online transaction processing (OLTP) scenario, transactional data is stored in a database. The transaction data is produced by various line-of-business (LOB) applications.
- In a data warehousing scenario, data is acquired from various heterogeneous data sources, transformed and cleansed, and loaded into data warehouses. The consolidated data provides the single source of truth for business reporting and dashboards. At the same time, it also enables interactive analysis via multi-dimensional online analytical processing (OLAP) cubes, and tabular models.
Getting from raw data to insights empowers business decision makers to gain a deeper understanding into each aspect of the business and helps them react to new business situations quickly. For example, consider a retail scenario. The business analyst notices that sales are dropping for specific retail stores. The business analyst wants to drill down to understand the details on what’s causing the drop in sales. By being able to run the analysis (aggregating, joining of data from multiple data sources, filtering and so on) on a large amount of data, it enables deep analysis of customer behavior and trends in the retail stores. Microsoft SQL Server powers these mission-critical applications.
Many companies have started on digital transformation to modernize their data platform to keep pace with the ever-growing requirements on the type of data that needs to be stored and the volume in which the data is being acquired.
As part of this digital transformation, advanced analytics plays an important role. Specifically, companies have been either building up data science teams within their companies or leveraging external resources to do data science. They use data science to distill data assets into nuggets of gold that can help them proactively deliver personalized customer experiences (personalized Web sites, product recommendations, customer lifetime value and so on), reduce downtime for equipment (predicting remaining useful lifetime) and more. The potential use of data science and how it can literally change businesses is exciting.
Some common use cases (non-exhaustive) of data science include the following:
Identifying Loan Credit Risk: A lending institution (a credit bureau) might want to leverage loan credit risk models to determine the borrowers that are likely to default and reduce the number of loans given to these high-risk borrowers.
Managing Customer Churn: Customer churn models have been used extensively (in retail and by telecommunication providers). For example, customers leveraging mobile services offered by telecommunication providers have a rich variety of choices and can easily switch between service providers. Managing customer churn is important to reduce customer acquisition costs and maintain a high-quality service. In addition, retail companies are using churn models to predict customers that are most likely to churn and to understand the key factors that lead to those churns.
Reducing Hospital Readmission: Reducing readmission rates for various medical conditions (heart attack, pneumonia, coronary artery bypass surgery, to name a few) is important to hospitals. In the United States, hospitals face penalties if the readmission rate is too high. Hospitals leverage predictive models for predicting patients that are more susceptible to being readmitted within 30 days. This helps them understand the root causes for the 30-day readmission, and helps them work toward addressing them.
This presents an exciting opportunity for database professionals and developers to either work with data scientists, or put on a data scientist hat to build predictive models that can help to assess credit loan risk, manage customer churn, reduce hospital admissions and more. The possibilities for developers to turn all these raw data assets sitting in the database to golden, strategic insights is exciting.
This article shows how you can work with data science and artificial intelligence (AI) with SQL Server. You’ll also learn how to jump-start your journey of using R and Python with SQL Server.
Why Doing Data Science with SQL Server Matters
What does doing data science mean and why does it matter to the database person? Today, most data scientists first figure out how to connect to many data sources (databases included), bring the data out from the database, and use the historical data to train and subsequently test the machine learning models that they’ve built.
A typical approach used by data scientists is to read the data from the database into the client that they’re using for building the model. Once the data is read, they combine the data with other data sources. For data scientists developing the models using R, packages like dplyr are commonly used for performing aggregation, joins and for filtering. With the data transformed into the right shape, data scientists continue the data modeling process, and start doing feature engineering. As part of feature engineering, new features (such as adding new columns in a SQL Server table) might get created, existing features might get transformed (scaling it to -1 to 1, or 0 to 1, applying logarithmic transformation, computing the z-score, binnning the data, and so on) or removed. Feature engineering plays a very important role in laying the groundwork needed for a good predictive model. Once all these steps are completed, the data scientist develops the models and validates it using test data before figuring out an operationalization plan for the model to be deployed to production so that applications can consume them.
At this point, as a database person, you might ask, “Why do I need to move the data out from the database to do data science? Should we push the processing of joins and aggregations (Sum, Min, Max and so on) into the database?”
Why does it make sense to do this in the database? First, data movement is expensive. If the machine learning models can run where the data is stored, this removes the need to move data between the database and the client application. Second, a new working copy of the data is extracted from the database and stored external to the database. The implication is that many of the security policies and audits that apply to data stored in the database can no longer be enforced. Third, if the computation of joins and aggregations can be done where the data is located, you can leverage decades of database innovations (leveraging indexes—clustered and non-clustered, in-memory tables, column stores, high-availability and so on). If training the model can be done where the data is stored, it can lead to performance improvements.
In addition, if the data science project involves working with spatial data, temporal data or semi-structured data, you can leverage SQL Server capabilities that let you do this efficiently. For example, if you’re working on a data science project (say a land-use classification problem) where you must manipulate spatial data, the geography and geometry data types in SQL Server will provide a good way to store the spatial data. Once it’s stored as spatial data in SQL Server, you can leverage SQL Server spatial functions to query for nearest neighbors, compute the spatial distance using different spatial reference systems and more. The spatial data can be indexed to facilitate efficient query processing.
As a database professional and developer, you have tremendous knowledge and value to bring to a data science project. By doing data science and AI where the data resides, there are many benefits. These include being able to take advantage of the enterprise-grade performance, scale, security and reliability that you’ve come to expect from SQL Server over the years. More important, you eliminate the need for expensive data movement.
Figure 1 illustrates the difference between doing data science and AI outside of the database. From the application perspective, a developer doesn’t need to learn new methods to tap the power of AI teeming in SQL Server. It connects to it the same way it connects to a database today, and invokes SQL Server-stored procedures, which encapsulates the R or Python code. The stored procedure has just become an intelligent AI stored procedure.
Figure 1 Doing Data Science and AI Where the Data Is Stored
Another important consideration in data science projects is operationalization. The predictive model that has been developed by data scientists needs to be deployed into a production environment for it to be used by applications. With the release of SQL Server 2016 R Services, you can wrap R code as part of the stored procedures. After training is done, the predictive models are stored as varbinary(max) in a database table.
An application (.NET, Java, Node.js and more) would connect to SQL Server and invoke the stored procedures to use the predictive model for making predictions on new instances of data. Continuing the momentum, SQL Server 2017 CTP2 added Python support. You now have the best of multiple worlds: the ability to write code in R or Python, leverage the rich set of R and Python libraries for machine learning and deep learning, and consume the predictive models in any application.
Using Stored Procedures for Machine Learning and AI
By encapsulating the machine learning and AI models as part of the SQL Server stored procedure, it lets SQL Server serve AI with the data. There are other advantages for using stored procedures for operationalizing machine learning and AI (ML/AI). These include:
Applications can leverage existing database drivers to connect to SQL Server: Many programming languages have database drivers that enable them to connect to SQL Server. These database drivers (OLEDB, ODBC, JDBC, MSPHPSQL and Node.js Driver for SQL Server) are used by application developers to develop cutting-edge applications that “talk” to SQL Server.
In addition, companies might have existing LOB applications that are already operational. By leveraging ML/AI-stored procedures in SQL Server, these LOB applications can easily evolve into intelligent applications. With the R or Python code needed to work with the ML/AI models encapsulated in stored procedures, application developers can now leverage their ML/AI stored procedures as is (without requiring new libraries or learning new database access patterns). To the application layer, the intelligent ML/AI stored procedure behaves just like any SQL Server stored procedure.
Being backed by the full spectrum of SQL Server enterprise-ready capabilities: Some considerations include ...
- Where can I host the model for doing inference?
- Which users can access the ML/AI model?
- When the model is used for prediction (aka “inference”), it might need to access specific data in the database. In what security context should the model and associated R/Python code execute?
- How can I ensure the R/Python code doesn’t use up all the SQL Server resources?
SQL Server provides enterprise-ready capabilities from using row-level security to limit the data that can be accessed, to providing database admins with both server and database scoped database audits, to enabling ownership-chaining for SQL Server securable, to being able to sign stored procedures with a certificate or asymmetric key, resource governance and more. These enterprise-ready SQL Server capabilities can be used by the ML/AI stored procedures as is, without requiring the data scientist to reinvent the wheel for serving data at scale. Most important, the DBAs today can leverage their existing skills to secure and manage the ML/AI stored procedures.
Mature development tools to develop the ML/AI stored procedure: As a database developer, you can develop the stored procedure and the R and Python code in one place: Visual Studio. With the availability of SQL Server Data Tools for Visual Studio, R Tools for Visual Studio, and Python Tools for Visual Studio, you can do development of the T-SQL, R, or Python code, check it into a source control system, write unit tests, automate testing, and perform code review, and more. Database developers and data scientists can now work together to develop the ML/AI stored procedures, each focusing on their respective areas of expertise.
Steps to Get Started with SQL Server, Data Science and AI
There has never been a more exciting time and better opportunity for us as database professionals and developers to work with data science and AI with SQL Server. You can get started in three easy steps:
Install SQL Server 2016 or SQL Server 2017 CTP2. When installing SQL Server 2017 CTP2, you select the type of in-database machine learning services that you want to install. You can choose to use R, Python or both. Once SQL Server completes setup, you’ll be able to start using R or Python code as part of stored procedures, as shown in Figure 2.
Figure 2 Using SQL Server 2017 CTP2 Setup to Install Machine Learning Services (In-Database)
Note: If you’re using SQL Server 2016, only R-Services (In-Database) will be shown at setup.
You can refer to bit.ly/2qXoyyC for more information on setting up R and Python with SQL Server.
Enable external script. To use R or Python code in the stored procedure, you’ll need to configure SQL Server to allow external scripts. To enable external scripts, you’ll need to run the sp_configure and reconfigure commands (after the T-SQL code is successfully executed, you’ll need to restart the SQL Server service):
exec sp_configure 'external scripts enabled', 1 reconfigure with override
After running the commands, you’ll see the following output:
Configuration option 'external scripts enabled' changed from 1 to 1. Run the RECONFIGURE statement to install.
Hello, AI! You’re now ready to do data science and AI with SQL Server. Using either Visual Studio, Visual Studio Code or SQL Server Management Studio, you can connect to the SQL Server instance with R or Python. You can run the code provided in Figure 3 to make sure that R and Python are installed correctly. The code will output “Hello AI.”
Figure 3 Code to Ensure R and Python Are Installed Correctly
Code for R exec sp_execute_external_script @language =N'R', @script=N'OutputDataSet<-InputDataSet', @input_data_1 =N'select ''Hello AI'' as txt' with result sets (([txt] nvarchar(8))); go Code for Python exec sp_execute_external_script @language =N'Python', @script=N'OutputDataSet = InputDataSet', @input_data_1 =N'select ''Hello AI'' as txt' with result sets (([txt] nvarchar(8)));
If you need to install additional R libraries, you should set lib.SQL as the location of the SQL Server library. Similarly, if you need to install additional Python libraries, you can do a pip install of the relevant Python libraries and set the PYTHONPATH to point to where Python has been installed.
You can find the default locations of the R library files at :\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\R_SERVICES\library and Python library files at :\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\library.
Let’s dive deeper into what’s in a stored procedure (with R code) that’s used for training a customer churn classification model. Figure 4 shows a stored procedure that trains a customer churn classification model by building an ensemble of decision trees using R. Figure 5 shows a similar stored procedure that trains a similar churn model using Python. Database developers leverage familiar skills of using T-SQL queries to select data from SQL server tables. The data is used as inputs to the R or Python code.
Figure 4 Creating a Stored Procedure to Train a Decision Forest Model Using R
CREATE PROCEDURE trainRDecisionForestModel AS BEGIN execute sp_execute_external_script @language = N'R', @script = N' require("RevoScaleR"); labelVar = "churn" trainVars <- rxGetVarNames(telcoCDR_Data_train) trainVars <- trainVars[!trainVars %in% c(labelVar)] temp <- paste(c(labelVar, paste(trainVars, collapse = "+")), collapse = "~") formula <- as.formula(temp) rx_forest_model <- rxDForest(formula = formula, data = telcoCDR_Data_train, nTree = 8, maxDepth = 32, mTry = 2, minBucket=1, replace = TRUE, importance = TRUE, seed=8, parms=list(loss=c(0,4,1,0))) rxDForest_model <- data.frame( payload = as.raw(serialize(rx_forest_model, connection=NULL))); ' @input_data_1 = N'select * from telcoCDR_Data_train' @input_data_1_name = N'telcoCDR_Data_train' @output_data_1_name = N'rxDForest_model' with result sets ((model varbinary(max))); end;
Figure 5 Creating a Stored Procedure to Train a Random Forest Model Using Python
CREATE PROCEDURE trainPythonRandomForestModel (@trained_model varbinary(max) OUTPUT) AS BEGIN execute sp_execute_external_script @language = N'Python', @script = N' df = churn_train_data # Get all the columns columns = df.columns.tolist() # Specify the label column target = "churn" import numpy as np from sklearn.ensemble import RandomForestClassifier churn_model = RandomForestClassifier(n_estimators=20, max-depth=5) churn_model.fit(df[columns], df[target]) import pickle #Serialize the model as a binary object trained_model = pickle.dumps(lin_model) @input_data_1 = N'select "TotalMinutesUsedLastMonth", "State", "CallDropRate", "UnPaidBalance", "TotalCallDuration", "TotalDataUsageMB" from dbo.telco_churn_data where Year = 2017' @input_data_1_name = N'churn_train_data' @params = N'@trained_model varbinary(max) OUTPUT' @trained_model = @trained_model OUTPUT; END;
Once the model is trained, it’s serialized and returned as varbinary(max). The model can then be stored in a SQL Server table.
The complete sample code for a customer churn model can be found at aka.ms/telcochurnsamples.
With SQL Server 2017 CTP2, you can run Python code in the stored procedures. By supporting Python in SQL Server, it opens up new opportunities for you to use many of the deep learning toolkits (CNTK, TensorFlow and more), which provide Python APIs. In addition, the deep-learning toolkits enable you to specify the use of GPUs when training your model. You can now use SQL Server 2017 to perform intensive deep-learning jobs on text, images, and unstructured data that are stored in SQL Server, and then operationalize and do inference with SQL Server. Very exciting!
SQL Server has evolved over the years into a top-notch, enterprise-ready, scalable and hybrid data platform. This lets companies build intelligent, mission-critical applications, backed by decades of database innovations from indexes, spatial indexes, in-memory, column stores, high availability, resource governance and more. With the SQL Server 2017 release, with built-in R and Python support, SQL Server is in a unique position to fuel innovations that database professionals and developers can co-create with the data science and AI communities. The possibilities are endless.
Wee Hyong has worn many hats in his career: developer, program/product manager, data scientist, researcher and strategist. His gamut of experience spanning industry and research has given him unique abilities to help organizations accelerate their digital transformations using data science and artificial intelligence. You can follow him on Twitter: @weehyong.
Thanks to the following Microsoft technical expert for reviewing this article: Joy Qiao
Joy Qiao is a senior solution architect at Microsoft, and is responsible for driving end-to-end big data analytics and machine learning solution architectures among the partner eco-system. Joy has over 15 years of IT industry experience including 11 years at Microsoft working as technical lead/architect roles at various Microsoft Azure teams, as well as senior consultant/architect in the Microsoft services team.