SQL Server Machine Learning and R Services (In-Database)
An in-database installation of machine learning operates within the context of a SQL Server database engine instance, providing R and Python external script support for resident data in your SQL Server instance. Because machine learning is integrated with SQL Server, you can keep analytics close to the data and eliminate the costs and security risks associated with data movement.
Because the database engine is multi-instance, you can install more than one instance of in-database analytics, or even older and newer versions side-by-side. Choices include either SQL Server 2017 Machine Learning Services (In-Database) with R and Python, or SQL Server 2016 R Services (In-Database) with just R.
Machine learning components can also be installed as instance-agnostic standalone servers. Generally, we recommend that you treat (Standalone) and (In-Database) installations as mutually exclusive to avoid resource contention, but if you have sufficient resources, nothing prevents you from installing both on the same physical computer.
Choosing between in-database and standalone analytics
Understanding your development requirements can help you choose between (In-Database) and (Standalone) approaches. A standalone server is simpler to configure and manage if you want maximum flexibility in how it is used, or if you want to also connect to a variety of data sources outside of SQL Server.
In-database analytics are designed for deep integration with data within SQL Server. You can write T-SQL queries that call R or Python functions and execute the script in SQL Server Management Studio or any tool or app used for external or embedded T-SQL. If you need to run R or Python code in SQL Server, either by using stored procedures or by using the SQL Server instance as the compute context, you must install in-database analytics. This option provides maximum data security and integration with SQL Server tools.
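As a minimal sketch, a T-SQL call that runs R in-database looks like the following. The feature must be enabled once per instance; the sample input values are illustrative.

```sql
-- Enable external script execution once per instance
-- (the Launchpad service must be restarted afterward).
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;

-- Run a trivial R script in SQL Server: double each input value
-- and return the result to the caller as a rowset.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet * 2;',
    @input_data_1 = N'SELECT CAST(1 AS int) AS val UNION ALL SELECT 2 UNION ALL SELECT 3;'
WITH RESULT SETS ((doubled int));
```

The same pattern applies to Python in SQL Server 2017 by passing `@language = N'Python'`.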
Both in-database and standalone servers can alleviate the memory and processing constraints of open-source R and Python. Both options include the same packages and tools, with the ability to load and process large amounts of data on multiple cores and aggregate the results into a single consolidated output. The functions and algorithms are engineered for both scale and utility: delivering predictive analytics, statistical modeling, data visualizations, and leading-edge machine learning algorithms in a commercial server product engineered and supported by Microsoft.
Components of an in-database installation
SQL Server 2016 supports R only. SQL Server 2017 supports both R and Python. The following table describes the features in each version. With the exception of the SQL Server Launchpad service, this table is identical to the one provided in the standalone server article.
|Component|Description|
|---------|-----------|
|SQL Server Launchpad service|A service that manages communication between the external R and Python runtimes and the SQL Server instance.|
|R packages|RevoScaleR is the primary library for scalable R, with functions for data manipulation, transformation, visualization, and analysis. <br/>MicrosoftML (R) adds machine learning algorithms to create custom models for text analysis, image analysis, and sentiment analysis. <br/>mrsdeploy offers web service deployment (in SQL Server 2017 only). <br/>olapR is for specifying MDX queries in R.|
|Microsoft R Open (MRO)|MRO is Microsoft's open-source distribution of R. The package and interpreter are included. Always use the version of MRO bundled in setup.|
|R tools|R console windows and command prompts are standard tools in an R distribution. Find them at \Program Files\Microsoft SQL Server\140\R_SERVER\bin\x64.|
|R samples and scripts|Open-source R and the RevoScaleR package include built-in data sets so that you can create and run script using pre-installed data. Look for them at \Program Files\Microsoft SQL Server\140\R_SERVER\library\datasets and \library\RevoScaleR.|
|Python packages|revoscalepy is the primary library for scalable Python, with functions for data manipulation, transformation, visualization, and analysis. <br/>microsoftml (Python) adds machine learning algorithms to create custom models for text analysis, image analysis, and sentiment analysis.|
|Python tools|The built-in Python command-line tool is useful for ad hoc testing and tasks. Find the tool at \Program Files\Microsoft SQL Server\140\PYTHON_SERVER\python.exe.|
|Anaconda|Anaconda is an open-source distribution of Python and essential packages.|
|Python samples and scripts|As with R, Python includes built-in data sets and scripts. Find the revoscalepy data at \Program Files\Microsoft SQL Server\140\PYTHON_SERVER\lib\site-packages\revoscalepy\data\sample-data.|
|Pre-trained models in R and Python|Pre-trained models are supported and usable on a standalone server, but you cannot install them through SQL Server Setup. The setup program for Microsoft Machine Learning Server provides the models, which you can install free of charge. For more information, see Install pretrained machine learning models on SQL Server.|
Get started step-by-step
Start with setup, attach the binaries to your favorite development tool, and write your first script.
Step 1: Install the software
Install either one of these versions:
- SQL Server 2017 Machine Learning Services (In-Database), with R and Python
- SQL Server 2016 R Services (In-Database), with R only
Step 2: Configure a development tool
Configure your development tools to use the Machine Learning Server binaries. For Python, see Link Python binaries. For RStudio, see Using Different Versions of R, and point the tool to C:\Program Files\Microsoft SQL Server\140\R_SERVER\bin\x64. You can also try R Tools for Visual Studio.
Data scientists typically use R or Python on their own laptop or development workstation to explore data, and to build and tune models until a good predictive model is achieved.
With in-database analytics in SQL Server, there is no need to change this process. After installation is complete, you can run R or Python code on SQL Server either locally or remotely:
Use the IDE you prefer. R Services (In-Database) client components provide the data scientist with all the tools needed to experiment and develop. These tools include the R runtime, the Intel math kernel library to boost the performance of standard R operations, and a set of enhanced R packages that support executing R code in SQL Server.
Work remotely or locally. Data scientists can connect to SQL Server and bring the data to the client for local analysis, as usual. However, a better solution is to use the RevoScaleR or revoscalepy APIs to push computations to the SQL Server computer, avoiding costly and insecure data movement.
Embed R or Python scripts in Transact-SQL stored procedures. When your code is fully optimized, wrap it in a stored procedure to avoid unnecessary data movement and optimize data processing tasks.
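As an illustrative sketch of the last point, tuned R code can be wrapped in a stored procedure like this; the procedure, table, and column names here are hypothetical.

```sql
-- Wrap a trained-and-tuned R script in a stored procedure so that
-- applications can call it with a single T-SQL statement.
CREATE PROCEDURE dbo.predict_rentals
AS
BEGIN
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'
model <- lm(RentalCount ~ Month + Day, data = InputDataSet);
OutputDataSet <- data.frame(predicted = predict(model, InputDataSet));',
        @input_data_1 = N'SELECT RentalCount, Month, Day FROM dbo.rental_data;'
    WITH RESULT SETS ((predicted float));
END;
```

Because the data is read and scored inside the instance, nothing leaves SQL Server except the final predictions.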
Step 3: Write your first script
Call R or Python functions from within T-SQL script:
- R: Use R code in Transact-SQL
- R: In-database analytics for SQL developers
- Python: Run Python using T-SQL
- Python: In-database analytics for SQL developers
Choose the best language for the task. R is best for statistical computations that are difficult to implement using SQL. For set-based operations over data, leverage the power of SQL Server to achieve maximum performance. Use the in-memory database engine for very fast computations over columns.
Step 4: Optimize your solution
When the model is ready to scale on enterprise data, the data scientist often works with the DBA or SQL developer to optimize processes such as:
- Feature engineering
- Data ingestion and data transformation
Traditionally, data scientists using R have had problems with both performance and scale, especially when working with large data sets. That is because the common runtime implementation is single-threaded and can accommodate only those data sets that fit into the available memory on the local computer. Integration with SQL Server Machine Learning Services provides multiple features for better performance, with more data:
RevoScaleR: This R package contains implementations of some of the most popular R functions, redesigned to provide parallelism and scale. The package also includes functions that further boost performance and scale by pushing computations to the SQL Server computer, which typically has far greater memory and computational power.
revoscalepy: This Python library, available in SQL Server 2017, implements the most popular functions in RevoScaleR, such as remote compute contexts, and many algorithms that support distributed processing.
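For example, a revoscalepy function can be invoked from T-SQL in SQL Server 2017; this is a sketch, and the table and column names (dbo.AirlineDemoSmall, ArrDelay) are hypothetical.

```sql
-- Call a scalable revoscalepy function over data resident in SQL Server.
-- rx_summary computes summary statistics using the multi-threaded runtime.
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
from revoscalepy import rx_summary
summary = rx_summary("~ ArrDelay", data = InputDataSet)
print(summary)
',
    @input_data_1 = N'SELECT ArrDelay FROM dbo.AirlineDemoSmall;';
```

The output of print appears in the Messages pane of SQL Server Management Studio.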
Step 5: Deploy and consume
After the script or model is ready for production use, a database developer might embed the code or model in a stored procedure, so that the saved R or Python code can be called from an application. Storing and running code in SQL Server has many benefits: you can use the convenient Transact-SQL interface, and all computations take place in the database, avoiding unnecessary data movement.
Secure and extensible. R Services (In-Database) uses a new extensibility architecture that keeps your database engine secure and isolates R and Python sessions. You also have control over the users who can execute scripts, and you can specify which databases can be accessed by code. You can control the amount of resources allocated to the runtime, to prevent massive computations from jeopardizing the overall server performance.
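Control over who can run scripts is expressed through ordinary T-SQL permissions. A minimal sketch, where the database and user names are hypothetical:

```sql
-- Allow a specific database user to run external R or Python scripts
-- in this database, and nothing more.
USE MyAnalyticsDB;
GRANT EXECUTE ANY EXTERNAL SCRIPT TO [data_scientist];
```

Users without this permission cannot invoke sp_execute_external_script at all, which keeps external-runtime access opt-in per database.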
Scheduling and auditing. When external script jobs are run in SQL Server, you can control and audit the data used by data scientists. You can also schedule jobs and author workflows containing external R or Python scripts, just like you would schedule any other T-SQL job or stored procedure.
To take advantage of the resource management and security features in SQL Server, the deployment process might include these tasks:
- Converting your code to a function that can run optimally in a stored procedure
- Setting up security and locking down packages used by a particular task
- Enabling resource governance (requires the Enterprise edition)
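Resource governance for external runtimes is configured through an external resource pool. A sketch, assuming the Enterprise edition; the pool name and limits are illustrative:

```sql
-- Cap the CPU and memory available to external R and Python processes
-- so heavy analytics cannot starve the database engine.
CREATE EXTERNAL RESOURCE POOL ml_pool
WITH (MAX_CPU_PERCENT = 50, MAX_MEMORY_PERCENT = 30);

-- Apply the new pool configuration.
ALTER RESOURCE GOVERNOR RECONFIGURE;
```

Sessions can then be mapped to the pool through a Resource Governor classifier function, just as with internal resource pools.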