Introducing Spark Machine Learning on SQL Server Big Data Clusters

Applies to: SQL Server 2019 (15.x)

Important

The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.

This article explains how to effectively use Spark for Machine Learning on SQL Server Big Data Clusters.

Spark Machine Learning in SQL Server Big Data Clusters

SQL Server Big Data Clusters enables machine learning scenarios and solutions using two technology stacks: SQL Server Machine Learning Services and Apache Spark ML.

To better understand when to use each technology stack, refer to the Machine Learning guide for SQL Server Big Data Clusters. This guide covers Apache Spark ML.

For machine learning scenarios on big data, hosting the data in HDFS and using Apache Spark ML capabilities is a more cost-effective, scalable, and powerful approach. The scenarios referenced in this article are far from an exhaustive list of what can be achieved with Spark machine learning. For a complete list of features, see Spark MLlib.

The next section provides a curated list of scenarios and references for Spark in SQL Server Big Data Clusters.

Building blocks for Spark Machine Learning on SQL Server Big Data Clusters

| Learn | Contents | Link |
|---|---|---|
| SQL Server Big Data Clusters runtime for Apache Spark | This will show what's included with each release | SQL Server Big Data Clusters runtime for Apache Spark Guide |
| The Storage Pool | How to store and use HDFS + Spark together to unlock data for machine learning | Introducing the storage pool in SQL Server Big Data Clusters |
| Use notebook-based experiences and your tools of choice | Connect Spark-Livy endpoint using your tools of choice | Submit Spark jobs on SQL Server Big Data Clusters in Azure Data Studio; Submit Spark jobs on SQL Server big data cluster in Visual Studio Code; Use sparklyr in SQL Server big data cluster |
| How to install extra packages | In the case a package is not provided out-of-the-box, install it | Spark library management |
| How to troubleshoot | In case it breaks | Troubleshoot a pyspark notebook; Debug and Diagnose Spark Applications on SQL Server Big Data Clusters in Spark History Server |
| How to submit machine learning batch jobs | Make ML training and batch scoring run using the command line | Submit Spark jobs by using command-line tools |
| How to quickly move data between SQL Server and Spark | Make SQL Server source and/or destination for your Spark ML scenarios. Usage of HDFS is not mandatory | Use the Apache Spark Connector for SQL Server and Azure SQL |
| Spark model operationalization | After training, operationalize using MLeap | Create, export, and score Spark machine learning models on SQL Server Big Data Clusters |
| Data wrangling | Along with Spark's powerful data wrangling capabilities, we ship PROSE | Data Wrangling using PROSE Code Accelerator |
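
As an illustration of using SQL Server as a source or destination for Spark ML data, the following minimal sketch builds the option map that the Apache Spark Connector for SQL Server and Azure SQL expects and passes it to a Spark reader or writer. The helper names (`sql_connector_options`, `read_table`, `write_table`) are hypothetical and introduced only for this example. The host, database, table, and credential values are placeholders, and the sketch assumes the connector is available in the Spark session. Refer to the linked connector article for the supported options.

```python
from typing import Dict

# Data source name registered by the Apache Spark Connector for SQL Server and Azure SQL.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def sql_connector_options(host: str, database: str, table: str,
                          user: str, password: str, port: int = 1433) -> Dict[str, str]:
    """Build the options passed to the connector's reader or writer (hypothetical helper)."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database};",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, options: Dict[str, str]):
    """Load a SQL Server table as a Spark DataFrame. Requires a live Spark session."""
    return spark.read.format(CONNECTOR_FORMAT).options(**options).load()


def write_table(df, options: Dict[str, str], mode: str = "append") -> None:
    """Write a Spark DataFrame (for example, batch scoring output) to a SQL Server table."""
    df.write.format(CONNECTOR_FORMAT).mode(mode).options(**options).save()
```

In a notebook, where a `spark` session already exists, a call might look like `read_table(spark, sql_connector_options("<host>", "<database>", "dbo.<table>", "<user>", "<password>"))`. Keeping the option map in one helper means the same settings can be reused for both the training read and the scoring write.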

Next steps

For more information, see Introducing SQL Server Big Data Clusters.