Deep learning (Preview)

02/28/2024

Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options for training machine learning models using Apache Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.

Warning

The GPU-accelerated preview is limited to the Apache Spark 3.1 (unsupported) and Apache Spark 3.2 (End of Support announced) runtimes.
End of support for Azure Synapse Runtime for Apache Spark 3.1 was announced on January 26, 2023. Official support was discontinued effective January 26, 2024, and support tickets, bug fixes, and security updates are no longer addressed beyond that date.
End of support for Azure Synapse Runtime for Apache Spark 3.2 was announced on July 8, 2023. It receives no further bug or feature fixes, although security fixes may be backported based on risk assessment. It will be retired and disabled as of July 8, 2024.

GPU-enabled Apache Spark pools

To simplify the process of creating and managing pools, Azure Synapse pre-installs low-level libraries and sets up the networking requirements between compute nodes. This integration allows users to get started with GPU-accelerated pools within a few minutes. To learn how to create one, see the quickstart on creating a GPU-accelerated pool.

Note

GPU-accelerated pools can be created in workspaces located in East US, Australia East, and North Europe.
GPU-accelerated pools are available only with the Apache Spark 3.1 (unsupported) and Apache Spark 3.2 runtimes.
You might need to request a limit increase in order to create GPU-enabled clusters.

GPU ML Environment

Azure Synapse Analytics provides built-in support for deep learning infrastructure. The Azure Synapse Analytics runtimes for Apache Spark 3 include support for the most common deep learning libraries, such as TensorFlow and PyTorch. The Azure Synapse runtime also includes supporting libraries such as Petastorm and Horovod, which are commonly used for distributed training.
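
As a quick sanity check, the following sketch reports which of the libraries named above can be found in the current Python session. It uses only the standard library. The module names (`tensorflow`, `torch`, `horovod`, `petastorm`) are the usual import names for these projects, and the helper name `check_libraries` is hypothetical, not part of any Azure Synapse API.

```python
import importlib.util

# Usual top-level import names for the libraries mentioned above.
DEEP_LEARNING_MODULES = ["tensorflow", "torch", "horovod", "petastorm"]


def check_libraries(module_names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# Print one line per library without actually importing it.
for name, found in check_libraries(DEEP_LEARNING_MODULES).items():
    print(f"{name}: {'available' if found else 'not found'}")
```

Because `find_spec` only locates a module and does not import it, the check is fast and has no side effects on the session.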

TensorFlow

TensorFlow is an open source machine learning framework for all developers. It is used for implementing machine learning and deep learning applications.

For more information about TensorFlow, you can visit the TensorFlow API documentation.
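
To confirm that TensorFlow can see the GPUs on a node, a minimal sketch such as the following may help. It assumes TensorFlow 2.x, where `tf.config.list_physical_devices` is available; the function name `list_tensorflow_gpus` is hypothetical.

```python
def list_tensorflow_gpus():
    """Return the names of the GPUs visible to TensorFlow (empty list on CPU-only nodes)."""
    # Imported inside the function so the definition loads even where TensorFlow is absent.
    import tensorflow as tf

    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

Calling `list_tensorflow_gpus()` in a notebook cell returns an empty list when no GPU is visible, which usually means the session is not attached to a GPU-accelerated pool.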

PyTorch

PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.

For more information about PyTorch, you can visit the PyTorch documentation.
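
Because PyTorch runs on both GPUs and CPUs, the same code can target either device. The sketch below picks a GPU when one is available and otherwise falls back to the CPU, then runs a single forward pass. The tiny linear model, the tensor sizes, and the function name `run_on_best_device` are illustrative assumptions.

```python
def run_on_best_device(batch_size=4, features=8):
    """Run one forward pass of a tiny linear model on a GPU if present, else on the CPU."""
    # Imported inside the function so the definition loads even where PyTorch is absent.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(features, 1).to(device)
    inputs = torch.randn(batch_size, features, device=device)

    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        outputs = model(inputs)
    return device.type, tuple(outputs.shape)
```

The function returns the device type that was used (`"cuda"` or `"cpu"`) together with the output shape, which makes it easy to verify where the computation ran.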

Horovod

Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. Horovod was developed to make distributed deep learning fast and easy to use. With this framework, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of code. In addition, Horovod can run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline.

To learn more about how to run distributed training jobs in Azure Synapse Analytics, you can visit the following tutorials:
Tutorial: Distributed training with Horovod and PyTorch
Tutorial: Distributed training with Horovod and TensorFlow

For more information about Horovod, you can visit the Horovod documentation.
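
The "few lines of code" that Horovod adds to an existing PyTorch script can be sketched as follows. This is an illustration based on the upstream Horovod API (`horovod.torch` and `horovod.spark.run`), not a copy of the Azure Synapse tutorials. The function names `scaled_learning_rate`, `train_fn`, and `launch_training`, the tiny linear model, and the synthetic data are all hypothetical placeholders. Scaling the learning rate by the number of workers is a common convention rather than a requirement.

```python
def scaled_learning_rate(base_lr, world_size):
    """Linear scaling: multiply the single-worker learning rate by the worker count."""
    return base_lr * world_size


def train_fn(base_lr=0.01, epochs=2):
    """Training function that Horovod runs once per worker process."""
    # Imports live inside the function because it executes on the Spark executors.
    import torch
    import horovod.torch as hvd

    hvd.init()
    use_gpu = torch.cuda.is_available()
    if use_gpu:
        torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU
    device = torch.device("cuda" if use_gpu else "cpu")

    model = torch.nn.Linear(8, 1).to(device)
    lr = scaled_learning_rate(base_lr, hvd.size())
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # The Horovod-specific lines: wrap the optimizer and sync the initial state.
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Synthetic data stands in for a real, per-worker shard of the dataset.
    inputs = torch.randn(64, 8, device=device)
    targets = torch.randn(64, 1, device=device)

    final_loss = float("nan")
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        final_loss = loss.item()
    return final_loss


def launch_training(num_proc=2):
    """Start train_fn on num_proc Spark tasks and return one result per worker."""
    import horovod.spark

    return horovod.spark.run(train_fn, kwargs={"base_lr": 0.01, "epochs": 2}, num_proc=num_proc)
```

In a notebook attached to a GPU-accelerated pool, calling `launch_training(num_proc=2)` would be expected to run `train_fn` on two workers. The training loop itself is ordinary PyTorch; only initialization, the optimizer wrapper, and the two broadcast calls are specific to Horovod.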

Petastorm

Petastorm is an open source data access library that enables single-node or distributed training of deep learning models. This library enables training directly from datasets in Apache Parquet format and from datasets that have already been loaded as an Apache Spark DataFrame. Petastorm supports popular training frameworks such as TensorFlow and PyTorch.

For more information about Petastorm, you can visit the Petastorm GitHub page or the Petastorm API documentation.
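
As a rough illustration of training input from a Spark DataFrame, the sketch below uses Petastorm's Spark converter API (`make_spark_converter` and `make_torch_dataloader`) to read one batch as PyTorch tensors. It assumes a non-empty DataFrame with numeric columns and a cache location that the cluster can write to. The function name `first_batch_shapes` and the `cache_dir_url` argument are hypothetical, and no particular cache path is implied for Azure Synapse.

```python
def first_batch_shapes(spark, df, cache_dir_url, batch_size=32):
    """Read one batch of df through Petastorm and return {column name: tensor shape}."""
    # Imported inside the function so the definition loads even where Petastorm is absent.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame as Parquet files under this location.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, cache_dir_url)
    converter = make_spark_converter(df)
    try:
        with converter.make_torch_dataloader(batch_size=batch_size, num_epochs=1) as loader:
            # Each batch is a dict that maps column names to tensors.
            batch = next(iter(loader))
            return {name: tuple(tensor.shape) for name, tensor in batch.items()}
    finally:
        converter.delete()  # remove the cached Parquet files
```

For TensorFlow, the same converter offers `make_tf_dataset`, which yields a `tf.data.Dataset` instead of a PyTorch data loader.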

Next steps

This article provides an overview of the various options to train machine learning models within Apache Spark pools in Azure Synapse Analytics. You can learn more about model training by following the links below:

Run SparkML experiments: Apache SparkML Tutorial
View libraries within the Apache Spark 3 runtime: Apache Spark 3 Runtime
Accelerate ETL workloads with RAPIDS: Apache Spark Rapids