Deep learning (Preview)
Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options for training machine learning models with Apache Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.
Warning
- The GPU-accelerated preview is limited to the Azure Synapse runtimes for Apache Spark 3.1 (unsupported) and Apache Spark 3.2 (End of Support announced).
- Azure Synapse Runtime for Apache Spark 3.1 has reached its end of support as of January 26, 2023, with official support discontinued effective January 26, 2024, and no further addressing of support tickets, bug fixes, or security updates beyond this date.
- Azure Synapse Runtime for Apache Spark 3.2 has reached its end of support as of July 8, 2023, with no further bug or feature fixes, but security fixes may be backported based on risk assessment, and it will be retired and disabled as of July 8, 2024.
GPU-enabled Apache Spark pools
To simplify the process of creating and managing pools, Azure Synapse pre-installs low-level libraries and sets up the complex networking requirements between compute nodes. This integration allows users to get started with GPU-accelerated pools within just a few minutes. To learn more, see the quickstart on how to create a GPU-accelerated pool.
Note
- GPU-accelerated pools can be created in workspaces located in East US, Australia East, and North Europe.
- GPU-accelerated pools are only available with the Apache Spark 3.1 (unsupported) and 3.2 runtimes.
- You might need to request a limit increase in order to create GPU-enabled clusters.
GPU ML Environment
Azure Synapse Analytics provides built-in support for deep learning infrastructure. The Azure Synapse Analytics runtimes for Apache Spark 3 include support for the most common deep learning libraries, such as TensorFlow and PyTorch. The runtimes also include supporting libraries such as Petastorm and Horovod, which are commonly used for distributed training.
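As a quick sanity check, a notebook cell along the following lines can report whether the preinstalled frameworks can see a GPU. This is a minimal sketch, not part of the Synapse documentation: `detect_gpus` is a hypothetical helper name, and the imports are guarded so the cell also runs where a framework is missing.

```python
def detect_gpus():
    """Report how many GPUs each installed framework can see.

    A value of None means the framework is not installed.
    (detect_gpus is a hypothetical helper, not a Synapse API.)
    """
    report = {}

    try:
        import torch
        # device_count() is only meaningful when CUDA is usable
        report["pytorch"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        report["pytorch"] = None

    try:
        import tensorflow as tf
        # Lists the physical GPU devices visible to TensorFlow
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        report["tensorflow"] = None

    return report


if __name__ == "__main__":
    print(detect_gpus())
```

On a GPU-accelerated pool, both counts would be expected to be at least 1 on the driver; a count of 0 suggests the session is attached to a CPU-only pool.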
TensorFlow
TensorFlow is an open source machine learning framework for all developers. It is used for implementing machine learning and deep learning applications.
For more information about TensorFlow, you can visit the TensorFlow API documentation.
PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
For more information about PyTorch, you can visit the PyTorch documentation.
Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. Horovod was developed to make distributed deep learning fast and easy to use. With this framework, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of code. In addition, Horovod can run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline.
To learn more about how to run distributed training jobs in Azure Synapse Analytics, see the following tutorials:
- Tutorial: Distributed training with Horovod and PyTorch
- Tutorial: Distributed training with Horovod and TensorFlow
For more information about Horovod, you can visit the Horovod documentation.
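To illustrate the "few lines of code" that Horovod adds to an existing script, the sketch below wraps a toy PyTorch training loop. It assumes `torch` and `horovod` are installed, as they are in the GPU runtimes described above. The model, the random data, and the helper `scaled_learning_rate` are hypothetical placeholders, and scaling the learning rate by the number of workers is a common convention rather than a requirement.

```python
def scaled_learning_rate(base_lr, world_size):
    """Linear scaling convention: grow the learning rate with the worker count."""
    return base_lr * world_size


def train(base_lr=0.01, epochs=2):
    # Imported lazily so this module loads on machines without these packages.
    import torch
    import horovod.torch as hvd

    hvd.init()  # Horovod addition: each process learns its rank and the world size
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(hvd.local_rank())  # Horovod addition: one GPU per process

    model = torch.nn.Linear(10, 1)  # hypothetical toy model
    if use_cuda:
        model.cuda()

    optimizer = torch.optim.SGD(
        model.parameters(), lr=scaled_learning_rate(base_lr, hvd.size())
    )
    # Horovod addition: average gradients across all workers on each step
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )
    # Horovod addition: start every worker from the same initial state
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    device = next(model.parameters()).device
    for _ in range(epochs):
        # Random tensors stand in for a real, per-worker data shard
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()
```

On a Spark pool, a function like this could be launched across executors with `horovod.spark.run(train, num_proc=2)`, where the process count of 2 is only an example. The linked tutorials show the full workflow.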
Petastorm
Petastorm is an open source data access library that enables single-node or distributed training of deep learning models. This library enables training directly from datasets in Apache Parquet format and from datasets that have already been loaded as an Apache Spark DataFrame. Petastorm supports popular training frameworks such as TensorFlow and PyTorch.
For more information about Petastorm, you can visit the Petastorm GitHub page or the Petastorm API documentation.
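The DataFrame path can be sketched with Petastorm's Spark converter. The example assumes `petastorm` and `torch` are installed and that `spark` and `df` are an existing SparkSession and a DataFrame with numeric columns. The function name `first_batch_shapes` and the cache location `file:///tmp/petastorm_cache` are hypothetical choices for illustration; on a real pool the cache would typically point to shared storage.

```python
def first_batch_shapes(spark, df, cache_dir_url="file:///tmp/petastorm_cache", batch_size=32):
    """Materialize a Spark DataFrame with Petastorm and return the tensor
    shapes of the first PyTorch batch, keyed by column name."""
    # Imported lazily so this module loads on machines without petastorm.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm caches the DataFrame as Parquet files under this directory
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, cache_dir_url)
    converter = make_spark_converter(df)
    try:
        # The loader yields dictionaries that map column names to tensors
        with converter.make_torch_dataloader(batch_size=batch_size) as loader:
            batch = next(iter(loader))
            return {name: tuple(tensor.shape) for name, tensor in batch.items()}
    finally:
        converter.delete()  # remove the cached Parquet copy
```

For TensorFlow, the same converter exposes `make_tf_dataset`, which plays the role that `make_torch_dataloader` plays here.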
Next steps
This article provides an overview of the options for training machine learning models within Apache Spark pools in Azure Synapse Analytics. You can learn more about model training from the resources below:
- Run SparkML experiments: Apache SparkML Tutorial
- View libraries within the Apache Spark 3 runtime: Apache Spark 3 Runtime
- Accelerate ETL workloads with RAPIDS: Apache Spark Rapids