Understand Apache Spark for U-SQL developers
Microsoft supports several analytics services, such as Azure Databricks and Azure HDInsight, as well as Azure Data Lake Analytics. We hear from developers that they have a clear preference for open-source solutions as they build analytics pipelines. To help U-SQL developers understand Apache Spark, and how to transform U-SQL scripts to Apache Spark, we've created this guidance.
It includes a number of steps you can take, and several alternatives.
Steps to transform U-SQL to Apache Spark
Transform your job orchestration pipelines
If you use Azure Data Factory to orchestrate your Azure Data Lake Analytics scripts, you'll have to adjust them to orchestrate the new Spark programs.
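As a rough illustration of that adjustment, the sketch below takes a U-SQL activity definition (as a Python dictionary mirroring the pipeline JSON) and derives a skeleton for an HDInsight Spark activity. The activity types `DataLakeAnalyticsU-SQL` and `HDInsightSpark` and the properties `rootPath`, `entryFilePath`, and `arguments` are given as recalled from the Data Factory activity schemas, so verify them against the current documentation. The helper name `usql_activity_to_spark`, the sample names, and the `--name value` argument convention are assumptions made for illustration only. The Spark program itself still has to be written separately.

```python
import copy


def usql_activity_to_spark(activity, spark_linked_service, root_path, entry_file_path):
    """Derive a Spark activity skeleton from a U-SQL activity definition.

    Hypothetical helper: it only reshapes the orchestration metadata and
    does not translate the U-SQL script itself.
    """
    if activity.get("type") != "DataLakeAnalyticsU-SQL":
        raise ValueError("expected a DataLakeAnalyticsU-SQL activity")

    usql_props = activity.get("typeProperties", {})

    # U-SQL parameters are name/value pairs, while the Spark activity takes a
    # flat argument list. Passing them as "--name value" is an assumed
    # convention that the Spark program would have to parse.
    arguments = []
    for name, value in usql_props.get("parameters", {}).items():
        arguments += [f"--{name}", str(value)]

    spark_activity = {
        "name": activity["name"],
        "type": "HDInsightSpark",
        "linkedServiceName": {
            "referenceName": spark_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "rootPath": root_path,
            "entryFilePath": entry_file_path,
            "arguments": arguments,
        },
    }

    # Keep the position of the activity in the pipeline unchanged.
    if "dependsOn" in activity:
        spark_activity["dependsOn"] = copy.deepcopy(activity["dependsOn"])
    return spark_activity


# Example input with made-up names and paths.
usql_activity = {
    "name": "ProcessSearchLog",
    "type": "DataLakeAnalyticsU-SQL",
    "typeProperties": {
        "scriptPath": "scripts/SearchLogProcessing.usql",
        "degreeOfParallelism": 3,
        "parameters": {"in": "/input/SearchLog.tsv", "out": "/output/Result.tsv"},
    },
    "dependsOn": [{"activity": "CopyInput", "dependencyConditions": ["Succeeded"]}],
}

spark_activity = usql_activity_to_spark(
    usql_activity,
    spark_linked_service="MyHDInsightLinkedService",
    root_path="adfspark",
    entry_file_path="process_search_log.py",
)
```

U-SQL-specific settings such as `degreeOfParallelism` have no direct counterpart in the Spark activity and are dropped here, because resource sizing moves to the cluster and Spark configuration. If you target Azure Databricks instead, the activity type and its properties differ.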
Understand the differences between how U-SQL and Spark manage data
If you want to move your data from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2, you will have to copy both the file data and the catalog-maintained data. Note that Azure Data Lake Analytics only supports Azure Data Lake Storage Gen1. For more information, see Understand Spark data formats.
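Once the files have been copied, paths embedded in scripts and pipeline parameters also need to point at the new store. The sketch below rewrites a Gen1 `adl://` URI into the `abfss://` form used by Gen2. It assumes the copy preserved the folder structure. The function name `gen1_to_gen2_uri` and the account and container names are hypothetical. The function only rewrites strings: it does not copy any file or catalog data.

```python
from urllib.parse import urlparse

GEN1_HOST_SUFFIX = ".azuredatalakestore.net"


def gen1_to_gen2_uri(gen1_uri, gen2_account, container):
    """Map an adl:// URI to the matching abfss:// URI.

    Assumes the data was copied to the same relative path inside
    `container` of the Gen2 storage account `gen2_account`.
    """
    parsed = urlparse(gen1_uri)
    if parsed.scheme != "adl" or not parsed.netloc.endswith(GEN1_HOST_SUFFIX):
        raise ValueError(f"not an Azure Data Lake Storage Gen1 URI: {gen1_uri}")
    return f"abfss://{container}@{gen2_account}.dfs.core.windows.net{parsed.path}"


# Example with made-up account and container names.
new_uri = gen1_to_gen2_uri(
    "adl://contosogen1.azuredatalakestore.net/input/SearchLog.tsv",
    gen2_account="contosogen2",
    container="data",
)
print(new_uri)
```

Gen2 adds a container (file system) level that Gen1 does not have, which is why the target container has to be supplied explicitly.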
Transform your U-SQL scripts to Spark
Before transforming your U-SQL scripts, you will have to choose an analytics service. Some of the available compute services are:
- Azure Data Factory DataFlow: Mapping data flows are visually designed data transformations that allow data engineers to develop graphical data transformation logic without writing code. While not suited to executing complex user code, they can easily represent traditional SQL-like dataflow transformations.
- Azure HDInsight Hive: Apache Hive on HDInsight is suited to Extract, Transform, and Load (ETL) operations. This means you are going to translate your U-SQL scripts to Apache Hive.
- Apache Spark engines such as Azure HDInsight Spark or Azure Databricks: This means you are going to translate your U-SQL scripts to Spark. For more information, see Understand Spark data formats.
Caution
Both Azure Databricks and Azure HDInsight Spark are cluster services and not serverless jobs like Azure Data Lake Analytics. You will have to consider how to provision the clusters to get the appropriate cost/performance ratio and how to manage their lifetime to minimize your costs. These services have different performance characteristics with user code written in .NET, so you will have to either write wrappers or rewrite your code in a supported language. For more information, see Understand Spark data formats, Understand Apache Spark code concepts for U-SQL developers, and .NET for Apache Spark.
Next steps
- Understand Spark data formats for U-SQL developers
- Understand Spark code concepts for U-SQL developers
- Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2
- .NET for Apache Spark
- Transform data using Hadoop Hive activity in Azure Data Factory
- Transform data using Spark activity in Azure Data Factory
- What is Apache Spark in Azure HDInsight