Optimize Apache Spark applications in HDInsight

This article provides an overview of strategies to optimize Apache Spark applications on Azure HDInsight.

Overview

You might face below common Scenarios

The same spark job is slower than before in the same HDInsight cluster
The spark job is slower in HDInsight cluster than on-premise or other third party service provider
The spark job is slower in one HDI cluster than another HDI cluster

The performance of your Apache Spark jobs depends on multiple factors. These performance factors include:

Check if ResourceManager or NodeManager alerts
Check ResourceManager and NodeManager status in YARN > SUMMARY: All NodeManager should be in Started and only Active ResourceManager should be in Started

Check if Yarn UI is accessible through https://YOURCLUSTERNAME.azurehdinsight.net/yarnui/hn/cluster
Check if any exceptions or errors in ResourceManager log in /var/log/hadoop-yarn/yarn/hadoop-yarn-resourcemanager-*.log

See more information in Yarn Common Issues

Go to Ambari UI > YARN > SUMMARY, check CLUSTER MEMORY in ServiceMetrics
Check yarn queue metrics in details:

Go to Yarn UI, check Yarn scheduler metrics through https://YOURCLUSTERNAME.azurehdinsight.net/yarnui/hn/cluster/scheduler
Alternatively, you can check yarn scheduler metrics through Yarn Rest API. For example, curl -u "xxxx" -sS -G "https://YOURCLUSTERNAME.azurehdinsight.net/ws/v1/cluster/scheduler". For ESP, you should use domain admin user.

All executors resources: spark.executor.instances * (spark.executor.memory + spark.yarn.executor.memoryOverhead) and spark.executor.instances * spark.executor.cores. See more information in spark executors configuration
ApplicationMaster
- In cluster mode, use spark.driver.memory and spark.driver.cores
- In client mode, use spark.yarn.am.memory+spark.yarn.am.memoryOverhead and spark.yarn.am.cores

Note

yarn.scheduler.minimum-allocation-mb <= spark.executor.memory+spark.yarn.executor.memoryOverhead <= yarn.scheduler.maximum-allocation-mb

Compare your new application total resources with yarn available resources in your specified queue

We need to identify below symptoms through Spark UI or Spark History UI:

Which stage is slow
Are total executor CPU v-cores fully utilized in Event-Timeline in Stage tab
If using spark sql, what's the physical plan in SQL tab
Is DAG too long in one stage
Observe tasks metrics(input size, shuffle write size, GC Time) in Stage tab

There are many optimizations that can help you overcome these challenges, such as caching, and allowing for data skew.

In each of the following articles, you can find information on different aspects of Spark optimization.

spark.sql.shuffle.paritions is 200 by default. We can adjust based on the business needs when shuffling data for joins or aggregations.
spark.sql.files.maxPartitionBytes is 1G by default in HDI. The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
AQE in Spark 3.0. See Adaptive Query Execution