Debug Apache Spark jobs running on Azure HDInsight
In this article, you learn how to track and debug Spark jobs running on HDInsight clusters using the YARN UI, Spark UI, and the Spark History Server. For this article, we start a Spark job using a notebook available with the Spark cluster, Machine learning: Predictive analysis on food inspection data using MLLib. The same steps apply to an application submitted by any other means, for example, spark-submit.
You must have the following:
- An Azure subscription. See Get Azure free trial.
- An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.
- A running instance of the notebook Machine learning: Predictive analysis on food inspection data using MLLib. For instructions on how to run this notebook, follow the link.
Track an application in the YARN UI
Launch the YARN UI. From the cluster blade, click Cluster Dashboard, and then click YARN.
Alternatively, you can launch the YARN UI from the Ambari UI. To launch the Ambari UI, from the cluster blade, click Cluster Dashboard, and then click HDInsight Cluster Dashboard. From the Ambari UI, click YARN, click Quick Links, click the active resource manager, and then click ResourceManager UI.
Because you started the Spark job using Jupyter notebooks, the application has the name remotesparkmagics (this is the name for all applications that are started from the notebooks). Click the application ID next to the application name to get more information about the job. This launches the application view.
For applications launched from Jupyter notebooks, the status remains RUNNING until you exit the notebook.
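The application list shown in the YARN UI is also exposed by the ResourceManager REST API, under the `/ws/v1/cluster/apps` endpoint. The following is a minimal sketch, not an HDInsight-specific recipe: it only shows how a response of that shape could be filtered by application name. The helper name `find_apps_by_name`, the host name, and the sample payload are made up for illustration, and how you reach the endpoint on your cluster (gateway URL, credentials) is left out.

```python
import json

def find_apps_by_name(payload, name):
    """Return (id, state, trackingUrl) for every YARN application with the given name."""
    # The ResourceManager can return {"apps": null} when there are no applications.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (app["id"], app["state"], app.get("trackingUrl"))
        for app in apps
        if app.get("name") == name
    ]

# Made-up payload shaped like a /ws/v1/cluster/apps response (illustrative values only).
sample_response = json.loads("""
{
  "apps": {
    "app": [
      {"id": "application_1500000000000_0001",
       "name": "remotesparkmagics",
       "state": "RUNNING",
       "trackingUrl": "http://example-host:8088/proxy/application_1500000000000_0001/"},
      {"id": "application_1500000000000_0002",
       "name": "other-job",
       "state": "FINISHED",
       "trackingUrl": null}
    ]
  }
}
""")

notebook_apps = find_apps_by_name(sample_response, "remotesparkmagics")
for app_id, state, tracking_url in notebook_apps:
    print(app_id, state, tracking_url)
```

Filtering on the name remotesparkmagics picks out the notebook-launched applications, and the state field reflects the RUNNING status described above.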
From the application view, you can drill down further to find the containers associated with the application and their logs (stdout/stderr). You can also launch the Spark UI by clicking the link corresponding to the Tracking URL, as shown below.
Track an application in the Spark UI
In the Spark UI, you can drill down into the Spark jobs that are spawned by the application you started earlier.
To launch the Spark UI, from the application view, click the link next to the Tracking URL, as shown in the screen capture above. You can see all the Spark jobs that are launched by the application running in the Jupyter notebook.
Click the Executors tab to see processing and storage information for each executor. You can also retrieve the call stack by clicking on the Thread Dump link.
Click the Stages tab to see the stages associated with the application.
Each stage can have multiple tasks, for which you can view execution statistics, as shown below.
From the stage details page, you can launch DAG Visualization. Expand the DAG Visualization link at the top of the page, as shown below.
The DAG (Directed Acyclic Graph) represents the different stages in the application. Each blue box in the graph represents a Spark operation invoked from the application.
From the stage details page, you can also launch the application timeline view. Expand the Event Timeline link at the top of the page, as shown below.
This displays the Spark events in the form of a timeline. The timeline view is available at three levels: across jobs, within a job, and within a stage. The image above captures the timeline view for a given stage.
If you select the Enable zooming check box, you can scroll left and right across the timeline view.
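The per-stage statistics shown on the Stages tab are also available from Spark's monitoring REST API, under `/api/v1/applications/[app-id]/stages`. As a rough sketch, the following ranks stages by executor run time to point at the ones worth inspecting first. The helper name `summarize_stages` and the sample records are hypothetical; only a handful of the fields that the API returns are used here.

```python
def summarize_stages(stages):
    """Return one summary per stage, slowest first by executorRunTime (milliseconds)."""
    rows = [
        {
            "stageId": stage["stageId"],
            "status": stage["status"],
            "completedTasks": stage.get("numCompleteTasks", 0),
            "failedTasks": stage.get("numFailedTasks", 0),
            "runTimeMs": stage.get("executorRunTime", 0),
        }
        for stage in stages
    ]
    return sorted(rows, key=lambda row: row["runTimeMs"], reverse=True)

# Made-up records shaped like entries of a /stages response (illustrative values only).
sample_stages = [
    {"stageId": 0, "status": "COMPLETE", "numCompleteTasks": 4,
     "numFailedTasks": 0, "executorRunTime": 1200},
    {"stageId": 1, "status": "COMPLETE", "numCompleteTasks": 200,
     "numFailedTasks": 2, "executorRunTime": 98000},
    {"stageId": 2, "status": "ACTIVE", "numCompleteTasks": 10,
     "numFailedTasks": 0, "executorRunTime": 4500},
]

stage_summary = summarize_stages(sample_stages)
for row in stage_summary:
    print(row)
```

A stage that combines a long run time with failed tasks, like stage 1 in the sample, is a natural candidate for a closer look in the stage details page.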
Other tabs in the Spark UI provide useful information about the Spark instance as well.
- Storage tab - If your application creates RDDs, you can find information about them in the Storage tab.
- Environment tab - This tab provides useful information about your Spark instance, such as:
  - Scala version
  - Event log directory associated with the cluster
  - Number of executor cores for the application
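The values on the Environment tab can also be read programmatically from Spark's monitoring REST API, under `/api/v1/applications/[app-id]/environment`. The sketch below extracts the three items listed above from a response of that shape. It assumes the standard Spark property names `spark.eventLog.dir` and `spark.executor.cores`; the helper name `pick_environment_details` and all sample values are made up for illustration.

```python
def pick_environment_details(env):
    """Extract Scala version, event log directory, and executor cores from an environment response."""
    # sparkProperties arrives as a list of [key, value] pairs.
    props = dict(env.get("sparkProperties", []))
    return {
        "scalaVersion": env.get("runtime", {}).get("scalaVersion"),
        "eventLogDir": props.get("spark.eventLog.dir"),
        "executorCores": props.get("spark.executor.cores"),
    }

# Made-up response shaped like /environment output (illustrative values only).
sample_environment = {
    "runtime": {"javaVersion": "1.8.0", "scalaVersion": "version 2.11.8"},
    "sparkProperties": [
        ["spark.eventLog.dir", "wasb:///example/spark-events"],
        ["spark.executor.cores", "2"],
        ["spark.app.name", "remotesparkmagics"],
    ],
}

environment_details = pick_environment_details(sample_environment)
for key, value in environment_details.items():
    print(key, "=", value)
```

A property that is not set explicitly comes back as `None` here, which is a choice of this sketch rather than an indication of the default that Spark applies.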
Find information about completed jobs using the Spark History Server
Once a job is completed, the information about the job is persisted in the Spark History Server.
To launch the Spark History Server, from the cluster blade, click Cluster Dashboard, and then click Spark History Server.
Alternatively, you can launch the Spark History Server UI from the Ambari UI. To launch the Ambari UI, from the cluster blade, click Cluster Dashboard, and then click HDInsight Cluster Dashboard. From the Ambari UI, click Spark, click Quick Links, and then click Spark History Server UI.
You will see all the completed applications listed. Click an application ID to drill down into an application for more info.
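The Spark History Server serves the same list through its REST API, under `/api/v1/applications` (with `status=completed` as an optional filter). The following sketch turns a response of that shape into a short list of completed runs with their durations. The helper name `completed_runs` and the sample records are hypothetical, and only a subset of the returned fields is used.

```python
def completed_runs(applications):
    """Return (application id, name, duration in seconds) for each completed attempt."""
    runs = []
    for app in applications:
        for attempt in app.get("attempts", []):
            if attempt.get("completed"):
                # The API reports duration in milliseconds.
                runs.append((app["id"], app["name"], attempt["duration"] / 1000.0))
    return runs

# Made-up records shaped like a /api/v1/applications response (illustrative values only).
sample_applications = [
    {"id": "application_1500000000000_0001", "name": "remotesparkmagics",
     "attempts": [{"completed": True, "duration": 65000, "sparkUser": "livy"}]},
    {"id": "application_1500000000000_0003", "name": "still-running",
     "attempts": [{"completed": False, "duration": 0, "sparkUser": "livy"}]},
]

finished = completed_runs(sample_applications)
for app_id, name, seconds in finished:
    print(app_id, name, seconds)
```

The application IDs returned this way are the same ones you click in the History Server UI to drill down into an application.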
For data analysts
- Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data
- Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results
- Website log analysis using Spark in HDInsight
- Application Insight telemetry data analysis using Spark in HDInsight
- Use Caffe on Azure HDInsight Spark for distributed deep learning
For Spark developers
- Create a standalone application using Scala
- Run jobs remotely on a Spark cluster using Livy
- Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications
- Spark Streaming: Use Spark in HDInsight for building real-time streaming applications
- Use HDInsight Tools Plugin for IntelliJ IDEA to debug Spark applications remotely
- Use Zeppelin notebooks with a Spark cluster on HDInsight
- Kernels available for Jupyter notebook in Spark cluster for HDInsight
- Use external packages with Jupyter notebooks
- Install Jupyter on your computer and connect to an HDInsight Spark cluster