Use extended Apache Spark history server to debug and diagnose Apache Spark applications

This article provides guidance on how to use the extended Apache Spark history server to debug and diagnose completed and running Spark applications.

The extension includes a data tab, graph tab, and diagnosis tab. Use the Data tab to check the input and output data of the Spark job. The Graph tab shows you the data flow and replay of the job graph. The Diagnosis tab shows you Data Skew, Time Skew, and Executor Usage Analysis.
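For context, the walkthroughs in this article assume a Spark application that reads, shuffles, and writes data. A minimal PySpark sketch such as the following produces the inputs, outputs, and stages that these tabs visualize; the application name and abfss:// storage paths are illustrative placeholders, not values from this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch: read, aggregate (shuffle), and write data so the extended
# history server has inputs, outputs, and table operations to show.
# The abfss:// paths are hypothetical placeholders for your own linked storage.
spark = SparkSession.builder.appName("history-server-demo").getOrCreate()

orders = spark.read.parquet(
    "abfss://data@yourstorageaccount.dfs.core.windows.net/orders"
)

daily_totals = (
    orders.groupBy("order_date")                      # triggers a shuffle stage
          .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet(
    "abfss://data@yourstorageaccount.dfs.core.windows.net/daily_totals"
)
```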

Access the Apache Spark history server

The Apache Spark history server is the web user interface for completed and running Spark applications. You can open its web interface from Azure Synapse Analytics.

Open the Spark History Server web UI from the Apache Spark Applications node

  1. Open Azure Synapse Analytics.

  2. Select Monitor, then select Apache Spark Applications.

    Select monitor then select spark application.

  3. Select an application, then select Log query to open it.

    Open log query window.

  4. Select Spark history server, and the Spark History Server web UI appears.

    Open spark history server.

Open the Spark History Server web UI from the Data node

  1. From your Synapse Studio notebook, select Spark history server from the job execution output cell or from the status panel at the bottom of the notebook document. Select Session details.

    Launch Spark history server 1

  2. Select Spark history server from the slide out panel.

    Launch Spark history server 2

Explore the Data tab in Spark history server

Select the Job ID for the job you want to view. Then select Data on the tool menu to get the data view. This section shows you how to do various tasks in the Data tab.

  • Check the Inputs, Outputs, and Table Operations by selecting the tabs separately.

    Data for Spark application tabs

  • Copy all rows by selecting Copy.

    Data for Spark application copy

  • Save all data as a CSV file by selecting csv.

    Data for Spark application save

  • Search by entering keywords in the Search field. The search results display immediately.

    Data for Spark application search

  • Select a column header to sort the table, select the plus sign to expand a row and show more details, or select the minus sign to collapse a row.

    Data for Spark application table

  • Download a single file by selecting Partial Download. The selected file downloads to your local machine. If the file no longer exists, a new tab appears with an error message.

    Data for Spark application download row

  • To copy a full path or relative path, select the Copy Full Path or Copy Relative Path options that expand from the drop-down menu. For Azure Data Lake Storage files, Open in Azure Storage Explorer launches Azure Storage Explorer and locates the folder when you are signed in.

    Data for Spark application copy path

  • Select the page numbers below the table to navigate between pages when the table has too many rows to display on one page.

    Data for Spark application page

  • Hover over the question mark beside Data to show the tooltip, or select the question mark to get more information.

    Data for Spark application more info

  • Send feedback with issues by selecting Provide us feedback.

    Spark graph provide us feedback again

Graph tab in Apache Spark history server

Select the Job ID for the job you want to view. Then, select Graph on the tool menu to get the job graph view.

Overview

You can see an overview of your job in the generated job graph. By default, the graph shows all jobs. You can filter this view by Job ID.

Spark application and job graph job ID

Display

By default, the Progress display is selected. You can check the data flow by selecting Read or Written in the Display dropdown list.

Spark application and job graph display

The graph node displays the colors shown in the heatmap legend.

Spark application and job graph heatmap

Playback

To play back the job, select Playback. You can select Stop at any time to stop. During playback, the task colors show different statuses:

  • Green: Succeeded. The job completed successfully.

  • Orange: Retried. Instances of tasks that failed but don't affect the final result of the job. These tasks had duplicate or retry instances that might succeed later.

  • Blue: Running. The task is running.

  • White: Waiting or skipped. The task is waiting to run, or the stage is skipped.

  • Red: Failed. The task failed.

The following image shows Green, Orange, and Blue status colors.

Spark application and job graph color sample, running

The following image shows Green and White status colors.

Spark application and job graph color sample, skip

The following image shows Red and Green status colors.

Spark application and job graph color sample, failed

Note

Playback is supported for completed jobs. Playback isn't supported for incomplete jobs.

Zoom

Use the mouse scroll wheel to zoom in and out on the job graph, or select Zoom to fit to fit the graph to the screen.

Spark application and job graph zoom to fit

Tooltips

Hover over a graph node to see the tooltip when there are failed tasks, and select a stage to open its stage page.

Spark application and job graph tooltip

On the job graph tab, stages have a tooltip and a small icon displayed if they have tasks that meet the following conditions:

  • Data skew: data read size > average data read size of all tasks inside this stage * 2, and data read size > 10 MB.

  • Time skew: execution time > average execution time of all tasks inside this stage * 2, and execution time > 2 minutes.

Spark application and job graph skew icon
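Expressed as code, those two icon conditions amount to the checks below. This is only an illustrative sketch of the rules described above, not the history server's own implementation or API; the function names are made up.

```python
MB = 1000 * 1000  # decimal megabytes, matching the note under "Graph node description"

def is_data_skewed(task_read_bytes: float, avg_read_bytes: float) -> bool:
    # Data skew icon: read size more than twice the stage average and over 10 MB.
    return task_read_bytes > 2 * avg_read_bytes and task_read_bytes > 10 * MB

def is_time_skewed(task_seconds: float, avg_seconds: float) -> bool:
    # Time skew icon: execution time more than twice the stage average and over 2 minutes.
    return task_seconds > 2 * avg_seconds and task_seconds > 2 * 60
```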

Graph node description

The job graph node displays the following information of each stage:

  • ID.

  • Name or description.

  • Total task number.

  • Data read: the sum of input size and shuffle read size.

  • Data write: the sum of output size and shuffle write size.

  • Execution time: the time between start time of the first attempt and completion time of the last attempt.

  • Row count: the sum of input records, output records, shuffle read records and shuffle write records.

  • Progress.

    Note

    By default, the job graph node displays information from the last attempt of each stage (except for stage execution time). However, during playback, the graph node shows information of each attempt.

    Data read and write sizes are reported in decimal units: 1 MB = 1,000 KB = 1,000,000 bytes.
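A quick worked example of that unit convention, in plain Python with illustrative numbers:

```python
# The job graph reports sizes in decimal units: 1 MB = 1,000 KB = 1,000,000 bytes.
bytes_read = 25_000_000
print(bytes_read / (1000 * 1000))   # 25.0 MB, as displayed by the graph
print(bytes_read / (1024 * 1024))   # ~23.84 MiB in binary units, for comparison
```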

Provide feedback

Send feedback with issues by selecting Provide us feedback.

Spark application and job graph feedback

Explore the Diagnosis tab in Apache Spark history server

To access the Diagnosis tab, select a Job ID, then select Diagnosis on the tool menu to open the job diagnosis view. The Diagnosis tab includes Data Skew, Time Skew, and Executor Usage Analysis.

Check the Data Skew, Time Skew, and Executor Usage Analysis by selecting the tabs respectively.

SparkUI diagnosis data skew tab again

Data Skew

When you select the Data Skew tab, the corresponding skewed tasks are displayed based on the specified parameters.

  • Specify Parameters - The first section displays the parameters used to detect data skew. The default rule is: the task data read is greater than three times the average task data read, and the task data read is more than 10 MB. If you want to define your own rule for skewed tasks, you can choose your own parameters. The Skewed Stage and Skew Chart sections refresh accordingly. (A sketch that deliberately produces data skew follows this list.)

  • Skewed Stage - The second section displays stages that have skewed tasks meeting the criteria specified above. If a stage has more than one skewed task, the table displays only the most skewed task (for example, the task with the largest data read).

    sparkui diagnosis data skew tab

  • Skew Chart - When you select a row in the Skewed Stage table, the skew chart displays more task distribution details based on data read and execution time. Skewed tasks are marked in red, and normal tasks are marked in blue. The chart displays up to 100 sample tasks, and task details are displayed in the bottom-right panel.

    sparkui skew chart for stage 10
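If you want to reproduce this view, the sketch below deliberately creates data skew by routing most rows to a single key before a shuffle. It's a hypothetical example: the key name, row count, and output path are made up, and the write target should point at storage your workspace can access.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-skew-demo").getOrCreate()

# Give ~90% of rows the same join key so one task reads far more
# shuffle data than the stage average.
rows = spark.range(0, 10_000_000)
skewed = rows.withColumn(
    "key",
    F.when(F.rand() < 0.9, F.lit("hot_key")).otherwise(F.col("id").cast("string")),
)

# The aggregation shuffles by "key"; the task that handles "hot_key" becomes
# the skewed task that the Data Skew tab highlights. The output path is a placeholder.
skewed.groupBy("key").count().write.mode("overwrite").parquet(
    "abfss://data@yourstorageaccount.dfs.core.windows.net/skew_demo"
)
```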

Time Skew

The Time Skew tab displays skewed tasks based on task execution time.

  • Specify Parameters - The first section displays the parameters used to detect time skew. The default criteria are: the task execution time is greater than three times the average execution time, and the task execution time is greater than 30 seconds. You can change the parameters based on your needs. The Skewed Stage and Skew Chart sections display the corresponding stage and task information, just like the Data Skew tab described above.

  • Select Time Skew, and the filtered results are displayed in the Skewed Stage section according to the parameters set in the Specify Parameters section. Select one item in the Skewed Stage section, and the corresponding chart is drawn in the chart section, with task details displayed in the bottom-right panel. (A sketch that artificially produces time skew follows this list.)

    sparkui diagnosis time skew section
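Similarly, the following hedged sketch artificially produces time skew by making the tasks for one partition run much longer than the rest; the function name, partition count, and sleep duration are illustrative only.

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("time-skew-demo").getOrCreate()

@F.udf(returnType=IntegerType())
def slow_partition(partition_id):
    # Hypothetical illustration: rows in partition 0 each sleep one second,
    # so that task's execution time far exceeds the stage average.
    if partition_id == 0:
        time.sleep(1)
    return partition_id

df = (
    spark.range(0, 1000, numPartitions=8)
         .withColumn("pid", F.spark_partition_id())
         .withColumn("slow", slow_partition("pid"))
)
df.agg(F.sum("slow")).show()   # forces the UDF to run on every row
```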

Executor Usage Analysis

The Executor Usage Graph visualizes the allocation and running status of the Spark job's executors.

  1. Select Executor Usage Analysis, and four curves about executor usage are drawn: Allocated Executors, Running Executors, Idle Executors, and Max Executor Instances. For allocated executors, each "Executor added" or "Executor removed" event increases or decreases the number of allocated executors. You can check the Event Timeline in the Jobs tab for more comparison. (A configuration sketch follows these steps.)

    sparkui diagnosis executors tab

  2. Select a color icon in the legend to show or hide the corresponding curve in all charts.

    sparkui diagnoses select chart
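The shape of the Allocated Executors and Running Executors curves depends on how executor allocation is configured for your pool or session. As a rough sketch, the standard Spark dynamic allocation settings below are the kind of configuration that influences these curves; the values are illustrative, and a Synapse Spark pool may manage some of these settings for you.

```python
from pyspark.sql import SparkSession

# Illustrative values only; your Synapse Spark pool settings may take precedence.
spark = (
    SparkSession.builder
    .appName("executor-usage-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "8")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```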

Known issues

Input/output data that uses Resilient Distributed Datasets (RDDs) doesn't show in the Data tab.
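For example, assuming a placeholder path, input read through the RDD API isn't reflected in the Data tab, while the equivalent DataFrame read is:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-known-issue-demo").getOrCreate()
path = "abfss://data@yourstorageaccount.dfs.core.windows.net/logs"  # placeholder path

# RDD-based input: not reflected in the Data tab (this known issue).
rdd_lines = spark.sparkContext.textFile(path)

# DataFrame-based input: shown in the Data tab.
df_lines = spark.read.text(path)
```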

Next steps