question

DavidBeavon-2754 asked DavidBeavon-2754 answered

Spark UI is not loading for recently "terminated" job clusters

I've been having trouble with a feature in Azure Databricks. The Spark UI is not shown for job clusters that have recently completed ("terminated"). There is a link to the Spark UI in the Databricks portal, but when you click it, you are presented with a status message that just says "Loading":

Loading old UI for cluster "whatever"... This may take a few minutes.



This screen will remain the same for a long period of time (hours), and I eventually lose patience and close the window. I haven't yet tried waiting overnight, and even if I did, I'm not sure it would be reasonable to wait that long for the UI to respond.

When I previously encountered this issue and opened a support ticket with Databricks/Azure Databricks, they were not able to confirm any outages during the period in question. So far we have established that there is a "Spark History Server UI", a shared resource that can become congested with requests from multiple customers. I'm assuming this implies the issue simultaneously affects multiple customers, although we haven't yet established that for certain.


I've been using Azure Databricks in production for a few months now, and I'm not familiar enough to know whether the issue could be specific to us, or whether it might be a chronic issue that affects others in the same region as well. I googled for the problem and was unable to find any results for my search. So I thought it would be good to start a new discussion about the problem here in the Q&A.

Please let me know if anyone has an explanation, or has experienced this themselves. I'm also eager to hear if there are any tricks to get the workspace to start working properly. Whenever I encountered this issue, the problem wouldn't go away on its own until a day or two had passed. I haven't yet gotten any acknowledgement of these outages from Microsoft's perspective.

azure-databricks · dotnet-ml-big-data


DavidBeavon-2754 answered

It has been over a month, but the documentation is worth the wait:

https://kb.databricks.com/clusters/replay-cluster-spark-events.html




If the Spark UI is not working in the Databricks portal, then you can use this as a plan B.

The new documentation explains the full workaround. This is a solution that Databricks support engineers have been recommending for a long time; previously you had to open a support case before they would share this secret workaround.

Hope it helps.



DavidBeavon-2754 answered PRADEEPCHEEKATLA-MSFT commented

Over the past week I've been able to learn a bit more about this issue from tech support.

There is apparently a workaround available to customers if/when the "Spark History Server" isn't successfully showing the Spark UI in a Databricks workspace.

The workaround is only possible if you start by configuring a cluster to deliver logs to a DBFS location. Within those logs is an "eventlog" file, which is what is used to render the Spark UI.
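For reference, log delivery is configured on the cluster itself. A minimal sketch of the relevant fragment of a cluster definition, assuming a DBFS destination (the path below is only a placeholder):

```json
{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
```

With that in place, the logs (including the eventlog) are delivered under a subfolder named for the cluster id.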


This workaround allows us to render the eventlog file found in the delivered logs. The eventlog can only be rendered on a freshly started all-purpose cluster. Once the events are rendered, the Spark UI can be reviewed as normal. The folks in Databricks engineering said they would make this workaround available in a KB article once it has been tested by a sufficient number of customers. This approach is called "replaying" the events. The signature of their method looks like so:

def replaySparkEvents(pathToEventLogs: String): Unit = { ... }
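I can't share the body of the notebook, but given that signature, usage presumably looks something like the following (the DBFS path is a placeholder; substitute your own log-delivery location):

```scala
// Run on a freshly started all-purpose cluster, in the support-provided
// notebook that defines replaySparkEvents.
// The path is a placeholder; point it at the eventlog directory that was
// delivered by the job cluster's log-delivery configuration.
replaySparkEvents("dbfs:/cluster-logs/<cluster-id>/eventlog")
```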


If/when you are unable to use the Spark UI in the Azure Databricks workspace, you should contact tech support. They are likely to provide you with this workaround, especially since the problems with this Spark UI seem to be persistent, recurring, and unpredictable.



DavidBeavon-2754

Glad to know that your issue has been resolved. And thanks for sharing the solution, which might be beneficial to other community members reading this thread.


Do click on Accept Answer and Up-Vote the post that helps you; this can be beneficial to other community members.

DavidBeavon-2754 answered PRADEEPCHEEKATLA-MSFT commented

@PRADEEPCHEEKATLA-MSFT
It's not a resolution; it is a workaround. I'm still working with tech support to understand the circumstances under which the "normal" Spark UI functionality stops working. As I mentioned, it seems to be a recurring and unpredictable problem.

Given the existence of the workaround (replaySparkEvents), and given that it was delivered to me within a day of reporting the problem, it is pretty clear that this is not a new topic for the folks at Azure Databricks.

However, I think they need to refocus their efforts on the root cause that is preventing the Spark UI from working reliably in the first place. The workaround takes quite a bit more effort, and involves more configuration, than simply clicking the link in the Azure Databricks workspace. If that link worked consistently, it would save everyone a lot of time and avoid future tech support cases. I am still waiting on the public KB for "replaySparkEvents" and will post it here when I have a link.




Hello @DavidBeavon-2754,

When you say you are working with tech support, could you please share the Support Request (SR) number to track internally?


@PRADEEPCHEEKATLA-MSFT
Yes, tracking ID is # 2103310040002620

There have been several engineers involved, both within Databricks and Azure Databricks. I was given the workaround (i.e. the "replaySparkEvents" notebook) within two days.

Despite how quickly we received the workaround, it is taking longer for them to figure out why the Spark UI is going AWOL from time to time.

Note that it is normal for the UI to take a few minutes to load, since the job cluster is terminated and it must be rehydrated from the event logs. But we are also experiencing this other issue, where the Spark UI will not load no matter how long we wait.


Hello @DavidBeavon-2754,

Thanks for sharing the Support Request number.
Currently, our support engineer is investigating the issue and will get back to you soon.

DavidBeavon-2754 answered

Documentation for the workaround (replaySparkEvents) should be available shortly.

Also, the underlying bug in the Spark History Server should be fixed in the next year or so. You can inquire about the details by contacting Azure Databricks support and providing the improvement ID for the upcoming History Server enhancements (DB-I-3506).

