question

PavloVitynskyi-8395 asked · florentpousserot131313 edited

Spark job freezes without any progress for a long period of time

I have multiple Spark jobs deployed on Azure Databricks. Usually each of them takes less than 1 hour to process data and is scheduled to run every hour.
I faced the following issue:
Sometimes a job runs for an extremely long time (a few hours or even days) without making any progress, until I cancel it.
The version of Databricks Runtime is '7.5 (includes Apache Spark 3.0.1, Scala 2.12)'
During the inactivity period the only messages in the driver logs are:

 21/03/26 22:34:38 INFO HiveMetaStore: 1: get_database: default
 21/03/26 22:34:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
 21/03/26 22:34:38 INFO DriverCorral: Metastore health check ok
 21/03/26 22:39:29 INFO DriverCorral: DBFS health check ok
 21/03/26 22:39:30 WARN MetastoreMonitor: Failed to connect to the metastore InternalMysqlMetastore(DbMetastoreConfig{host=consolidated-centralus-prod-metastore-addl.mysql.database.azure.com, port=3306, dbName=organization123456789, user=[REDACTED]}). (timeSinceLastSuccess=16500028)
 java.lang.IllegalArgumentException: A health check named database already exists
  at com.codahale.metrics.health.HealthCheckRegistry.register(HealthCheckRegistry.java:101)
  at com.databricks.instrumentation.Instrumented$Dsl.instrumentJdbi(Instrumented.scala:242)
  at com.databricks.common.database.DatabaseUtils$.createDBI(DatabaseUtils.scala:170)
  at com.databricks.common.database.DatabaseUtils$.withDBI(DatabaseUtils.scala:497)
  at com.databricks.backend.daemon.driver.MetastoreMonitor.checkMetastore(MetastoreMonitor.scala:177)
  at com.databricks.backend.daemon.driver.MetastoreMonitor.$anonfun$doMonitor$1(MetastoreMonitor.scala:154)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:432)
  at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
  at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
  at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
  at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:277)
  at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:270)
  at com.databricks.threading.NamedTimer$$anon$1.withAttributionTags(NamedTimer.scala:94)
  at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:413)
  at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:339)
  at com.databricks.threading.NamedTimer$$anon$1.recordOperation(NamedTimer.scala:94)
  at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$2(NamedTimer.scala:103)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
  at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
  at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
  at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
  at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:94)
  at com.databricks.logging.UsageLogging.disableTracing(UsageLogging.scala:833)
  at com.databricks.logging.UsageLogging.disableTracing$(UsageLogging.scala:832)
  at com.databricks.threading.NamedTimer$$anon$1.disableTracing(NamedTimer.scala:94)
  at com.databricks.threading.NamedTimer$$anon$1.$anonfun$run$1(NamedTimer.scala:102)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:100)
  at com.databricks.threading.NamedTimer$$anon$1.run(NamedTimer.scala:101)
  at java.util.TimerThread.mainLoop(Timer.java:555)
  at java.util.TimerThread.run(Timer.java:505)
 21/03/26 22:39:38 INFO HiveMetaStore: 1: get_database: default
 21/03/26 22:39:38 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
 21/03/26 22:39:38 INFO DriverCorral: Metastore health check ok
 21/03/26 22:44:29 INFO DriverCorral: DBFS health check ok
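For what it's worth, the IllegalArgumentException above looks like it comes from the driver's metrics health-check registry rejecting a second registration under an already-used name ("database"), rather than from my job code. A minimal Python analogy of that registry behavior (illustrative only; the real code is the Java/Scala stack above):

```python
class HealthCheckRegistry:
    """Toy registry mirroring the duplicate-name check seen in the trace."""

    def __init__(self):
        self._checks = {}

    def register(self, name, check):
        # Registering a second check under an existing name fails, which is
        # exactly the message in the WARN above.
        if name in self._checks:
            raise ValueError(f"A health check named {name} already exists")
        self._checks[name] = check


registry = HealthCheckRegistry()
registry.register("database", lambda: "healthy")
try:
    registry.register("database", lambda: "healthy")  # duplicate name
except ValueError as e:
    message = str(e)
```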

It would be good to know the exact reason for this problem.
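In the meantime, as a stopgap I'm considering capping each run's duration with the job's `timeout_seconds` setting, so a hung run is cancelled automatically instead of running for days. A sketch of the request body for a Jobs API update (assuming the Jobs API; the job ID and timeout values below are placeholders):

```python
import json


def build_timeout_update(job_id: int, timeout_seconds: int) -> str:
    """Build a Databricks jobs/update request body that caps run duration.

    Databricks cancels any run that exceeds `timeout_seconds`.
    The job_id used below is a placeholder, not a real job.
    """
    payload = {
        "job_id": job_id,
        "new_settings": {
            "timeout_seconds": timeout_seconds,
        },
    }
    return json.dumps(payload)


# Cancel runs automatically after 2 hours (twice the usual < 1 h runtime).
body = build_timeout_update(job_id=123, timeout_seconds=7200)
```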

Thanks

azure-databricks

Hi @PavloVitynskyi-8395,

Thanks for reaching out. As you called out, the job usually completes in under 1 hour but now suddenly seems to run forever. For deeper investigation and immediate assistance, I would suggest filing a support case so that a support engineer can gather the required logs and troubleshoot the issue further.

Let us know if you don't have a support plan.

Thanks


Hi @PavloVitynskyi-8395,

Just checking to see if you have had a chance to file a support ticket. If you don't have a support plan, please let us know.


Thank you


Hi @KranthiPakala-MSFT ,
Yes, I created a support ticket and am waiting for a response.


I'm hitting the same problem with 'spark_version': '7.2.x-scala2.12'. Is there any solution or workaround for it? Thanks.


Hi all,

Apologies for the delayed response. Looking at the existing support ticket, troubleshooting is still ongoing with the Databricks team.

If this is a blocker, I would recommend filing a support ticket so that a support engineer can gather the required information and collaborate with the Databricks team on further troubleshooting.

Thank you


0 Answers