Autoscaling is slow with an external metastore

Improve autoscaling performance by installing the metastore jars only on the driver.

Last published at: May 16th, 2022

Problem

You have an external metastore configured on your cluster and autoscaling is enabled, but the cluster is slow to add executors when it scales up.

Cause

You are copying the metastore jars to every executor, when they are only needed in the driver.

It takes time to initialize and run the jars every time a new executor spins up. As a result, adding more executors takes longer than it should.

Solution

You should configure your cluster so the metastore jars are only copied to the driver.

Option 1: Use an init script to copy the metastore jars.

Create a cluster with spark.sql.hive.metastore.jars set to maven and spark.sql.hive.metastore.version set to the version of your metastore.
Start the cluster and search the driver logs for a line that includes Downloaded metastore jars to.
```
17/11/18 22:41:19 INFO IsolatedClientLoader: Downloaded metastore jars to <path>
```
<path> is the location of the downloaded jars on the driver node of the cluster.
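If you have saved the driver log as text, the path can be extracted programmatically rather than by eye. The following is a minimal sketch: the function name `find_metastore_jar_path`, the sample log lines, and the sample path `/local_disk0/tmp/hive-example` are illustrative assumptions, not values taken from a real cluster.

```python
import re

# Matches the IsolatedClientLoader message shown above and captures the path.
_PATTERN = re.compile(r"Downloaded metastore jars to (?P<path>\S+)")


def find_metastore_jar_path(log_lines):
    """Return the path from the first matching log line, or None if absent."""
    for line in log_lines:
        match = _PATTERN.search(line)
        if match:
            return match.group("path")
    return None


# Hypothetical log excerpt; the path is made up for illustration.
sample = [
    "17/11/18 22:41:18 INFO SomethingElse: unrelated message",
    "17/11/18 22:41:19 INFO IsolatedClientLoader: "
    "Downloaded metastore jars to /local_disk0/tmp/hive-example",
]

print(find_metastore_jar_path(sample))  # prints /local_disk0/tmp/hive-example
```

The returned value is what you substitute for <path> in the next step.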

Copy the jars to a DBFS location.

%sh

cp -r <path> /dbfs/ExternalMetaStore_jar_location
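Before creating the init script, it can help to confirm that the copy actually placed jar files in the DBFS location. This is a small sketch using only the Python standard library; the helper names `list_jars` and `check_jars_copied` are hypothetical.

```python
from pathlib import Path


def list_jars(directory):
    """Return the sorted file names of every .jar found under directory."""
    return sorted(p.name for p in Path(directory).rglob("*.jar"))


def check_jars_copied(directory):
    """Raise if directory holds no jars; otherwise return how many it holds."""
    jars = list_jars(directory)
    if not jars:
        raise FileNotFoundError(f"No .jar files found under {directory}")
    return len(jars)
```

In a %python notebook cell you could call check_jars_copied("/dbfs/ExternalMetaStore_jar_location"), assuming the copy above used that destination.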

Create the init script.

%python

dbutils.fs.put("dbfs:/databricks/<init-script-folder>/external-metastore-jars-to-driver.sh",
"""#!/bin/bash
# Copy the metastore jars only on the driver node; executors skip this block.
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  mkdir -p /databricks/metastorejars/
  cp -r /dbfs/ExternalMetaStore_jar_location/* /databricks/metastorejars/
fi
""", True)

Install the init script that you just created as a cluster-scoped init script (AWS | Azure | GCP).
You will need the full path to the location of the script (dbfs:/databricks/<init-script-folder>/external-metastore-jars-to-driver.sh).
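Change spark.sql.hive.metastore.jars from maven to /databricks/metastorejars/* so the driver loads the jars from the directory that the init script populates.
Restart the cluster.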
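If you manage clusters through the API rather than the UI, the same settings can be expressed as a fragment of a cluster specification. The sketch below only builds the fragment and does not call any API. The helper name `build_cluster_settings` is hypothetical, the `init_scripts` and `spark_conf` field shapes are believed to follow the Databricks Clusters API and should be checked against the documentation for your workspace, and pointing spark.sql.hive.metastore.jars at /databricks/metastorejars/* is an assumption based on where the init script above copies the jars.

```python
import json


def build_cluster_settings(init_script_path, hive_version,
                           jars_dir="/databricks/metastorejars"):
    """Build the init-script and Spark config portion of a cluster spec."""
    return {
        # Cluster-scoped init script stored on DBFS.
        "init_scripts": [{"dbfs": {"destination": init_script_path}}],
        "spark_conf": {
            "spark.sql.hive.metastore.version": hive_version,
            # Assumption: load the jars from the directory the script fills.
            "spark.sql.hive.metastore.jars": f"{jars_dir}/*",
        },
    }


settings = build_cluster_settings(
    "dbfs:/databricks/<init-script-folder>/external-metastore-jars-to-driver.sh",
    "<hive-version>",
)
print(json.dumps(settings, indent=2))
```

The placeholders are left unresolved on purpose; replace them with your own folder and Hive version before using the fragment.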

Option 2: Use the Apache Spark configuration settings to copy the metastore jars to the driver.

Enter the following settings into your Spark config (AWS | Azure | GCP):

spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<mysql-host>:<mysql-port>/<metastore-db>
spark.hadoop.javax.jdo.option.ConnectionDriverName <driver>
spark.hadoop.javax.jdo.option.ConnectionUserName <mysql-username>
spark.hadoop.javax.jdo.option.ConnectionPassword <mysql-password>
spark.sql.hive.metastore.version <hive-version>
spark.sql.hive.metastore.jars /dbfs/metastore/jars/*

The source path can be external mounted storage or DBFS.
The metastore configuration can be applied globally within the workspace by using cluster policies (AWS | Azure | GCP).
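As an illustration of the cluster policy approach, the sketch below turns a set of Spark settings into fixed policy rules. It assumes the cluster policy definition format in which each attribute path (for example spark_conf.<key>) maps to a rule such as {"type": "fixed", "value": ...}; confirm this against the cluster policies documentation for your cloud. The helper name `fixed_spark_conf_policy` is hypothetical, and the connection settings are omitted here for brevity.

```python
import json


def fixed_spark_conf_policy(spark_conf):
    """Turn {key: value} Spark settings into fixed cluster-policy rules."""
    return {
        f"spark_conf.{key}": {"type": "fixed", "value": value}
        for key, value in spark_conf.items()
    }


# Placeholders mirror the Spark config above; substitute your own values.
metastore_conf = {
    "spark.sql.hive.metastore.version": "<hive-version>",
    "spark.sql.hive.metastore.jars": "/dbfs/metastore/jars/*",
}

policy = fixed_spark_conf_policy(metastore_conf)
print(json.dumps(policy, indent=2))
```

Fixed rules prevent users from overriding the metastore settings on clusters created under the policy, which keeps the configuration consistent across the workspace.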

Option 3: Build a custom Databricks container with preloaded jars on AWS or Azure.

Review the documentation on customizing containers with Databricks Container Services.
