Azure - Databricks

Databricks administration (Azure)

How to discover who deleted a cluster in Azure portal

Learn how to discover who deleted an Azure Databricks cluster....

Last updated: October 3rd, 2023 by Adam Pavlacka

How to discover who deleted a workspace in Azure portal

Learn how to discover who deleted an Azure Databricks workspace....

Last updated: December 6th, 2023 by Adam Pavlacka

Find your workspace ID

Learn how to find your Databricks workspace ID in the web UI as well as via a notebook command....

Last updated: October 25th, 2022 by sivaprasad.cs

Failed to add user error due to email or username already existing with a different case

You should ensure casing for usernames is consistent across all accounts and providers in your system....

Last updated: January 20th, 2023 by harrison.schueler

Cannot access Databricks secrets when using a "No isolation shared" cluster

You cannot use dbutils.secrets.get() when admin protection for No isolation shared clusters is enabled in your account....

Last updated: March 31st, 2023 by sivaprasad.cs

Cloud infrastructure (Azure)

Configure custom DNS settings using dnsmasq

Learn how to configure custom DNS settings using dnsmasq....

Last updated: December 22nd, 2022 by brian.sears

How to analyze user interface performance issues

Learn how to troubleshoot Databricks user interface performance issues....

Last updated: December 20th, 2023 by Adam Pavlacka

Unable to mount Azure Data Lake Storage Gen1 account

Learn how to resolve errors that occur when mounting Azure Data Lake Storage Gen1 to Databricks....

Last updated: February 25th, 2022 by Adam Pavlacka

Assign a single public IP for VNet-injected workspaces using Azure Firewall

Learn how to assign a single public IP address for a Databricks workspace in a virtual network using Azure Firewall....

Last updated: December 7th, 2022 by Adam Pavlacka

Network configuration of Azure Data Lake Storage Gen1 causes ADLException: Error getting info for file

Learn how to resolve credential passthrough failure when using Azure Data Lake Storage Gen1 with Databricks....

Last updated: December 7th, 2022 by Adam Pavlacka

Jobs are not progressing in the workspace

Learn how to troubleshoot a "WARN TaskSchedulerImpl Initial job has not accepted any resources" error message in a Databricks workspace in a virtual network using Azure Firewall....

Last updated: December 7th, 2022 by Adam Pavlacka

SAS requires current ABFS client

SAS requires the current ABFS client; old clients generate an `IllegalArgumentException` error....

Last updated: December 7th, 2022 by kavya.parag

Business intelligence tools (Azure)

Configure Simba ODBC driver with a proxy in Windows

How to configure the Simba ODBC driver to connect through a proxy server when using Windows....

Last updated: March 2nd, 2022 by jordan.hicks

Configure Simba JDBC driver using Azure AD

Access Databricks with a Simba JDBC driver using an Azure user account and Azure AD authentication....

Last updated: December 7th, 2022 by arvind.ravish

Power BI proxy and SSL configuration

Learn how to set up Power BI with a proxy or VPN....

Last updated: December 7th, 2022 by Adam Pavlacka

Clusters (Azure)

Enable OpenJSSE and TLS 1.3

Add OpenJSSE to allow the use of TLS 1.3 for encrypted data transmission....

Last updated: March 2nd, 2022 by Adam Pavlacka

How to calculate the number of cores in a cluster

Learn how to calculate the number of cores in a Databricks cluster....

Last updated: March 2nd, 2022 by Adam Pavlacka

Install a private PyPI repo

How to install libraries from private PyPI repositories....

Last updated: March 4th, 2022 by darshan.bargal

IP access list update returns INVALID_STATE

Cannot update IP access list. INVALID_STATE error message....

Last updated: March 4th, 2022 by Gobinath.Viswanathan

Cannot apply updated cluster policy

When performing an update to an existing cluster policy, the update does not apply unless you remove and re-add the policy....

Last updated: March 4th, 2022 by jordan.hicks

Cluster Apache Spark configuration not applied

Values set in your cluster's Spark configuration are not applying correctly....

Last updated: March 4th, 2022 by Gobinath.Viswanathan

Cluster failed to launch

Learn how to resolve cluster launch failures....

Last updated: March 4th, 2022 by Adam Pavlacka

Custom Docker image requires root

Custom Docker containers must be configured to start as the root user when used with Databricks....

Last updated: March 4th, 2022 by dayanand.devarapalli

Job fails due to cluster manager core instance request limit

Learn how to troubleshoot Databricks errors related to API rate limits....

Last updated: March 4th, 2022 by Adam Pavlacka

Admin user cannot restart cluster to run job

Learn how to re-grant privileges to Databricks Admin users....

Last updated: March 4th, 2022 by Adam Pavlacka

Cluster fails to start with dummy does not exist error

Cluster is not starting due to a `dummy does not exist` Apache Spark error message....

Last updated: March 4th, 2022 by arvind.ravish

Cluster slowdown due to Ganglia metrics filling root partition

Resolve cluster slowdowns due to a Ganglia metric data explosion filling the root partition....

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

Failed to create cluster with invalid tag value

Cluster creation fails if optional tag values do not conform to cloud vendor requirements....

Last updated: March 4th, 2022 by kavya.parag

Persist Apache Spark CSV metrics to a DBFS location

Persist Spark CSV metrics to a sink in a DBFS location....

Last updated: March 4th, 2022 by Adam Pavlacka

Replay Apache Spark events in a cluster

Use a single node cluster to replay another cluster's event log in the Spark UI....

Last updated: February 10th, 2023 by arjun.kaimaparambilrajan

Set Apache Hadoop core-site.xml properties

Set Apache Hadoop core-site.xml properties in a Databricks cluster....

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

Set executor log level

Learn how to set the log levels on Databricks executors....

Last updated: March 4th, 2022 by Adam Pavlacka

Apache Spark job doesn’t start

Learn how to troubleshoot a Databricks Spark job that won't start....

Last updated: March 4th, 2022 by Adam Pavlacka

Auto termination is disabled when starting a job cluster

Auto termination policies are not supported on job clusters....

Last updated: August 23rd, 2022 by navya.athiraram

Unexpected cluster termination

Learn how to troubleshoot a Databricks cluster that stopped unexpectedly....

Last updated: March 4th, 2022 by Adam Pavlacka

How to configure single-core executors to run JNI libraries

Learn how to configure single-core executors to run JNI libraries on Databricks....

Last updated: March 4th, 2022 by Adam Pavlacka

How to overwrite log4j configurations on Databricks clusters

Learn how to overwrite log4j configurations on Databricks clusters....

Last updated: February 29th, 2024 by Adam Pavlacka

Apache Spark executor memory allocation

Understand how Spark executor memory allocation works in a Databricks cluster....

Last updated: March 4th, 2022 by Adam Pavlacka

Apache Spark UI shows less than total node memory

Learn what to do when the Spark UI shows less memory than is actually available on the node....

Last updated: July 22nd, 2022 by Adam Pavlacka

Configure a cluster to use a custom NTP server

Configure your clusters to use a custom NTP server (public or private) instead of using the default server....

Last updated: December 8th, 2022 by xin.wang

Enable GCM cipher suites

Enable AES-GCM encryption (GCM cipher suites) for use with SSL connections to other clusters. Resolve javax.net.ssl.SSLHandshakeException error....

Last updated: December 8th, 2022 by xin.wang

Enable retries in init script

Add a retry function to your init script....

Last updated: March 4th, 2022 by arjun.kaimaparambilrajan

Cannot set a custom PYTHONPATH

Setting a custom PYTHONPATH in an init script or in DCS is not supported....

Last updated: September 13th, 2022 by prakash.jha

Run a custom Databricks runtime on your cluster

Configure your cluster to run a custom Databricks runtime image via the UI or API....

Last updated: October 26th, 2022 by rakesh.parija

Cluster init script fails with mirror sync in progress error

If the mirror you are using is not in sync with the main repository, apt-get update returns a Mirror sync in progress error....

Last updated: October 31st, 2022 by harrison.schueler

Slow cluster launch and missing nodes

Learn how to resolve a "nodes could not be acquired" error when starting a Databricks ....

Last updated: December 8th, 2022 by Adam Pavlacka

IP address limit prevents cluster creation

Learn how to fix a public IP address quota limit Cloud Provider Launch error when starting a Databricks cluster....

Last updated: May 30th, 2023 by laila.haddad

CPU core limit prevents cluster creation

Learn how to fix a CPU core quota limit Cloud Provider Launch error when starting a Databricks cluster....

Last updated: December 8th, 2022 by Adam Pavlacka

Custom garbage collection prevents cluster launch

Using a custom garbage collection algorithm on Databricks Runtime 10.0 and above prevents the cluster from starting....

Last updated: December 8th, 2022 by harikrishnan.kunhumveettil

SSH to the cluster driver node

How to SSH to the Apache Spark cluster driver node in an Azure virtual network...

Last updated: March 15th, 2023 by xin.wang

Adding a configuration setting overwrites all default spark.executor.extraJavaOptions settings

Learn how to resolve overwritten configuration settings in Databricks....

Last updated: December 8th, 2022 by Adam Pavlacka

UnknownHostException on cluster launch

Troubleshoot an UnknownHostException on cluster launch. This is often a DNS configuration issue....

Last updated: December 8th, 2022 by arnab.saha

Pin cluster configurations using the API

Pin up to 100 compute cluster configurations using the API....

Last updated: December 21st, 2022 by simran.arora

Unpin cluster configurations using the API

Unpin compute cluster configurations using the API....

Last updated: December 21st, 2022 by simran.arora

R commands fail on custom Docker cluster

R version 4.2.0 changed the way Renviron.site is initialized, so you must set an environment variable when using custom Docker clusters....

Last updated: January 20th, 2023 by Atanu.Sarkar

Apache Spark UI task logs intermittently return HTTP 500 error

If the Spark property spark.databricks.ui.logViewingEnabled is set to false, you cannot view task logs in the Spark UI....

Last updated: March 17th, 2023 by vivian.wilfred

Legacy global init script migration notebook

Easily migrate your legacy global init scripts to the current global init script framework....

Last updated: August 28th, 2023 by Adam Pavlacka

Python kernel is unresponsive error message

Learn how to identify and troubleshoot the cause of an unresponsive Python kernel error....

Last updated: April 17th, 2023 by laila.haddad

Spark image download failure error message

Learn how to troubleshoot the Spark image download failure error message....

Last updated: April 17th, 2023 by laila.haddad

Disable cluster-scoped init scripts on DBFS

Set a cluster policy to prevent users from creating clusters that load cluster-scoped init scripts from DBFS....

Last updated: May 2nd, 2023 by Adam Pavlacka

Cluster-named and cluster-scoped init script migration notebook

Easily migrate your cluster-named and cluster-scoped init scripts to cluster-scoped init scripts stored as workspace files....

Last updated: February 27th, 2024 by Adam Pavlacka

Cluster fails with Fatal uncaught exception error. Failed to bind.

If other software uses port 6062, it can conflict with the IPython kernel REPL and prevent the driver node from starting....

Last updated: July 17th, 2023 by simran.arora

Log delivery feature not generating log4j logs for executor folders

Log delivery only generates a log file for the driver folder. This is by design....

Last updated: November 30th, 2023 by Adam Pavlacka

Use a cluster policy to disable Photon

You can use cluster policies to prevent users from creating clusters with Photon enabled....

Last updated: November 30th, 2023 by Adam Pavlacka

Shorten cluster provisioning time by using Docker containers

Learn how to speed up cluster provisioning by using Docker container services...

Last updated: November 30th, 2023 by Adam Pavlacka

DBFS init script detection notebook

Scan your workspace for init scripts on DBFS....

Last updated: March 26th, 2024 by Adam Pavlacka

Workspace is not UC enabled

Troubleshooting errors related to workspace not being UC enabled...

Last updated: December 4th, 2023 by Adam Pavlacka

Migration guidance for init scripts on DBFS

Init scripts on DBFS are end-of-life. You should migrate them to cloud storage, Unity Catalog volumes, or workspace files....

Last updated: February 5th, 2024 by Adam Pavlacka

Data management (Azure)

Append to a DataFrame

Learn how to append to a DataFrame in Databricks....

Last updated: March 4th, 2022 by Adam Pavlacka

How to improve performance with bucketing

Learn how to improve Databricks performance by using bucketing....

Last updated: February 29th, 2024 by Adam Pavlacka

Simplify chained transformations

Learn how to simplify chained transformations on your DataFrame in Databricks....

Last updated: May 25th, 2022 by Adam Pavlacka

How to dump tables in CSV, JSON, XML, text, or HTML format

Learn how to output tables from Databricks in CSV, JSON, XML, text, or HTML format....

Last updated: May 25th, 2022 by Adam Pavlacka

Hive UDFs

Learn how to create and use a Hive UDF for Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Prevent duplicated columns when joining two DataFrames

Learn how to prevent duplicated columns when joining two DataFrames in Databricks....

Last updated: October 13th, 2022 by Adam Pavlacka

Revoke all user privileges

Use a regex and a series of for loops to revoke all privileges for a single user....

Last updated: May 31st, 2022 by pavan.kumarchalamcharla

How to list and delete files faster in Databricks

Learn how to list and delete files faster in Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

How to handle corrupted Parquet files with different schema

Learn how to read Parquet files with a specific schema using Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

No USAGE permission on database

User does not have USAGE permission on the database....

Last updated: May 31st, 2022 by rakesh.parija

Nulls and empty strings in a partitioned column save as nulls

Learn why nulls and empty strings in a partitioned column save as nulls in Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Behavior of the randomSplit method

Learn about inconsistent behaviors when using the randomSplit method in Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Generate schema from case class

Learn how to generate a schema from a Scala case class....

Last updated: May 31st, 2022 by Adam Pavlacka

How to specify skew hints in dataset and DataFrame-based join commands

Learn how to specify skew hints in Dataset and DataFrame-based join commands in Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

How to update nested columns

Learn how to update nested columns in Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Incompatible schema in some files

Learn how to resolve incompatible schema in Parquet files with Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Unable to infer schema for ORC error

Apache Spark returns an error for ORC files if no schema is defined when reading from an empty directory or a base path with multiple subfolders....

Last updated: December 1st, 2022 by chandana.koppal

Access files written by Apache Spark on ADLS Gen1

Configure permissions to allow access to files that Apache Spark writes to ADLS Gen1 storage....

Last updated: December 9th, 2022 by dayanand.devarapalli

Object ownership is getting changed on dropping and recreating tables

Use TRUNCATE or REPLACE for tables and ALTER VIEW for views instead of dropping and recreating them....

Last updated: December 15th, 2022 by akash.bhat

User does not have permission SELECT on ANY File

Regular users cannot create tables without permission when access control is enabled....

Last updated: May 16th, 2023 by sivaprasad.cs

Data sources (Azure)

Create tables on JSON datasets

Create tables on JSON datasets; requires SerDe JAR....

Last updated: May 31st, 2022 by ram.sankarasubramanian

Failure when mounting or accessing Azure Blob storage

Learn how to resolve a failure when mounting or accessing Azure Blob storage from Databricks....

Last updated: May 31st, 2022 by Adam Pavlacka

Unable to read files and list directories in a WASB filesystem

Learn how to interpret errors that occur when accessing WASB append blob types in Databricks....

Last updated: June 1st, 2022 by Adam Pavlacka

Optimize read performance from JDBC data sources

Learn how to optimize performance when reading from JDBC data sources in Databricks....

Last updated: June 1st, 2022 by Adam Pavlacka

Troubleshooting JDBC/ODBC access to Azure Data Lake Storage Gen2

Learn how to troubleshoot JDBC and ODBC access to Azure Data Lake Storage Gen2 from Databricks....

Last updated: June 1st, 2022 by Adam Pavlacka

CosmosDB-Spark connector library conflict

Learn how to resolve conflicts that arise when using the CosmosDB-Spark connector library with Databricks....

Last updated: June 1st, 2022 by Adam Pavlacka

Failure to detect encoding in JSON

Learn how to resolve a failure to detect encoding of input JSON files when using BOM with Databricks....

Last updated: June 1st, 2022 by Adam Pavlacka

Inconsistent timestamp results with JDBC applications

Timestamp records are inconsistent with JDBC applications when daylight saving time adjustments are made....

Last updated: June 1st, 2022 by manjunath.swamy

Kafka client terminated with OffsetOutOfRangeException

Kafka client is terminated with `OffsetOutOfRangeException` when trying to fetch messages...

Last updated: June 1st, 2022 by vikas.yadav

ABFS client hangs if incorrect client ID or wrong path used

Trying to access an Azure Blob File System (ABFS) path results in a hung command when using Azure Data Lake Storage Gen2 (ADLS)....

Last updated: June 1st, 2022 by Adam Pavlacka

Reading a table fails due to AAD token timeout on ADLS Gen2

Accessing ADLS Gen2 storage fails if the AAD service principal token is expired or invalid....

Last updated: November 30th, 2022 by John.Lourdu

Recursive references in Avro schema are not allowed

Apache Avro data sources cannot have recursive references in the schema when used with Spark....

Last updated: December 1st, 2022 by saikrishna.pujari

Error when reading data from ADLS Gen1 with Sparklyr

Learn how to resolve errors that occur when reading data from Azure Data Lake Storage Gen1 with Sparklyr in Databricks....

Last updated: December 9th, 2022 by Adam Pavlacka

Long jobs fail when accessing ADLS

Long running jobs that use Azure AD credential passthrough to access ADLS fail after 1 hour....

Last updated: December 9th, 2022 by huaming.liu

ADLS and WASB writes are being throttled

Learn how to resolve a "files and folders are being created at too high a rate" ADLS or WASB storage error....

Last updated: December 9th, 2022 by Adam Pavlacka

Unable to access Azure Data Lake Storage (ADLS) Gen1 when firewall is enabled

Learn how to troubleshoot access issues when connecting to Azure Data Lake Storage Gen1 from Databricks with a firewall enabled....

Last updated: December 9th, 2022 by Adam Pavlacka

SQL access control error when using Snowflake as a data source

Snowflake does not officially support schema as an option; you must use sfschema....

Last updated: January 20th, 2023 by John.Lourdu

Databricks File System (Azure)

Cannot read Databricks objects stored in the DBFS root directory

Learn what to do when you cannot read Databricks objects stored in the DBFS root directory....

Last updated: March 8th, 2022 by Adam Pavlacka

How to specify the DBFS path

Learn how to specify the DBFS path in Apache Spark, Bash, DBUtils, Python, and Scala....

Last updated: December 9th, 2022 by ram.sankarasubramanian

Operation not supported during append

...

Last updated: July 7th, 2022 by Adam Pavlacka

Parallelize filesystem operations

Parallelize Apache Spark filesystem operations with DBUtils and Hadoop FileUtil; emulate DistCp....

Last updated: August 4th, 2022 by sandeep.chandran

Upload large files using DBFS API 2.0 and PowerShell

Use PowerShell and the DBFS API to upload large files to your Databricks workspace....

Last updated: September 27th, 2022 by ravirahul.padmanabhan

Remount a storage account after rotating access keys

Cannot access storage after rotating access keys until all mount points using the account have been remounted....

Last updated: December 9th, 2022 by dayanand.devarapalli

FileReadException on DBFS mounted filesystem

Use dbutils.fs.refreshMounts() to refresh mount points before referencing a DBFS path in your Spark job....

Last updated: April 11th, 2023 by Gobinath.Viswanathan

Databricks SQL (Azure)

Null column values display as NaN

Null column values correctly display as NaN in Databricks SQL....

Last updated: March 4th, 2022 by Adam Pavlacka

Retrieve queries owned by a disabled user

How to retrieve queries owned by a disabled user in Databricks SQL....

Last updated: March 4th, 2022 by John.Lourdu

Job timeout when connecting to a SQL endpoint over JDBC

Increase the SocketTimeout value in the JDBC connection URL to prevent thread requests from timing out....

Last updated: January 20th, 2023 by Atanu.Sarkar

Slowness when fetching results in Databricks SQL

Ensure that cloud fetch is enabled for best performance when using ODBC/JDBC to fetch results....

Last updated: February 3rd, 2023 by emad.rizkallah

ZORDER results in "Hilbert indexing can only be used on 9 or fewer columns" error

OPTIMIZE ZORDER BY command has a hard limit of nine columns....

Last updated: March 15th, 2023 by emad.rizkallah

Cannot customize Apache Spark config in Databricks SQL warehouse

You can only configure a limited set of global Spark properties when using a SQL warehouse....

Last updated: March 15th, 2023 by mounika.tarigopula

SQL warehouse launch fails to start with "PERMISSION_DENIED"

Ensure the necessary permissions are provided to the SQL warehouse owner...

Last updated: January 3rd, 2024 by rohit.menon

Update the Databricks SQL warehouse owner

Learn how to use the API to transfer ownership of a SQL warehouse to a new owner...

Last updated: February 29th, 2024 by simran.arora

Developer tools (Azure)

Apache Spark session is null in DBConnect

A `sparkSession is null while trying to executeCollectResult` error message occurs when using DBConnect....

Last updated: April 1st, 2022 by Jose Gonzalez

Databricks Connect reports version error with Databricks Runtime 6.4

...

Last updated: May 9th, 2022 by rakesh.parija

Failed to create process error with Databricks CLI in Windows

Databricks CLI may not work correctly in Windows if your Python path has a space in it....

Last updated: May 9th, 2022 by John.Lourdu

GeoSpark undefined function error with DBConnect

Use GeoSpark code with a DBConnect session....

Last updated: June 1st, 2022 by arjun.kaimaparambilrajan

Get Apache Spark config in DBConnect

Use a REST API call and DBConnect to get the Apache Spark configuration for your cluster....

Last updated: May 9th, 2022 by arvind.ravish

Invalid Access Token error when running jobs with Airflow

Learn what to do when you receive an Invalid Access Token error when using Databricks jobs with Airflow....

Last updated: May 9th, 2022 by Adam Pavlacka

ProtoSerializer stack overflow error in DBConnect

A stack overflow error in DBConnect indicates that you need to allocate more memory on the local PC....

Last updated: May 9th, 2022 by ashritha.laxminarayana

Use tcpdump to create pcap files

Analyze network traffic between nodes on a specific cluster by using tcpdump to create pcap files....

Last updated: April 10th, 2023 by pavan.kumarchalamcharla

Terraform registry does not have a provider error

You cannot install the Databricks Terraform provider if the required_providers block is not defined in your modules....

Last updated: August 16th, 2022 by prabakar.ammeappin

Common errors using Azure Data Factory

Learn about solutions and explanations for common errors when using Azure Data Factory with Azure Databricks....

Last updated: February 23rd, 2023 by Adam Pavlacka

Databricks Connect job fails after a Databricks Runtime update

Use the most recent version of Databricks Connect that matches your Databricks Runtime version to avoid an error....

Last updated: July 27th, 2023 by Rajeev kannan Thangaiah

Delta Lake (Azure)

A file referenced in the transaction log cannot be found

A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement....

Last updated: May 10th, 2022 by Adam Pavlacka

Compare two versions of a Delta table

Use time travel to compare two versions of a Delta table....

Last updated: May 10th, 2022 by mathan.pillai

Converting from Parquet to Delta Lake fails

Converting a file from Parquet to Delta Lake fails with a partition error when you have a subdirectory. Expecting 0 partition column(s), but found 1 partition column(s)...

Last updated: May 10th, 2022 by Jose Gonzalez

Delta Merge cannot resolve nested field

Delta Merge fails with a `Delta Merge cannot resolve 'field' due to data type mismatch` error message....

Last updated: May 10th, 2022 by Adam Pavlacka

Delete your streaming query checkpoint and restart

Delta table doesn't exist. Please delete your streaming query checkpoint and restart....

Last updated: May 10th, 2022 by Adam Pavlacka

How Delta cache behaves on an autoscaling cluster

Learn how Delta cache behaves on an autoscaling cluster....

Last updated: May 10th, 2022 by Adam Pavlacka

How to improve performance of Delta Lake MERGE INTO queries using partition pruning

Learn how to use partition pruning to improve the performance of Delta Lake MERGE INTO queries....

Last updated: June 1st, 2023 by Adam Pavlacka

Best practices for dropping a managed Delta Lake table

Learn the best practices for dropping a managed Delta Lake table....

Last updated: May 10th, 2022 by Adam Pavlacka

How to populate or update columns in an existing Delta table

Learn how to populate or update columns in an existing Delta table....

Last updated: May 10th, 2022 by Adam Pavlacka

Identify duplicate data on append operations

...

Last updated: May 10th, 2022 by chetan.kardekar

Optimize a Delta sink in a structured streaming application

Optimize your Delta sink by using a mod value on the batchId to optimize when foreachBatch runs....

Last updated: May 10th, 2022 by mathan.pillai

Delta Lake UPDATE query fails with IllegalState exception

Learn how to resolve an issue with Delta Lake UPDATE, DELETE, or MERGE queries that use Python UDFs....

Last updated: May 10th, 2022 by Adam Pavlacka

Unable to cast string to varchar

Use varchar type in Databricks Runtime 8.0 and above. It can only be used in table schema. It cannot be used in functions or operators....

Last updated: May 10th, 2022 by DD Sharma

Vacuuming with zero retention results in data loss

To avoid data loss, do not disable spark.databricks.delta.retentionDurationCheck.enabled and run VACUUM with a retention of zero....

Last updated: October 7th, 2022 by DD Sharma

Z-Ordering will be ineffective, not collecting stats

Z-Ordering is ineffective, error about not collecting stats. Reorder table so the columns you want to optimize on are within the first 32 columns....

Last updated: May 10th, 2022 by mathan.pillai

Change cluster config for Delta Live Tables pipeline

Customize the cluster configuration when using a Delta Live Tables pipeline....

Last updated: July 1st, 2022 by pratik.bhawsar

Different tables with same data generate different plans when used in same query

Ensure that tables with the same data generate the same physical plans with Spark SQL....

Last updated: October 14th, 2022 by deepak.bhutada

Allow spaces and special characters in nested column names with Delta tables

Upgrade to Databricks Runtime 10.2 or later and use column mapping mode to allow spaces and special characters in column names....

Last updated: October 26th, 2022 by shanmugavel.chandrakasu

Delta writing empty files when source is empty

Delta can write empty files under Databricks Runtime 7.3 LTS. You should upgrade to Databricks Runtime 9.1 LTS or above to resolve the issue....

Last updated: December 2nd, 2022 by Rajeev kannan Thangaiah

Delta Live Tables pipelines are not running VACUUM automatically

You must have a maintenance cluster defined for VACUUM to run automatically....

Last updated: February 2nd, 2023 by priyanka.biswas

VACUUM best practices on Delta Lake

Learn best practices for using, and troubleshooting, VACUUM on Delta Lake....

Last updated: February 3rd, 2023 by mathan.pillai

OPTIMIZE is only supported for Delta tables error on Delta Lake

Use CREATE OR REPLACE TABLE when moving Delta tables from one storage location to another....

Last updated: February 3rd, 2023 by mathan.pillai

Recover from a DELTA_LOG corruption error

Learn how to repair a Delta table that reports an IllegalStateException error when queried....

Last updated: February 17th, 2023 by gopinath.chandrasekaran

FileReadException when reading a Delta table

A FileReadException error occurs when you attempt to read from a Delta table. The underlying data has been deleted, or the storage blob was unmounted during a write....

Last updated: February 23rd, 2023 by Adam Pavlacka

Programmatically determine if a table is a Delta table or not

Use Python code in a Databricks notebook to determine if a table is a Delta table or not....

Last updated: March 16th, 2023 by mounika.tarigopula

RESOURCE_LIMIT_EXCEEDED error when querying a Delta Sharing table

Delta Sharing has limits on the metadata size of a shared table. If you exceed these limits it generates an error....

Last updated: April 19th, 2023 by Rajeev kannan Thangaiah

Found duplicate columns error blocks creation of a Delta table

Duplicate column names are not allowed in Delta tables....

Last updated: July 28th, 2023 by deepak.bhutada

Hive-style partitions not found on Delta table after enabling column mapping mode

Delta Lake column mapping does not support Hive-style partitions....

Last updated: February 21st, 2024 by Jose Gonzalez

Jobs (Azure)

Distinguish active and dead jobs

Learn how to distinguish between active and dead Databricks jobs....

Last updated: May 10th, 2022 by Adam Pavlacka

Spark job fails with Driver is temporarily unavailable

Job failure due to Driver being unavailable or unresponsive....

Last updated: April 17th, 2023 by Adam Pavlacka

How to delete all jobs using the REST API

Learn how to delete all Databricks jobs using the REST API....

Last updated: May 10th, 2022 by Adam Pavlacka

Job cluster limits on notebook output

Job clusters have a maximum notebook output size of 20 MB. If the output is larger, it results in an error....

Last updated: May 10th, 2022 by Jose Gonzalez

Job fails, but Apache Spark tasks finish

Your job fails, but all of the Apache Spark tasks have completed successfully. You are using spark.stop() or System.exit(0) in your code....

Last updated: May 10th, 2022 by harikrishnan.kunhumveettil

Job fails due to job rate limit

Learn how to resolve Databricks job failures due to job rate limits....

Last updated: April 17th, 2023 by Adam Pavlacka

Create table in overwrite mode fails when interrupted

Learn how to troubleshoot failures that occur when you rerun an Apache Spark write operation after cancelling the currently running job....

Last updated: May 10th, 2022 by Adam Pavlacka

Apache Spark Jobs hang due to non-deterministic custom UDF

Learn what to do when your Apache Spark job hangs due to a non-deterministic custom UDF....

Last updated: May 10th, 2022 by Adam Pavlacka

Apache Spark job fails with Failed to parse byte string

Apache Spark job fails with a Failed to parse byte string error....

Last updated: May 10th, 2022 by noopur.nigam

Apache Spark UI shows wrong number of jobs

Apache Spark UI shows the wrong number of active jobs....

Last updated: May 11th, 2022 by ashish

Job fails with atypical errors message

A job run is throttled and fails after the atypical errors message is observed....

Last updated: May 11th, 2022 by Adam Pavlacka

Apache Spark job fails with maxResultSize exception

Learn what to do when an Apache Spark job fails with a maxResultSize exception....

Last updated: May 11th, 2022 by Adam Pavlacka

Databricks job fails because library is not installed

Learn how to prevent Databricks jobs from failing due to uninstalled libraries....

Last updated: May 11th, 2022 by Adam Pavlacka

Job failure due to Azure Data Lake Storage (ADLS) CREATE limits

Learn what to do when your Databricks job fails due to Azure Data Lake Storage CREATE limits....

Last updated: May 11th, 2022 by Adam Pavlacka

Job fails with invalid access token

Jobs that run more than 48 hours fail with an invalid access token error when the dbutils token expires....

Last updated: May 11th, 2022 by manjunath.swamy

How to ensure idempotency for jobs

Learn how to ensure that jobs submitted through the Databricks REST API aren't duplicated if there is a retry after a request times out....

Last updated: May 11th, 2022 by Adam Pavlacka

Monitor running jobs with a Job Run dashboard

Learn about using the Job Run dashboard in a workspace....

Last updated: May 11th, 2022 by Adam Pavlacka

Streaming job has degraded performance

Streaming job has poor performance after stopping and restarting from same checkpoint....

Last updated: May 11th, 2022 by ashish

Task deserialization time is high

Configure cluster-installed libraries to install on executors at cluster launch rather than at executor launch to speed up your job task runs....

Last updated: February 23rd, 2023 by Adam Pavlacka

Pass arguments to a notebook as a list

Use a JSON file to temporarily store arguments that you want to use in your notebook....

Last updated: October 29th, 2022 by pallavi.gowdar
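As a rough illustration of the idea in the teaser above, a list of arguments can be written to a JSON file and read back as a list. This is a minimal sketch, not the article's own code: the helper names, the file name, and the sample values are hypothetical, and a local temporary directory stands in for whatever storage location both notebooks can reach.

```python
import json
import os
import tempfile


def save_args(args, path):
    """Write a list of arguments to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(args, f)


def load_args(path):
    """Read the list of arguments back from the JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical location and values, used only for illustration.
args_path = os.path.join(tempfile.gettempdir(), "notebook_args.json")
save_args(["2023-01-01", "2023-01-31", 42], args_path)

# The receiving side gets a real Python list back, not a single string.
restored = load_args(args_path)
```

JSON keeps the list structure and basic types (strings, numbers) intact, which a single string parameter would not.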

Uncommitted files causing data duplication

Partially uncommitted files from a failed write can result in apparent data duplication. Adjust VACUUM settings to resolve the issue....

Last updated: November 8th, 2022 by gopinath.chandrasekaran

Multi-task workflows using incorrect parameter values

If parallel tasks running on the same cluster use Scala companion objects, the wrong values can be used because the tasks share a single class in the JVM....

Last updated: December 5th, 2022 by Rajeev kannan Thangaiah

Job fails with Spark Shuffle FetchFailedException error

Disable the default Spark Shuffle service to work around a FetchFailedException error....

Last updated: December 5th, 2022 by shanmugavel.chandrakasu

Users unable to view job results when using remote Git source

Databricks does not manage permissions for remote repos, so you must sync changes with a local notebook so that non-admin users can view results....

Last updated: March 7th, 2023 by ravirahul.padmanabhan

Single scheduled job tries to run multiple times

Ensure your cron syntax is correct when scheduling jobs. A wildcard in the wrong position can produce unexpected results....

Last updated: January 20th, 2023 by monica.cao
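To make the wildcard problem concrete, here is a small sketch that assumes the schedule uses Quartz-style cron syntax, where the first field is seconds and the second is minutes. The helper name and the sample expressions are hypothetical. It only flags a wildcard in the seconds or minutes field, which is one common way a job intended to run once ends up firing repeatedly.

```python
# Field order assumed here: Quartz-style cron, seconds first.
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]


def leading_wildcards(expression):
    """Return the names of the seconds/minutes fields that are set to '*'."""
    parts = expression.split()
    if len(parts) not in (6, 7):  # an optional seventh field is the year
        raise ValueError("expected 6 or 7 space-separated fields")
    fields = dict(zip(QUARTZ_FIELDS, parts))
    return [name for name in ("seconds", "minutes") if fields[name] == "*"]


# Intended: once a day at 12:00:00.
intended = "0 0 12 * * ?"
# Mistyped: the minutes field is a wildcard, so it fires every minute from 12:00 to 12:59.
mistyped = "0 * 12 * * ?"

intended_issues = leading_wildcards(intended)
mistyped_issues = leading_wildcards(mistyped)
```

A check like this is deliberately narrow. It does not validate the rest of the expression.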

Jobs failing with shuffle fetch failures

Shuffle fetch failures can happen if you have modified the Azure Databricks subnet CIDR range after deployment....

Last updated: February 23rd, 2023 by arjun.kaimaparambilrajan

Add custom tags to a Delta Live Tables pipeline

Manually edit the JSON configuration file to add custom tags....

Last updated: February 24th, 2023 by John.Lourdu

Update notification settings for jobs with the Jobs API

You can use the Jobs API to add email notifications to some, or all, of the jobs in your workspace....

Last updated: March 17th, 2023 by manoj.hegde

Spark image download failure error message

Learn how to troubleshoot the Spark image download failure error message....

Last updated: April 17th, 2023 by laila.haddad

Python kernel is unresponsive error message

Learn how to identify and troubleshoot the cause of an unresponsive Python kernel error....

Last updated: April 17th, 2023 by laila.haddad

Stop all scheduled jobs

Use the included sample code to stop all of your scheduled jobs in the workspace....

Last updated: June 7th, 2023 by simran.arora

Job execution (Azure)

Increase the number of tasks per stage

Learn how to increase the number of tasks per stage when using the spark-xml package with Databricks....

Last updated: May 11th, 2022 by Adam Pavlacka

Maximum execution context or notebook attachment limit reached

Learn what to do when the maximum execution context or notebook attachment limit is reached in Databricks....

Last updated: May 15th, 2023 by rakesh.parija

Serialized task is too large

Learn what to do when a serialized task is too large in Databricks....

Last updated: March 15th, 2023 by Adam Pavlacka

Libraries (Azure)

Cannot import module in egg library

The module in the egg library cannot be imported. Easy install, Python....

Last updated: May 11th, 2022 by xin.wang

Cannot import TabularPrediction from AutoGluon

Cannot import TabularPrediction from AutoGluon v0.0.14 due to a namespace collision. Upgrade to AutoGluon v0.0.15....

Last updated: May 11th, 2022 by kavya.parag

Latest PyStan fails to install on Databricks Runtime 6.4

PyStan 3 doesn't install on Databricks Runtime 6.4 ES....

Last updated: May 11th, 2022 by rakesh.parija

Library unavailability causing job failures

Learn how to resolve Databricks job failures caused by unavailable libraries....

Last updated: May 11th, 2022 by Adam Pavlacka

How to correctly update a Maven library in Databricks

Learn how to correctly update a Maven library in Databricks....

Last updated: May 11th, 2022 by Adam Pavlacka

Init script fails to download Maven JAR

Cluster init script fails to download a Maven JAR when trying to install a library....

Last updated: May 11th, 2022 by arvind.ravish

Install package using previous CRAN snapshot

Avoid a package install error by installing from an earlier CRAN snapshot....

Last updated: May 11th, 2022 by darshan.bargal

Install PyGraphViz

Install PyGraphViz with all required dependencies....

Last updated: May 11th, 2022 by pavan.kumarchalamcharla

Install Turbodbc via init script

Install Turbodbc and its dependencies, libboost-all-dev, unixodbc-dev, and python-dev, with an init script....

Last updated: May 11th, 2022 by John.Lourdu

Cannot uninstall library from UI

Learn what to do when you can't uninstall a library using the Databricks user interface....

Last updated: May 11th, 2022 by Adam Pavlacka

Error when installing Cartopy on a cluster

Cartopy installation fails if libgeos and libproj are not installed....

Last updated: May 11th, 2022 by prem.jayaraj

Error when installing pyodbc on a cluster

Learn how to troubleshoot an error when installing pyodbc on a Databricks cluster....

Last updated: May 11th, 2022 by Adam Pavlacka

Libraries fail with dependency exception

Learn why notebook-scoped libraries trigger an Apache Spark dependency exception and return a requirement cannot be satisfied error....

Last updated: May 11th, 2022 by jordan.hicks

Libraries failing due to transient Maven issue

Library resolution failed. Cannot download some libraries due to transient Maven issue....

Last updated: May 11th, 2022 by dayanand.devarapalli

Reading .xlsx files with xlrd fails

xlrd no longer supports .xlsx files. Use openpyxl to read .xlsx files....

Last updated: May 12th, 2022 by prakash.jha

Remove Log4j 1.x JMSAppender and SocketServer classes from classpath

Remove Log4j 1.x JMSAppender and SocketServer classes from classpath....

Last updated: May 16th, 2022 by Adam Pavlacka

Replace a default library jar

Learn how to replace a default Java or Scala library jar with another version....

Last updated: May 16th, 2022 by ram.sankarasubramanian

Python command fails with AssertionError: wrong color format

Resolve a wrong color format AssertionError caused by nbconvert when a Python command fails....

Last updated: May 16th, 2022 by John.Lourdu

PyPMML fails with Could not find py4j jar error

...

Last updated: May 16th, 2022 by arjun.kaimaparambilrajan

TensorFlow fails to import

TensorFlow fails to import if you have an incompatible version of protobuf installed on your cluster....

Last updated: May 16th, 2022 by kavya.parag

Verify the version of Log4j on your cluster

Verify the version of Log4j installed on your cluster and upgrade if required....

Last updated: May 16th, 2022 by Adam Pavlacka

Apache Spark jobs fail with Environment directory not found error

Spark jobs appear to time out after you install a library because security rules are preventing workers from resolving the Python executable path....

Last updated: July 1st, 2022 by Adam Pavlacka

Copy installed libraries from one cluster to another

Copy libraries from a source cluster to a target cluster with a custom Python script....

Last updated: January 6th, 2023 by manoj.hegde

Failed to install Elasticsearch via Maven

If library dependencies are already installed, it can result in a library installation failure....

Last updated: March 17th, 2023 by ankitha.vijayanandana

Cluster fails to start with InvalidGroup.NotFound error

If the network security group policy is not correctly configured, your clusters will fail to start....

Last updated: December 21st, 2023 by Adam Pavlacka

Add libraries to a job cluster to reduce idle time

How to add libraries to a job cluster and reduce idle time in Databricks...

Last updated: December 4th, 2023 by Adam Pavlacka

PyArrow hotfix breaking change

PyArrow versions 0.14 - 14.0.0 contain a security vulnerability....

Last updated: December 6th, 2023 by Adam Pavlacka

OpenSSL SSL_connect: SSL_ERROR_SYSCALL error

Use a cluster-scoped init script to install necessary SSL certificates to resolve an SSL_ERROR_SYSCALL error....

Last updated: February 29th, 2024 by pavan.kumarchalamcharla

Machine learning (Azure)

Conda fails to download packages from Anaconda

Conda fails to download packages with PackagesNotFoundError when you try to install packages from Anaconda....

Last updated: May 16th, 2022 by mathan.pillai

Download artifacts from MLflow

How to download artifacts from MLflow to local storage....

Last updated: May 16th, 2022 by shanmugavel.chandrakasu

How to extract feature information for tree-based Apache SparkML pipeline models

Learn how to extract feature information for tree-based ML pipeline models in Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

Fitting an Apache SparkML model throws error

Learn how to resolve errors thrown by Databricks when fitting a SparkML model or pipeline....

Last updated: May 16th, 2022 by Adam Pavlacka

H2O.ai Sparkling Water cluster not reachable

An H2O.ai Sparkling Water cluster is not reachable if the version of the Sparkling Water package does not match the version of Spark used on your cluster....

Last updated: May 16th, 2022 by shanmugavel.chandrakasu

How to perform group K-fold cross validation with Apache Spark

Learn how to perform group K-fold cross validation with Apache Spark on Databricks....

Last updated: February 24th, 2023 by Adam Pavlacka

Error when importing OneHotEncoderEstimator

You get an error message when trying to import OneHotEncoderEstimator....

Last updated: May 16th, 2022 by Shyamprasad Miryala

MLflow project fails to access an Apache Hive table

Resolve "Table or view not found" error when an MLflow project fails to access an Apache Hive table....

Last updated: May 16th, 2022 by vikas.yadav

How to speed up cross-validation

Learn how to improve cross-validation performance in SparkML with Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

Hyperopt fails with maxNumConcurrentTasks error

Do NOT install Hyperopt on a Databricks Runtime for Machine Learning cluster....

Last updated: May 16th, 2022 by chetan.kardekar

Incorrect results when using documents as inputs

Your model does not return expected results when documents are input using TfidfVectorizer. JSON array...

Last updated: May 16th, 2022 by pradeepkumar.palaniswamy

Errors when accessing MLflow artifacts without using the MLflow client

Resolve errors when attempting to access MLflow artifacts without using the MLflow client...

Last updated: May 16th, 2022 by Adam Pavlacka

Experiment warning when custom artifact storage location is used

Resolve experiment warnings when a custom artifact storage location is used instead of the MLflow managed location....

Last updated: May 16th, 2022 by Adam Pavlacka

Experiment warning when legacy artifact storage location is used

Resolve experiment warnings when a legacy artifact storage location is used instead of the MLflow managed location....

Last updated: May 16th, 2022 by Adam Pavlacka

KNN model using pyfunc returns ModuleNotFoundError or FileNotFoundError

Predictions using pyfunc on a KNN model return a ModuleNotFoundError or FileNotFoundError....

Last updated: May 16th, 2022 by pradeepkumar.palaniswamy

OSError when accessing MLflow experiment artifacts

Resolve an `OSError` when trying to access, download, or log MLflow experiment artifacts....

Last updated: May 16th, 2022 by Adam Pavlacka

PERMISSION_DENIED error when accessing MLflow experiment artifact

Resolve a PERMISSION_DENIED error when trying to access MLflow experiment artifacts....

Last updated: May 16th, 2022 by Adam Pavlacka

Python commands fail on Machine Learning clusters

Python commands are failing on Databricks Runtime for Machine Learning clusters. Conda....

Last updated: May 16th, 2022 by arjun.kaimaparambilrajan

Runs are not nested when SparkTrials is enabled in Hyperopt

When SparkTrials is enabled in Hyperopt, MLflow runs are not nested under the parent run....

Last updated: May 16th, 2022 by pradeepkumar.palaniswamy

MLflow 'invalid access token' error

Long running ML tasks require an access token with an extended lifetime to ensure the tasks complete before the token expires....

Last updated: July 22nd, 2022 by shanmugavel.chandrakasu

Metastore (Azure)

Autoscaling is slow with an external metastore

Improve autoscaling performance by only installing metastore jars to the driver....

Last updated: May 16th, 2022 by Gobinath.Viswanathan

Data too long for column error

If a column exceeds 4000 characters it is too big for the default datatype and returns an error....

Last updated: May 16th, 2022 by Adam Pavlacka

Drop database without deletion

Use Hive commands to drop a database without deleting the underlying storage folder....

Last updated: May 24th, 2022 by arvind.ravish

How to create table DDLs to import into an external metastore

Learn how to export all table metadata from Hive to an external metastore from Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

Drop tables with corrupted metadata from the metastore

Learn how to drop tables that contain corrupted metadata from a metastore....

Last updated: May 16th, 2022 by Adam Pavlacka

Error in CREATE TABLE with external Hive metastore

CREATE TABLE error with MySQL 8.0 in external Hive metastore due to charset....

Last updated: May 16th, 2022 by jordan.hicks

AnalysisException when dropping table on Azure-backed metastore

Learn how to overcome an AnalysisException when dropping a table on an Azure-backed metastore....

Last updated: May 16th, 2022 by Adam Pavlacka

How to troubleshoot several Apache Hive metastore problems

Learn how to troubleshoot Apache Hive metastore problems....

Last updated: August 1st, 2023 by Adam Pavlacka

Listing table names

Learn how to list table names in Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

How to set up an embedded Apache Hive metastore

Learn how to set up an embedded Apache Hive metastore with Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

Japanese character support in external metastore

Use Japanese characters in tables in an external metastore....

Last updated: May 16th, 2022 by Adam Pavlacka

Parquet timestamp requires Hive metastore 1.2 or above

Update the Hive metastore to version 1.2 or above to use TIMESTAMP with a Parquet table....

Last updated: May 16th, 2022 by rakesh.parija

Failed to create query error when upgrading external metastore to Unity Catalog

The "Create query for upgrade" command only works when run on a warehouse in Data Explorer....

Last updated: March 15th, 2023 by Atanu.Sarkar

Metrics (Azure)

How to explore Apache Spark metrics with Spark listeners

Learn how to explore Apache Spark metrics using Spark listeners with Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

How to use Apache Spark metrics

Learn how to use Apache Spark metrics with Databricks....

Last updated: May 16th, 2022 by Adam Pavlacka

Notebooks (Azure)

How to check if a Spark property is modifiable in a notebook

Learn how to modify Spark properties in a Databricks notebook....

Last updated: May 16th, 2022 by Adam Pavlacka

JSON reader parses values as null

When you read a JSON file, the Spark JSON reader returns null values instead of the actual data....

Last updated: May 16th, 2022 by saritha.shivakumar

Common errors in notebooks

Learn about common errors from Databricks notebooks....

Last updated: May 16th, 2022 by Adam Pavlacka

display() does not show microseconds correctly

Use show() to display timestamp values with microsecond precision. display() is limited to millisecond precision....

Last updated: May 16th, 2022 by harikrishnan.kunhumveettil

Error: Received command c on object id p0

You see the error message `INFO:py4j.java_gateway:Received command c on object id p0` after running Python code with imported libraries....

Last updated: August 21st, 2023 by rakesh.parija

Failure when accessing or mounting storage

Do not mount storage to the root mount path....

Last updated: May 16th, 2022 by kiran.bharathi

Access notebooks owned by a deleted user

How to access Databricks notebooks owned by a deleted user....

Last updated: May 16th, 2022 by John.Lourdu

Notebook autosave fails due to file size limits

Learn what to do when your Databricks notebook fails to autosave due to file size limits....

Last updated: May 16th, 2022 by Adam Pavlacka

Troubleshooting unresponsive Python notebooks or canceled commands

Learn how to troubleshoot unresponsive Python notebooks and canceled commands in Databricks notebooks....

Last updated: May 17th, 2022 by Adam Pavlacka

Update job permissions for multiple users

Use the job permissions API to update permissions for multiple users....

Last updated: May 17th, 2022 by Atanu.Sarkar

Generate browser HAR files

Learn how to record HAR files in your web browser. These are very useful when troubleshooting UI issues....

Last updated: November 16th, 2023 by vivian.wilfred

Recover deleted notebooks from the Trash

Deleted items can be recovered from the Trash for 30 days after deletion....

Last updated: September 2nd, 2022 by vivian.wilfred

Get workspace configuration details

Display the complete configuration details for your Databricks workspace....

Last updated: February 29th, 2024 by kavya.parag

Iterate through all jobs in the workspace using Jobs API 2.1

Use the Jobs API 2.1 to iterate through and display a list of jobs in your workspace....

Last updated: July 28th, 2023 by debayan.mukherjee

Too many execution contexts are open right now

Reduce the number of notebooks used to limit the number of execution contexts required for your job....

Last updated: May 31st, 2023 by akash.bhat

Use an Azure AD service principal as compute ACL

Create an Azure AD service principal and use it for access control....

Last updated: December 21st, 2022 by venkatasai.vanaparthi

Python kernel is unresponsive error message

Learn how to identify and troubleshoot the cause of an unresponsive Python kernel error....

Last updated: April 17th, 2023 by laila.haddad

Generate a list of all workspace admins

Use the included sample code to generate a list of all workspace admins....

Last updated: June 7th, 2023 by simran.arora

Security and permissions (Azure)

Table creation fails with security exception

Learn what to do when table creation fails with a security exception....

Last updated: May 17th, 2022 by Adam Pavlacka

Troubleshoot key vault access issues

Troubleshoot Azure key vault access issues. Verify firewall. Enable secrets....

Last updated: February 25th, 2023 by arvind.ravish

Connection retries take a long time to fail

The default Apache Hadoop values for connection timeout and retry are high. Reduce the values to fail more quickly....

Last updated: December 21st, 2022 by sivaprasad.cs

"Unable to update Group Push mapping target" error when syncing Okta groups to workspace

You cannot SCIM sync users and groups directly to the workspace when identity federation is enabled....

Last updated: May 5th, 2023 by sivaprasad.cs

Bulk update workflow permissions for a group

Use this sample code to update a single group's permissions for all the jobs in a workspace....

Last updated: February 22nd, 2024 by simran.arora

Streaming (Azure)

Append output is not supported without a watermark

Append output mode is not supported on aggregated DataFrames without a watermark....

Last updated: May 17th, 2022 by Adam Pavlacka

Apache Spark DStream is not supported

DStreams are not supported in Databricks. Migrate from DStream API to Structured Streaming....

Last updated: May 17th, 2022 by Adam Pavlacka

Streaming with File Sink: Problems with recovery if you change checkpoint or output directories

Learn how to resolve issues that occur with recovery if you change checkpoint or output directories when streaming with File Sink....

Last updated: May 17th, 2022 by Adam Pavlacka

Get the path of files consumed by Auto Loader

Get the path and filename of all files consumed by Auto Loader and write them out as a new column....

Last updated: May 18th, 2022 by Adam Pavlacka

How to restart a structured streaming query from last written offset

Learn how to restart a structured streaming query from the last written offset....

Last updated: May 18th, 2022 by Adam Pavlacka

Kafka error: No resolvable bootstrap urls

A 'No resolvable bootstrap urls' error occurs when you try to read or write data to a Kafka stream....

Last updated: May 18th, 2022 by Adam Pavlacka

readStream() is not whitelisted error when running a query

A readStream() is not whitelisted error occurs on clusters that have table access control enabled....

Last updated: May 19th, 2022 by mathan.pillai

Checkpoint files not being deleted when using display()

Learn how to prevent display(streamingDF) checkpoint files from using a large amount of storage....

Last updated: May 19th, 2022 by Adam Pavlacka

Checkpoint files not being deleted when using foreachBatch()

Learn how to prevent foreachBatch() checkpoint files from using a large amount of storage....

Last updated: May 19th, 2022 by Adam Pavlacka

Conflicting directory structures error

You should use distinct paths in the storage location, otherwise conflicting directory structures may result in an error....

Last updated: May 19th, 2022 by ashish

RocksDB fails to acquire a lock

When using RocksDB as a state store, you may need to increase the acquire timeout in the SQL config....

Last updated: February 25th, 2023 by Adam Pavlacka

Stream XML files using an auto-loader

Stream XML files on Databricks by combining the auto-loading features of the Spark batch API with the OSS library Spark-XML....

Last updated: May 19th, 2022 by Adam Pavlacka

Streaming job gets stuck writing to checkpoint

Streaming job appears to be stuck even though no error is thrown. You are using DBFS for checkpoint storage, but it has filled up....

Last updated: May 19th, 2022 by Jose Gonzalez

Explicit path to data or a defined schema required for Auto loader

If you do not specify an explicit path to your data or define your data schema, you get an IllegalArgumentException error when you start an Auto Loader job....

Last updated: October 12th, 2022 by Jose Gonzalez

Optimize streaming transactions with .trigger

Use .trigger to define the storage update interval. A higher value reduces the number of storage transactions....

Last updated: October 26th, 2022 by chetan.kardekar

Structured streaming jobs slow down on every 10th batch

Automatic compaction of the metadata folder can slow down structured streaming jobs....

Last updated: October 28th, 2022 by gopinath.chandrasekaran

Get last modification time for all files in Auto Loader and batch jobs

Define a UDF to list all files in the path and return the last modification time for each one....

Last updated: December 1st, 2022 by DD Sharma

Stream to stream join failure

Avoid using a memory sink when running streaming queries with stream to stream join....

Last updated: January 18th, 2024 by harikrishnan.kunhumveettil

Offset reprocessing issues in streaming queries with a Kafka source

Resolve Kafka offset reprocessing issues in Structured Streaming by using a new checkpoint directory....

Last updated: January 19th, 2024 by harikrishnan.kunhumveettil

Autoloader job fails with a URISyntaxException error due to invalid characters in filenames

When using directory listing mode, you should not process files with colons in the filename....

Last updated: January 19th, 2024 by harikrishnan.kunhumveettil

Auto Loader streaming job failure with schema inference error

To selectively read a specific type of file using Auto Loader, use the pathGlobFilter option....

Last updated: February 29th, 2024 by harikrishnan.kunhumveettil

Auto Loader streaming query failure with unknownFieldException error

Use schema evolution to avoid streaming query failures when new columns are added to your data....

Last updated: February 29th, 2024 by harikrishnan.kunhumveettil

Visualizations (Azure)

How to save Plotly files and display from DBFS

Learn how to save Plotly files and display them from DBFS....

Last updated: May 19th, 2022 by Adam Pavlacka

Python with Apache Spark (Azure)

AttributeError: 'function' object has no attribute

Using protected keywords from the DataFrame API as column names results in a function object has no attribute error message....

Last updated: May 19th, 2022 by noopur.nigam

Convert Python datetime object to string

Display date and time values in a column, as a datetime object, and as a string....

Last updated: May 19th, 2022 by Adam Pavlacka
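For orientation, converting a Python datetime object to a string and back can look like the following. This is a minimal sketch using only the standard library. The sample timestamp and the format string are arbitrary choices for illustration, not taken from the article.

```python
from datetime import datetime

# A fixed sample value so the output is reproducible.
ts = datetime(2022, 5, 19, 14, 30, 15, 123456)

# Format with an explicit pattern; %f keeps microsecond precision.
as_string = ts.strftime("%Y-%m-%d %H:%M:%S.%f")

# ISO 8601 form, with a 'T' between the date and the time.
iso_string = ts.isoformat()

# Parsing with the same pattern recovers the original datetime object.
round_trip = datetime.strptime(as_string, "%Y-%m-%d %H:%M:%S.%f")
```

Using the same pattern for strftime and strptime keeps the conversion lossless, including the microseconds.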

Create a cluster with Conda

Learn how to create a Databricks cluster with Conda....

Last updated: May 19th, 2022 by Adam Pavlacka

Display file and directory timestamp details

Display file creation date and modification date using Python....

Last updated: May 19th, 2022 by rakesh.parija

Install and compile Cython

Learn how to install and compile Cython with Databricks....

Last updated: May 19th, 2022 by Adam Pavlacka

Reading large DBFS-mounted files using Python APIs

Learn how to resolve errors when reading large DBFS-mounted files using Python APIs....

Last updated: May 19th, 2022 by Adam Pavlacka

Use the HDFS API to read files in Python

Learn how to read files directly by using the HDFS API in Python....

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

How to import a custom CA certificate

Learn how to import a custom CA certificate into your Databricks cluster for Python use....

Last updated: February 29th, 2024 by arjun.kaimaparambilrajan

Job remains idle before starting

Apache Spark jobs remain idle for a long time before starting....

Last updated: May 19th, 2022 by ashish

List all workspace objects

List all Databricks workspace objects under a given path....

Last updated: May 19th, 2022 by Adam Pavlacka

Load special characters with Spark-XML

Special characters are not rendering correctly. Use charset with Spark-XML....

Last updated: May 19th, 2022 by annapurna.hiriyur

Python commands fail on high concurrency clusters

Python commands fail on high concurrency clusters with Apache Spark process isolation and shared session enabled. WARN error message....

Last updated: May 19th, 2022 by xin.wang

Cluster cancels Python command execution after installing Bokeh

Learn what to do when your Databricks cluster cancels Python command execution after you install Bokeh....

Last updated: May 19th, 2022 by Adam Pavlacka

Cluster cancels Python command execution due to library conflict

Learn what to do when your Databricks cluster cancels Python command execution due to a library conflict....

Last updated: May 19th, 2022 by Adam Pavlacka

Python command execution fails with AttributeError

Learn what to do when a Python command in your Databricks notebook fails with AttributeError....

Last updated: May 19th, 2022 by Adam Pavlacka

Python REPL fails to start in Docker

Learn how to fix a Python virtualenv error that prevents REPL from starting in a Docker container...

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

How to run SQL queries from Python scripts

Learn how to run SQL queries using Python scripts....

Last updated: May 19th, 2022 by arjun.kaimaparambilrajan

Run C++ code in Python

Learn how to run C++ code in Python....

Last updated: May 19th, 2022 by Adam Pavlacka

Python 2 sunset status

Learn about the sunset status of Python 2 in Databricks....

Last updated: May 19th, 2022 by Adam Pavlacka

Job fails with Java IndexOutOfBoundsException error

When groupby() is used along with applyInPandas, it generates an exception due to an Apache Arrow buffer limitation....

Last updated: December 21st, 2022 by rakesh.parija

Job fails with NoSuchElementException error

NoSuchElementException errors can occur when using Apache Arrow....

Last updated: March 3rd, 2023 by ashish

Job fails with IndexOutOfBoundsException and ArrowBuf errors

When groupby() is used with applyInPandas, it can result in Apache Arrow buffer size estimation errors....

Last updated: March 3rd, 2023 by ashish

Field name sorting changes in Apache Spark 3.x

Starting with Spark 3.0.0, rows created from named arguments do not have field names sorted alphabetically....

Last updated: April 21st, 2023 by sergios.lalas

Job fails with "not enough memory to build the hash map" error

You should use adaptive query execution instead of explicit broadcast hints to perform joins on Databricks Runtime 11.3 LTS and above....

Last updated: May 12th, 2023 by saritha.shivakumar

R with Apache Spark (Azure)

Change version of R (r-base)

Learn how to change the version of R on your Databricks cluster....

Last updated: May 20th, 2022 by Adam Pavlacka

Fix the version of R packages

Learn how to fix the version of R packages....

Last updated: May 20th, 2022 by Adam Pavlacka

How to parallelize R code with gapply

Learn how to parallelize R code using gapply....

Last updated: May 20th, 2022 by Adam Pavlacka

How to parallelize R code with spark.lapply

Learn how to parallelize R code using spark.lapply....

Last updated: May 20th, 2022 by Adam Pavlacka

How to persist and share code in RStudio

Learn how to share notebooks between Databricks and RStudio....

Last updated: May 20th, 2022 by Adam Pavlacka

Install rJava and RJDBC libraries

Learn how to install rJava and RJDBC libraries on your Databricks cluster....

Last updated: December 22nd, 2022 by Adam Pavlacka

Rendering an R markdown file containing sparklyr code fails

Learn how to resolve failures when rendering an R markdown file containing sparklyr....

Last updated: May 20th, 2022 by Adam Pavlacka

Resolving package or namespace loading error

Learn how to resolve package or namespace loading errors in a Databricks notebook....

Last updated: May 20th, 2022 by Adam Pavlacka

RStudio server backend connection error

RStudio server backend connection error occurs if you exceed the maximum number of RBackends on your cluster....

Last updated: May 20th, 2022 by arvind.ravish

Verify R packages installed via init script

Verify that R packages successfully installed via an init script. List all R packages that failed to install....

Last updated: May 20th, 2022 by kavya.parag

Scala with Apache Spark (Azure)

Apache Spark UI is not in sync with job

Status of Spark jobs gets out of sync with the Spark UI when events drop from the event queue before being processed....

Last updated: February 27th, 2023 by chetan.kardekar

Apache Spark job fails with Parquet column cannot be converted error

A Parquet column cannot be converted error appears when you read decimal data in Parquet format and write it to a Delta table....

Last updated: May 20th, 2022 by shanmugavel.chandrakasu

Cannot import timestamp_millis or unix_millis

Cannot use timestamp_millis or unix_millis directly with a DataFrame. You must first use selectExpr() or use SQL commands....

Last updated: May 20th, 2022 by saritha.shivakumar

Cannot modify the value of an Apache Spark config

You cannot modify the value of a Spark config setting within a notebook. It must be set at the cluster level....

Last updated: May 20th, 2022 by Adam Pavlacka

Convert flattened DataFrame to nested JSON

How to convert a flattened DataFrame to nested JSON using a nested case class....

Last updated: May 20th, 2022 by Adam Pavlacka

Convert nested JSON to a flattened DataFrame

How to convert nested JSON to a flattened DataFrame....

Last updated: May 20th, 2022 by Adam Pavlacka

Create a DataFrame from a JSON string or Python dictionary

Create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary....

Last updated: July 1st, 2022 by ram.sankarasubramanian

from_json returns null in Apache Spark 3.0

Spark 3.0 and above cannot parse JSON arrays as structs; from_json returns null....

Last updated: May 23rd, 2022 by shanmugavel.chandrakasu

Intermittent NullPointerException when AQE is enabled

When adaptive query execution (AQE) is enabled and the cluster scales down and loses shuffle data, you can get a `NullPointerException` error....

Last updated: May 23rd, 2022 by mathan.pillai

Manage the size of Delta tables

Recommendations that can help you manage the size of your Delta tables....

Last updated: May 23rd, 2022 by Jose Gonzalez

Trouble reading external JDBC tables after upgrading from Databricks Runtime 5.5

Reading external JDBC tables fails after upgrading from Databricks Runtime 5.5 to 6.0 and above....

Last updated: May 23rd, 2022 by Mohammed.Haseeb

Running C++ code in Scala

Learn how to run C++ code in Scala with this example notebook....

Last updated: May 23rd, 2022 by Adam Pavlacka

Select files using a pattern match

Use a glob pattern match to select specific files in a folder....

Last updated: May 23rd, 2022 by mathan.pillai

Multiple Apache Spark JAR jobs fail when run concurrently

Apache Spark JAR jobs fail with an AnalysisException error when run concurrently....

Last updated: February 28th, 2023 by Adam Pavlacka

Job fails with ExecutorLostFailure due to “Out of memory” error

Resolve executor failures where the root cause is the executor running out of memory....

Last updated: November 7th, 2022 by mathan.pillai

Job fails with ExecutorLostFailure because executor is busy

Resolve executor failures where the root cause is the executor being busy....

Last updated: November 7th, 2022 by mathan.pillai

Understanding speculative execution

Learn how speculative execution works, how to identify it, and when you should use it....

Last updated: November 7th, 2022 by mounika.tarigopula

Use custom classes and objects in a schema

You must define custom classes and objects inside a package if you want to use them in a notebook....

Last updated: November 8th, 2022 by saritha.shivakumar

Job fails with a TimeoutException error

This error is usually caused by a broadcast join that takes too long to complete....

Last updated: March 3rd, 2023 by swetha.nandajan

SQL with Apache Spark (Azure)

Broadcast join exceeds threshold, returns out of memory error

Resolve an Apache Spark OutOfMemorySparkException error that occurs when a table using BroadcastHashJoin exceeds the BroadcastJoinThreshold....

Last updated: May 23rd, 2022 by sandeep.chandran

Cannot grow BufferHolder; exceeds size limitation

Cannot grow BufferHolder by size because the size after growing exceeds limitation; java.lang.IllegalArgumentException error....

Last updated: May 23rd, 2022 by Adam Pavlacka

Date functions only accept int values in Apache Spark 3.0

Date functions only accept int values in Apache Spark 3.0; fractional and string values return AnalysisException error....

Last updated: February 28th, 2023 by Adam Pavlacka

Disable broadcast when query plan has BroadcastNestedLoopJoin

How to disable broadcast when the query plan has BroadcastNestedLoopJoin....

Last updated: May 23rd, 2022 by Adam Pavlacka

Duplicate columns in the metadata error

Spark job fails while processing a Delta table with org.apache.spark.sql.AnalysisException Found duplicate column(s) in the metadata error....

Last updated: May 23rd, 2022 by vikas.yadav

Generate unique increasing numeric values

Use Apache Spark functions to generate unique and increasing numbers in a column of a table, file, or DataFrame....

Last updated: May 23rd, 2022 by ram.sankarasubramanian

Error in SQL statement: AnalysisException: Table or view not found

Learn how to resolve the AnalysisException SQL error "Table or view not found"....

Last updated: May 23rd, 2022 by Adam Pavlacka

Error when downloading full results after join

If you have duplicate columns after a join, you will get an error when trying to download the full results....

Last updated: May 23rd, 2022 by manjunath.swamy

Error when running MSCK REPAIR TABLE in parallel

Do not run `MSCK REPAIR` commands in parallel. It results in a read timed out or out of memory error message....

Last updated: May 23rd, 2022 by ashritha.laxminarayana

Find the size of a table

How to find the size of a table....

Last updated: May 23rd, 2022 by mathan.pillai

Inner join drops records in result

Avoid dropped records when performing an inner join....

Last updated: May 23rd, 2022 by siddharth.panchal

JDBC write fails with a PrimaryKeyViolation error

JDBC write to a SQL database fails with a `PrimaryKeyViolation` error or results in duplicate data....

Last updated: May 24th, 2022 by harikrishnan.kunhumveettil

Query does not skip header row on external table

External Hive tables do not skip the header row when queried from Spark SQL....

Last updated: May 24th, 2022 by manisha.jena

SHOW DATABASES command returns unexpected column name

Running the `SHOW DATABASES` command returns an unexpected column name....

Last updated: May 24th, 2022 by Jose Gonzalez

Cannot view table SerDe properties

SHOW CREATE TABLE only returns the Apache Spark DDL. It does not show the SerDe properties....

Last updated: July 1st, 2022 by saritha.shivakumar

Parsing post meridiem time (PM) with to_timestamp() returns null

When converting 12-hour time to 24-hour time with to_timestamp() the hours variable must be lowercase....

Last updated: July 22nd, 2022 by chetan.kardekar

to_json() results in Cannot use null as map key error

You must filter or replace null values in your input data before using to_json()....

Last updated: July 22nd, 2022 by gopal.goel

Set nullability when using SaveAsTable with Delta tables

Learn how to create a Delta table with the nullability of columns set to false....

Last updated: October 14th, 2022 by anshuman.sahu

Ensure consistency in statistics functions between Spark 3.0 and Spark 3.1 and above

Statistics functions in Databricks Runtime 7.3 LTS and below return NaN when a divide by zero occurs. Set a Spark config to return null instead....

Last updated: October 14th, 2022 by chetan.kardekar

Using datetime values in Spark 3.0 and above

How to correctly use datetime functions in Spark SQL with Databricks Runtime 7.3 LTS and above....

Last updated: October 26th, 2022 by deepak.bhutada

ANSI compliant DECIMAL precision and scale

Learn how to enable ANSI compliant error messages when incorrect values are used for DECIMAL precision and scale....

Last updated: October 29th, 2022 by saritha.shivakumar

Recreate LISTAGG functionality with Spark SQL

Use collect_list and concat_ws in Spark SQL to achieve the same functionality as LISTAGG on other platforms....

Last updated: February 24th, 2023 by manjunath.swamy

Decreased performance when using DELETE with a subquery on Databricks Runtime 10.4 LTS

Auto optimize should be disabled when you have a DELETE with a subquery where one side is small enough to be broadcast....

Last updated: April 21st, 2023 by sergios.lalas

Terraform (Azure)

Error when creating a user, group, or service principal at the account level with Terraform

You must include your account_id in the Terraform Databricks provider block to manage users, groups, and service principals....

Last updated: October 28th, 2022 by John.Lourdu

Unity Catalog (Azure)

Delta Live Tables (Azure)

Cannot select a Databricks Runtime version when using a Delta Live Tables pipeline

Delta Live Tables does not allow you to directly configure the Databricks Runtime version....

Last updated: April 20th, 2023 by Jose Gonzalez

Delta Live Tables job fails when using collect()

You should not use functions such as collect(), count(), toPandas(), save(), and saveAsTable() within the table and view function definitions....

Last updated: May 10th, 2023 by Jose Gonzalez
