Failure to detect encoding in JSON

Learn how to resolve a failure to detect the encoding of input JSON files when using a byte order mark (BOM) with Databricks.

Last published at: June 1st, 2022

Problem

A Spark job fails with an exception containing the message:

Invalid UTF-32 character 0x1414141(above 10ffff)  at char #1, byte #7)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse

Cause

The JSON data source reader can automatically detect the encoding of input JSON files by using the byte order mark (BOM) at the beginning of each file.
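To make BOM-based detection concrete, the following is a minimal sketch of how a leading BOM can be mapped to a charset name. It is an illustration only, not the code Spark or Jackson actually uses; `BomSniffer` and `detect` are hypothetical names introduced for this example.

```scala
// Hypothetical helper (not part of Spark): maps a leading BOM to a charset name.
object BomSniffer {
  // Longer BOMs are listed first, because the UTF-32LE BOM (FF FE 00 00)
  // begins with the same two bytes as the UTF-16LE BOM (FF FE).
  private val boms: Seq[(Array[Int], String)] = Seq(
    Array(0x00, 0x00, 0xFE, 0xFF) -> "UTF-32BE",
    Array(0xFF, 0xFE, 0x00, 0x00) -> "UTF-32LE",
    Array(0xEF, 0xBB, 0xBF)       -> "UTF-8",
    Array(0xFE, 0xFF)             -> "UTF-16BE",
    Array(0xFF, 0xFE)             -> "UTF-16LE"
  )

  // True when `bytes` begins with every byte of `bom`, compared as unsigned values.
  private def startsWith(bytes: Array[Byte], bom: Array[Int]): Boolean =
    bytes.length >= bom.length && bom.indices.forall(i => (bytes(i) & 0xFF) == bom(i))

  // Returns the charset name of the first matching BOM, or None if no BOM is present.
  def detect(bytes: Array[Byte]): Option[String] =
    boms.collectFirst { case (bom, name) if startsWith(bytes, bom) => name }
}
```

With this sketch, input that starts with the bytes FF FE followed by UTF-16LE text yields `Some("UTF-16LE")`, while input with no BOM yields `None`, in which case a reader has to fall back to guessing the charset.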

However, a BOM is not required by the Unicode standard and is prohibited by RFC 7159.

Specifically, section 8.1 of RFC 7159 states, "Implementations MUST NOT add a byte order mark to the beginning of a JSON text."

As a consequence, many JSON files carry no BOM, so Spark cannot always detect the charset correctly and may fail to read the file.
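As an illustration of how BOM-less files arise, the sketch below encodes the same JSON text with two of the JVM's standard charsets and prints the resulting bytes as hex. `BomlessDemo` and its members are hypothetical names used only for this example.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical demo object: shows which JVM encoders emit a BOM.
object BomlessDemo {
  // Formats bytes as space-separated, two-digit uppercase hex values.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString(" ")

  val json = "{}"

  // The generic UTF-16 encoder writes a big-endian BOM (FE FF) before the text.
  val utf16: String = hex(json.getBytes(StandardCharsets.UTF_16))

  // The explicit little-endian encoder writes no BOM at all.
  val utf16le: String = hex(json.getBytes(StandardCharsets.UTF_16LE))
}
```

Here `BomlessDemo.utf16` is `FE FF 00 7B 00 7D`, whereas `BomlessDemo.utf16le` is `7B 00 7D 00`. A file written the second way gives a reader no marker to identify the charset.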

Solution

To solve the issue, disable the charset auto-detection mechanism and explicitly set the charset using the encoding option:

%scala

.option("encoding", "UTF-16LE")
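In context, a full read might look like the following sketch. It assumes a Databricks Scala notebook, where `spark` is predefined, and a file that contains a single JSON document encoded as UTF-16LE. The path is a hypothetical placeholder, and the `multiLine` setting is an assumption about the file layout rather than part of the original fix.

```scala
// Sketch for a Databricks Scala notebook, where `spark` (a SparkSession) is predefined.
// The path below is a hypothetical placeholder; replace it with your file's location.
val path = "/tmp/example-utf16le.json"

val df = spark.read
  .option("multiLine", "true")     // assumption: each file holds one JSON document
  .option("encoding", "UTF-16LE")  // set the charset explicitly instead of relying on detection
  .json(path)

df.show()
```

For line-delimited JSON in an encoding other than UTF-8, Spark may also require the `lineSep` option to be set explicitly, for example `.option("lineSep", "\n")`, so check the behavior of your Spark version.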