Failure to detect encoding in JSON

Problem

Spark job fails with an exception containing the following message:

Invalid UTF-32 character 0x1414141(above 10ffff)  at char #1, byte #7)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse

Cause

The JSON data source reader can automatically detect the encoding of input JSON files from a byte order mark (BOM) at the beginning of the files. However, a BOM is not required by the Unicode standard, and RFC 7159 prohibits it in JSON texts (section 8.1):

"...实现不能将字节顺序标记添加到 JSON 文本的开头。“…Implementations MUST NOT add a byte order mark to the beginning of a JSON text.”

As a consequence, in some cases Spark cannot detect the charset correctly and fails to read the JSON file.
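
To illustrate the failure mode, here is a hedged reproduction sketch, assuming a local Spark session; the path and sample record are placeholders, not taken from the article. A UTF-16LE file written without a BOM, exactly as the RFC requires, can trip the auto-detection:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Write one JSON record in UTF-16LE *without* a BOM, as RFC 7159 requires.
val path = "/tmp/no-bom.json"
Files.write(Paths.get(path), "{\"id\": 1}\n".getBytes(StandardCharsets.UTF_16LE))

// With default options Spark tries to auto-detect the charset. Without a BOM
// the detection can guess wrong, so this read may fail with an exception like
// the one above or return corrupted rows, depending on the data.
spark.read.json(path).show()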

Solution

To solve the issue, disable the charset auto-detection mechanism and set the charset explicitly with the encoding option:

.option("encoding", "UTF-16LE")
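
For context, a minimal sketch of the full read, assuming Scala and a placeholder input path. The lineSep option is included as well, since line-delimited JSON in a non-UTF-8 charset may also need the line separator spelled out explicitly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .option("encoding", "UTF-16LE") // disable auto-detection; set the charset explicitly
  .option("lineSep", "\n")        // optional: explicit line separator for line-delimited JSON
  .json("/path/to/file.json")

df.show()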