Failure to detect encoding in JSON

Problem

Spark job fails with an exception containing the message:

Invalid UTF-32 character 0x1414141(above 10ffff) at char #1, byte #7)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse

Cause

The JSON data source reader can automatically detect the encoding of input JSON files from a byte order mark (BOM) at the beginning of the files. However, a BOM is not mandated by the Unicode standard, and RFC 7159 prohibits it. For example, section 8.1 states:

"...執行不能在 JSON 文字開頭加上位元組順序標記。」“…Implementations MUST NOT add a byte order mark to the beginning of a JSON text.”

As a consequence, in some cases Spark cannot detect the charset correctly and fails to read the JSON file.
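
As an illustration, the following minimal sketch (the file path is a hypothetical example) produces a JSON file encoded as UTF-16LE without a BOM. Java's UTF_16LE charset never emits a BOM, consistent with RFC 7159, so the resulting file gives Spark's BOM-based auto-detection nothing to key on:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// A valid JSON text encoded as UTF-16LE. The UTF_16LE charset writes
// no BOM, so the file cannot be auto-detected by its leading bytes.
val json = """{"id": 1, "name": "test"}"""
Files.write(Paths.get("/tmp/no-bom.json"), json.getBytes(StandardCharsets.UTF_16LE))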

Solution

To resolve the issue, disable the charset auto-detection mechanism and explicitly set the charset with the encoding option:

.option("encoding", "UTF-16LE")