Apache Spark job fails with maxResultSize exception

Problem

A Spark job fails with a maxResultSize exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized
results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)

Cause

This error occurs because the configured size limit was exceeded. The limit applies to the total size of the serialized results for Spark actions across all partitions. Such actions include collect() to the driver node, toPandas(), and saving a large file to the driver's local file system.
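For illustration, here is a pattern that commonly raises this exception, assuming a hypothetical large DataFrame named df:

# Pulls every row of every partition back to the driver. If the total
# serialized size exceeds spark.driver.maxResultSize, the job aborts
# with the exception shown above.
rows = df.collect()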

Solution

In some situations, you might have to refactor the code to prevent the driver node from collecting a large amount of data. You can change the code so that the driver node collects only a limited amount of data, or increase the driver instance memory size. For example, you can call toPandas with Arrow enabled, or write the data to files and then read those files back, instead of collecting large amounts of data to the driver.
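A minimal sketch of both approaches, assuming a Spark 3.x session named spark and a large DataFrame df; the table name and output path are placeholders, and older Spark versions use spark.sql.execution.arrow.enabled instead of the pyspark config key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_table")  # placeholder table name

# Approach 1: enable Arrow-based transfers before calling toPandas().
# The data is still collected to the driver, but with much lower
# serialization overhead.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.toPandas()

# Approach 2: avoid collecting to the driver entirely. Write the result
# to storage and read those files back where they are needed.
df.write.mode("overwrite").parquet("/tmp/my_result")  # placeholder path
result = spark.read.parquet("/tmp/my_result")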

If absolutely necessary, you can set the spark.driver.maxResultSize property in the cluster Spark configuration to a value <X>g that is higher than the value reported in the exception message:

spark.driver.maxResultSize <X>g
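If you create the session yourself rather than editing the cluster Spark configuration, the same property can be passed when the SparkSession is built; a minimal sketch, where the 8g value is purely illustrative:

from pyspark.sql import SparkSession

# spark.driver.maxResultSize cannot be changed on a running session,
# so pass it at session creation (or via spark-submit --conf).
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "8g")  # illustrative value
    .getOrCreate()
)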

The default value is 4g. For details, see Application Properties.

If you set a high limit, out-of-memory errors can occur in the driver, depending on spark.driver.memory and the memory overhead of objects in the JVM. Set an appropriate limit to prevent out-of-memory errors.
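For example, a configuration that keeps the result-size limit well below the driver memory might look like this; both values are hypothetical and depend on your workload:

spark.driver.memory 16g
spark.driver.maxResultSize 8g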