Apache Spark job fails with maxResultSize exception

Problem

A Spark job fails with a maxResultSize exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized
results of XXXX tasks (X.0 GB) is bigger than spark.driver.maxResultSize (X.0 GB)

Cause

This error occurs because the configured size limit was exceeded. The size limit applies to the total serialized results of Spark actions across all partitions. Such actions include collect() to the driver node, toPandas(), and saving a large file to the driver's local file system.

Solution

In some situations, you might have to refactor the code to prevent the driver node from collecting a large amount of data. You can change the code so that the driver node collects only a limited amount of data, or you can increase the driver instance memory size. For example, you can call toPandas with Arrow enabled, or write the data to files and read those files back instead of collecting large amounts of data on the driver, as shown in the sketch below.
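A minimal PySpark sketch of both approaches; the DataFrame df, the row counts, and the output path are placeholders for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).toDF("id")  # stand-in for your DataFrame

# Option 1: enable Arrow so toPandas() serializes results more efficiently.
# The data still lands on the driver, so keep the collected result bounded.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = df.limit(10_000).toPandas()

# Option 2: write the results to storage and read them back instead of
# collecting them to the driver.
df.write.mode("overwrite").parquet("/tmp/example_output")
df_read = spark.read.parquet("/tmp/example_output")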

If absolutely necessary, you can set the property spark.driver.maxResultSize to a value <X>g that is higher than the value reported in the exception message, in the cluster Spark configuration:

spark.driver.maxResultSize <X>g
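If you launch the application yourself instead of using the cluster Spark configuration, the same property can also be set when the SparkSession is created. A minimal sketch, where the application name and the 6g value are examples only:

from pyspark.sql import SparkSession

# Set the limit at session creation time. Choose a value higher than the
# size reported in the exception message.
spark = (
    SparkSession.builder
    .appName("my-job")
    .config("spark.driver.maxResultSize", "6g")
    .getOrCreate()
)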

The default value is 4g. For details, see Application Properties.

If you set a high limit, out-of-memory errors can occur in the driver (depending on spark.driver.memory and the memory overhead of objects in the JVM). Set an appropriate limit to prevent out-of-memory errors.