How to handle corrupted Parquet files with different schemas

Problem

Let's say you have a large list of essentially independent Parquet files with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don't match.

One solution could be to read the files in sequence, identify the schema, and union the DataFrames together, as sketched below. However, this approach is impractical when there are hundreds of thousands of files.
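The following is a minimal sketch of that sequential approach, assuming a notebook-style environment where spark is an existing SparkSession; the paths and the file_paths list are illustrative assumptions, not part of the original article.

# Hypothetical example: read each file, compare its schema to a reference
# schema, and union the matching DataFrames. This works, but scales poorly
# when there are hundreds of thousands of files.
from functools import reduce
from pyspark.sql import DataFrame

# Reference schema taken from one known-good file (illustrative path).
target_schema = spark.read.parquet("/mnt/data/parquet/reference.parquet").schema

# file_paths is assumed to be a pre-built Python list of Parquet file paths.
matching_dfs = []
for path in file_paths:
    df = spark.read.parquet(path)
    if df.schema == target_schema:
        matching_dfs.append(df)

# Union all matching DataFrames into a single DataFrame.
combined = reduce(DataFrame.unionByName, matching_dfs)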

Solution

Set the Apache Spark property spark.sql.files.ignoreCorruptFiles to true and then read the files with the desired schema. Files that don't match the specified schema are ignored. The resulting dataset contains only data from those files that match the specified schema.

Set the Spark property using spark.conf.set:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
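After the property is set, you read the files with the desired schema. The sketch below assumes an existing SparkSession named spark; the column names and the path /mnt/data/parquet/ are illustrative assumptions.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical target schema; replace with the schema you actually want to keep.
desired_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# With ignoreCorruptFiles enabled, files that don't match the specified
# schema are skipped instead of failing the read.
df = spark.read.schema(desired_schema).parquet("/mnt/data/parquet/")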

Alternatively, you can set this property in your Spark configuration.
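For example, if you build the SparkSession yourself rather than using a pre-configured cluster, the property can be supplied at session creation time. This is a minimal sketch; the application name is an illustrative assumption.

from pyspark.sql import SparkSession

# Set spark.sql.files.ignoreCorruptFiles as part of the session configuration.
spark = (
    SparkSession.builder
    .appName("read-matching-parquet")
    .config("spark.sql.files.ignoreCorruptFiles", "true")
    .getOrCreate()
)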