How to Handle Corrupted Parquet Files with Different Schemas
Let’s say you have a large list of essentially independent Parquet files, with a variety of different schemas. You want to read only those files that match a specific schema and skip the files that don’t match.
One solution is to read the files one at a time, identify each file's schema, and union the matching DataFrames together. However, this approach is impractical when there are hundreds of thousands of files.
Set the Apache Spark property spark.sql.files.ignoreCorruptFiles to true and then read the files with the desired schema. Files that don't match the specified schema are ignored. The resulting dataset contains only data from the files that match the specified schema.
Set the Spark property in your notebook or job with spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true"). Alternatively, you can set this property in your cluster's Spark configuration so it applies to every session.
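If you prefer the cluster-wide route, the equivalent entry in a Spark configuration file (for example, spark-defaults.conf) is a single line:

```
spark.sql.files.ignoreCorruptFiles true
```

Note that this setting also silently skips genuinely corrupted files across all reads in the session, so enable it only when that behavior is acceptable.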