Spark 1.6.1 - how to skip corrupted parquet blocks

Explorer

How can I skip corrupted blocks in a Parquet file without getting an exception?

Also, how can I ignore a corrupted footer without crashing Spark?

6 REPLIES

Master Guru

Turn off Filter Pushdown

https://issues.apache.org/jira/browse/SPARK-11153
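
In the Spark 1.6 shell (where sqlContext is predefined) that is a one-liner:

// Disable Parquet filter pushdown (see SPARK-11153) before reading.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")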

Can you read those files with anything? If so, I would write another copy in HDFS as ORC.
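
For example, a minimal sketch, assuming the data is still readable and that sqlContext is a HiveContext (which the ORC source in Spark 1.6 requires); the paths are illustrative:

// Read the salvageable Parquet data and keep a second copy as ORC.
val salvaged = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet")
salvaged.write.orc("/data/orc_backup/data1") // illustrative target path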

If the file is too corrupt, it is lost.

Error recovery

If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.

Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for RC or Avro files using sync markers, a reader could recover partially written files.

https://parquet.apache.org/documentation/latest/
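
On the "smaller row groups" point, a hedged sketch of shrinking the row-group size at write time via the parquet.block.size Hadoop setting (the 16 MB value and output path are illustrative, and df stands for whatever DataFrame is being written):

// Smaller row groups limit how much data one corrupt block can take out.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
df.write.parquet("/data/tempparquetdata_smallgroups") // illustrative path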

Explorer

@Timothy Spann

Hi Timothy,

Thanks for the quick response. So, a Parquet file's footer is corrupted. I am reading multiple files from one directory using Spark SQL. In that directory, one file's footer is corrupted, and so Spark crashes. Is there any way to just ignore the corrupted blocks and read the other files as they are? I switched off filter pushdown using sqlContext.setConf("spark.sql.parquet.filterPushdown", "false").

Code used to read multiple files (here, /data/tempparquetdata/br.1455148800.0 is corrupted):

val newDataDF = sqlContext.read.parquet(
  "/data/tempparquetdata/data1.parquet",
  "/data/tempparquetdata/data2.parquet",
  "/data/tempparquetdata/br.1455148800.0")

newDataDF.show throws the exception "java.lang.RuntimeException: hdfs://CRUX2-SETUP:9000/data/tempparquetdata/br.1455148800.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 52, 24, 10]"
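
If Spark itself won't skip such files, here is a hedged workaround sketch for the spark-shell (sc and sqlContext predefined): pre-filter the paths by checking the same 4-byte "PAR1" magic the exception complains about, so files with corrupt footers never reach the reader. The helper name is made up for illustration, and this only catches bad footers, not other kinds of corruption.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Returns true only if the file ends with the Parquet magic "PAR1"
// (bytes [80, 65, 82, 49]) -- the check the exception above failed.
def hasParquetMagic(fs: FileSystem, path: Path): Boolean = {
  val len = fs.getFileStatus(path).getLen
  if (len < 4) false
  else {
    val in = fs.open(path)
    try {
      in.seek(len - 4)
      val tail = new Array[Byte](4)
      in.readFully(tail)
      java.util.Arrays.equals(tail, "PAR1".getBytes(StandardCharsets.US_ASCII))
    } finally in.close()
  }
}

val fs = FileSystem.get(sc.hadoopConfiguration)
val candidates = Seq(
  "/data/tempparquetdata/data1.parquet",
  "/data/tempparquetdata/data2.parquet",
  "/data/tempparquetdata/br.1455148800.0")
// Keep only the paths whose footer magic checks out.
val good = candidates.filter(p => hasParquetMagic(fs, new Path(p)))
val newDataDF = sqlContext.read.parquet(good: _*)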

Master Guru

Can you copy that file elsewhere and then delete it? Can you then rebuild the Parquet directory?

Did you run fsck on that directory?

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck

See here: https://dzone.com/articles/hdfs-cheat-sheet

I wonder if it's because you are trying to read files and not the entire directory.

Can you read the entire directory (or copy the files to another directory), /data/tempparquetdata?
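
For example, a one-line sketch with the directory above:

// Point the reader at the directory rather than individual files.
val dirDF = sqlContext.read.parquet("/data/tempparquetdata")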

https://issues.apache.org/jira/browse/SPARK-3138

Explorer

Hi,

I tried reading the whole directory as well. No luck!

I don't want to delete, move, or identify such files; I just want to skip/ignore them while reading with Spark SQL.

Thanks

Master Guru

You can move the bad files out of the directory.

Master Guru

You may need to open a JIRA with spark.apache.org or Parquet; it seems to be an issue in one of them.