Created 12-29-2016 07:56 AM
How can I skip a corrupted block in a Parquet file without getting an exception?
Also, how can I ignore a corrupted footer without crashing Spark?
Created 12-29-2016 08:04 PM
Turn off Filter Pushdown
https://issues.apache.org/jira/browse/SPARK-11153
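For reference, that setting can be flipped on the SQLContext before reading (a minimal sketch in Spark 1.x style, matching the code later in the thread; the path is just an example):

// Disable Parquet filter pushdown (see SPARK-11153) before reading.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
val df = sqlContext.read.parquet("/data/tempparquetdata")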
Can you read those files with anything? If so, I would write another copy in HDFS as ORC.
If the file is too corrupt, it is lost.
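If it is still readable, the ORC copy could look roughly like this (a sketch only; in Spark 1.x writing ORC needs a HiveContext, and the output path here is hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Read whatever is still readable and write a second copy as ORC.
val df = hiveContext.read.parquet("/data/tempparquetdata")
df.write.format("orc").save("/data/temporcdata")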
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for RC or Avro files with sync markers, a reader could recover partially written files.
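If you do rewrite the data, the row group size is controlled by the parquet-mr property parquet.block.size (in bytes). A rough sketch of writing a copy with smaller row groups, assuming the setting is picked up from the Hadoop configuration; the 16 MB value and the paths are illustrative only:

// Smaller row groups limit how much data one corruption event can take out,
// at some cost in compression and scan efficiency. The default is 128 MB.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
val df = sqlContext.read.parquet("/data/tempparquetdata")
df.write.parquet("/data/tempparquetdata_smallrowgroups")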
Created 12-30-2016 02:01 AM
Hi Timothy,
Thanks for the quick response. The Parquet file's footer is corrupted. I am reading multiple files from one directory using Spark SQL.
In that directory one file's footer is corrupted, and so Spark crashes. Is there any way to just ignore the corrupted blocks and read the other
files as they are? I have already switched off filter pushdown with sqlContext.setConf("spark.sql.parquet.filterPushdown","false").
Code used to read multiple files (here, /data/tempparquetdata/br.1455148800.0 is the corrupted file):
val newDataDF = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet","/data/tempparquetdata/data2.parquet","/data/tempparquetdata/br.1455148800.0")
newDataDF.show throws the exception: java.lang.RuntimeException: hdfs://CRUX2-SETUP:9000/data/tempparquetdata/br.1455148800.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 52, 24, 10]
Created 12-30-2016 03:23 AM
Can you copy that file elsewhere and then delete it, so you can rebuild the Parquet directory?
Did you run fsck on that directory?
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck
See here: https://dzone.com/articles/hdfs-cheat-sheet
I wonder if it's because you are trying to read individual files and not the entire directory.
Can you read the entire directory /data/tempparquetdata (or copy it to another directory)?
Created 01-01-2017 04:29 AM
Hi,
I tried reading the whole directory too. No luck!
I don't want to delete, move, or identify such files. I just want to skip/ignore such files while reading with Spark SQL.
Thanks
Created 01-01-2017 03:22 PM
You can move the bad files out of the directory.
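If identifying the bad files by hand is the sticking point, one way is to check each file's tail for the PAR1 magic bytes (the check behind the error message above) and move anything that fails into a quarantine directory before reading. This is a rough sketch only; the quarantine path is hypothetical and it assumes a flat directory of files:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val dataDir = new Path("/data/tempparquetdata")
val quarantine = new Path("/data/badparquet")   // hypothetical quarantine directory
fs.mkdirs(quarantine)

val magic = "PAR1".getBytes("US-ASCII")         // a valid Parquet file ends with these 4 bytes

fs.listStatus(dataDir).filter(_.isFile).foreach { status =>
  val len = status.getLen
  val tail = new Array[Byte](4)
  val footerOk = len >= 4 && {
    val in = fs.open(status.getPath)
    try { in.readFully(len - 4, tail) } finally { in.close() }
    java.util.Arrays.equals(tail, magic)
  }
  if (!footerOk) {
    // Footer missing or corrupt: move the file aside so the directory read succeeds.
    fs.rename(status.getPath, new Path(quarantine, status.getPath.getName))
  }
}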
Created 01-01-2017 03:29 PM
You may need to open a JIRA with spark.apache.org or Parquet. It seems to be an issue in one of them.