Created 12-29-2016 07:56 AM
How can I skip a corrupted block in a Parquet file without getting an exception?
Also, how can I ignore a corrupted footer without crashing Spark?
Created 12-29-2016 08:04 PM
Turn off Filter Pushdown
https://issues.apache.org/jira/browse/SPARK-11153
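For reference, that setting can be flipped on the SQLContext before reading (a minimal sketch in Spark 1.x style, matching the code later in the thread; the path is just an example):

// Disable Parquet filter pushdown (see SPARK-11153) before reading.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
val df = sqlContext.read.parquet("/data/tempparquetdata")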
Can you read those files with anything? If so, I would write another copy in HDFS as ORC.
If the file is too corrupt, it is lost.
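If it is still readable, the ORC copy could look roughly like this (a sketch only; in Spark 1.x writing ORC needs a HiveContext, and the output path here is hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Read whatever is still readable and write a second copy as ORC.
val df = hiveContext.read.parquet("/data/tempparquetdata")
df.write.format("orc").save("/data/temporcdata")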
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for RC or Avro files with sync markers, a reader could recover partially written files.
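If you do rewrite the data, the row group size is controlled by the parquet-mr property parquet.block.size (in bytes). A rough sketch of writing a copy with smaller row groups, assuming the setting is picked up from the Hadoop configuration; the 16 MB value and the paths are illustrative only:

// Smaller row groups limit how much data one corruption event can take out,
// at some cost in compression and scan efficiency. The default is 128 MB.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
val df = sqlContext.read.parquet("/data/tempparquetdata")
df.write.parquet("/data/tempparquetdata_smallrowgroups")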
Created 12-30-2016 02:01 AM
Hi Timothy,
Thanks for the quick response. The Parquet file's footer is corrupted. I am reading multiple files from one directory using Spark SQL.
In that directory one file's footer is corrupted, and so Spark crashes. Is there any way to just ignore the corrupted blocks and read the other
files as they are? I have already switched off filter pushdown with sqlContext.setConf("spark.sql.parquet.filterPushdown","false").
Code used to read multiple files (here, /data/tempparquetdata/br.1455148800.0 is the corrupted file):
val newDataDF = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet","/data/tempparquetdata/data2.parquet","/data/tempparquetdata/br.1455148800.0")
newDataDF.show throws the exception: java.lang.RuntimeException: hdfs://CRUX2-SETUP:9000/data/tempparquetdata/br.1455148800.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 52, 24, 10]
Created 12-30-2016 03:23 AM
Can you copy that file elsewhere and then delete it, so you can rebuild the Parquet directory?
Did you run fsck on that directory?
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck
See here: https://dzone.com/articles/hdfs-cheat-sheet
I wonder if it's because you are trying to read individual files and not the entire directory.
Can you read the entire directory /data/tempparquetdata (or copy it to another directory)?
Created 01-01-2017 04:29 AM
Hi,
I tried reading the whole directory too. No luck!
I don't want to delete, move, or identify such files. I just want to skip/ignore such files while reading with Spark SQL.
Thanks
Created 01-01-2017 03:22 PM
You can move the bad files out of the directory.
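If identifying the bad files by hand is the sticking point, one way is to check each file's tail for the PAR1 magic bytes (the check behind the error message above) and move anything that fails into a quarantine directory before reading. This is a rough sketch only; the quarantine path is hypothetical and it assumes a flat directory of files:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val dataDir = new Path("/data/tempparquetdata")
val quarantine = new Path("/data/badparquet")   // hypothetical quarantine directory
fs.mkdirs(quarantine)

val magic = "PAR1".getBytes("US-ASCII")         // a valid Parquet file ends with these 4 bytes

fs.listStatus(dataDir).filter(_.isFile).foreach { status =>
  val len = status.getLen
  val tail = new Array[Byte](4)
  val footerOk = len >= 4 && {
    val in = fs.open(status.getPath)
    try { in.readFully(len - 4, tail) } finally { in.close() }
    java.util.Arrays.equals(tail, magic)
  }
  if (!footerOk) {
    // Footer missing or corrupt: move the file aside so the directory read succeeds.
    fs.rename(status.getPath, new Path(quarantine, status.getPath.getName))
  }
}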
Created 01-01-2017 03:29 PM
You may need to open a JIRA with spark.apache.org or Parquet. It seems to be an issue in one of them.