Spark 1.6.1 - how to skip corrupted parquet blocks
Labels: Apache Spark
Created 12-29-2016 07:56 AM
How can I skip corrupted blocks in Parquet files without getting an exception?
Also, how can I ignore a corrupted footer without Spark crashing?
Created 12-29-2016 08:04 PM
Turn off Filter Pushdown
https://issues.apache.org/jira/browse/SPARK-11153
Can you read those files with anything? If so, I would write another copy in HDFS as ORC.
If the file is too corrupt, it is lost.
Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
Potential extension: With smaller row groups, the biggest issue is placing the file metadata at the end. If an error happens while writing the file metadata, all the data written will be unreadable. This can be fixed by writing the file metadata every Nth row group. Each file metadata would be cumulative and include all the row groups written so far. Combining this with the strategy used for rc or avro files using sync markers, a reader could recover partially written files.
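If the remaining files can still be read once filter pushdown is off, a minimal sketch of the re-copy suggested above might look like the following (it assumes a HiveContext, since the ORC writer in Spark 1.6 is provided by spark-hive, and the output path is a placeholder):
// Disable Parquet filter pushdown (see the SPARK-11153 link above).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
// Read whatever is still readable (example paths from this thread).
val readable = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet", "/data/tempparquetdata/data2.parquet")
// Write another copy in HDFS as ORC (placeholder output path).
readable.write.format("orc").save("/data/tempparquetdata_orc")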
Created 12-30-2016 02:01 AM
Hi Timothy,
Thanks for the quick response. So, the parquet file's footer is corrupted. I am reading multiple files from one directory using Spark SQL. In that directory one file's footer is corrupted, and so Spark crashes. Is there any way to just ignore the corrupted blocks and read the other files as they are? I switched off filter pushdown using sqlContext.setConf("spark.sql.parquet.filterPushdown", "false").
Code used to read multiple files (here, /data/tempparquetdata/br.1455148800.0 is corrupted):
val newDataDF = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet", "/data/tempparquetdata/data2.parquet", "/data/tempparquetdata/br.1455148800.0")
newDataDF.show throws the exception: java.lang.RuntimeException: hdfs://CRUX2-SETUP:9000/data/tempparquetdata/br.1455148800.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 52, 24, 10]
Created 12-30-2016 03:23 AM
Can you copy that file elsewhere and then delete it, so you can rebuild the parquet directory?
Did you run fsck on that directory?
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck
See here: https://dzone.com/articles/hdfs-cheat-sheet
I wonder if it's because you are trying to read individual files and not the entire directory.
Can you read the entire directory (or copy it to another directory), /data/tempparquetdata?
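For reference, the fsck check linked above can be run directly against the directory from this thread. Note that fsck reports HDFS-level block health only; a file can be perfectly healthy in HDFS and still have an invalid Parquet footer.
hdfs fsck /data/tempparquetdata -files -blocks -locations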
Created 01-01-2017 04:29 AM
Hi,
I tried reading the whole directory as well. No luck!
I don't want to delete, move, or identify such files. I just want to skip/ignore them while reading with Spark SQL.
Thanks
Created 01-01-2017 03:22 PM
You can move the bad files out of the directory.
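If identifying the bad files by hand is the sticking point, a helper along these lines could do it automatically. This is only a sketch: it assumes a valid Parquet file ends with the "PAR1" magic bytes (the [80, 65, 82, 49] from the exception above), and the quarantine path is a placeholder.
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val magic = "PAR1".getBytes("US-ASCII") // [80, 65, 82, 49]

// Split the directory listing into files that end with the Parquet magic number and files that do not.
val (good, bad) = fs.listStatus(new Path("/data/tempparquetdata")).filter(_.isFile).partition { status =>
  val len = status.getLen
  if (len < 4) false
  else {
    val in = fs.open(status.getPath)
    try {
      val tail = new Array[Byte](4)
      in.seek(len - 4)
      in.readFully(tail)
      tail.sameElements(magic)
    } finally in.close()
  }
}

// Move the suspect files into a quarantine directory (placeholder path) ...
fs.mkdirs(new Path("/data/quarantine"))
bad.foreach(s => fs.rename(s.getPath, new Path("/data/quarantine", s.getPath.getName)))

// ... and read only the files that passed the check.
val newDataDF = sqlContext.read.parquet(good.map(_.getPath.toString): _*)
If moving files is not acceptable in your pipeline, the same check can be used just to build the list of good paths passed to read.parquet, leaving the directory untouched.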
Created 01-01-2017 03:29 PM
You may need to open a JIRA with spark.apache.org or parquet; this seems to be an issue in one of them.
