<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark 1.6.1 - how to skip corrupted parquet blocks in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-1-6-1-how-to-skip-corrupted-parquet-blocks/m-p/103536#M66453</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/9304/tspann.html" nodeid="9304"&gt;@Timothy Spann&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Hi Timothy,&lt;/P&gt;&lt;P&gt;Thanks for the quick response. So the Parquet file's footer is corrupted. I am reading multiple files from one directory using SparkSQL. In that directory one file's footer is corrupted, and so Spark crashes. Is there any way to ignore just the corrupted blocks and read the other files as they are? I have already switched off filter pushdown with sqlContext.setConf("spark.sql.parquet.filterPushdown", "false").&lt;/P&gt;&lt;P&gt;Code used to read the multiple files (here, /data/tempparquetdata/br.1455148800.0 is the corrupted file):&lt;/P&gt;&lt;P&gt;val newDataDF = sqlContext.read.parquet("/data/tempparquetdata/data1.parquet", "/data/tempparquetdata/data2.parquet", "/data/tempparquetdata/br.1455148800.0")&lt;/P&gt;&lt;P&gt;newDataDF.show throws the exception "java.lang.RuntimeException: hdfs://CRUX2-SETUP:9000/data/tempparquetdata/br.1455148800.0 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 52, 24, 10]"&lt;/P&gt;</description>
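Spark 1.6.1 has no built-in switch to skip unreadable input files (the `spark.sql.files.ignoreCorruptFiles` option arrived in later 2.x releases). A minimal workaround sketch, assuming the bad files fail exactly the magic-number check the exception reports: pre-filter the candidate paths to those whose last four bytes are the Parquet magic "PAR1" (bytes [80, 65, 82, 49]) before handing them to `sqlContext.read.parquet`. `hasParquetMagic` is a hypothetical helper written here for illustration, not a Spark or Parquet API.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: a valid Parquet file ends with the 4-byte magic "PAR1".
// Files whose tail does not match are skipped before Spark ever opens them.
def hasParquetMagic(fs: FileSystem, path: Path): Boolean = {
  val len = fs.getFileStatus(path).getLen
  if (len < 12) return false  // too short to hold header magic + footer + tail magic
  val in = fs.open(path)
  try {
    val tail = new Array[Byte](4)
    in.readFully(len - 4, tail)  // positioned read of the last 4 bytes
    tail.sameElements("PAR1".getBytes("US-ASCII"))
  } finally {
    in.close()
  }
}

val fs = FileSystem.get(new Configuration())
val candidates = Seq(
  "/data/tempparquetdata/data1.parquet",
  "/data/tempparquetdata/data2.parquet",
  "/data/tempparquetdata/br.1455148800.0")

// Keep only files that pass the footer check; br.1455148800.0 would be dropped.
val goodFiles = candidates.filter(p => hasParquetMagic(fs, new Path(p)))
val newDataDF = sqlContext.read.parquet(goodFiles: _*)
```

Note this only catches files with a damaged tail; a file whose footer metadata is corrupt but whose trailing magic is intact would still need a try/catch around the read.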
    <pubDate>Fri, 30 Dec 2016 10:01:46 GMT</pubDate>
    <dc:creator>khyati_shah</dc:creator>
    <dc:date>2016-12-30T10:01:46Z</dc:date>
  </channel>
</rss>

