<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Heap error with large avro in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157559#M44877</link>
    <description>&lt;P&gt;I'm not sure the exact problem but have a couple of ideas.  When it works in the spark-shell, how are you starting up the session?&lt;/P&gt;</description>
    <pubDate>Tue, 01 Nov 2016 09:58:32 GMT</pubDate>
    <dc:creator>jwiden</dc:creator>
    <dc:date>2016-11-01T09:58:32Z</dc:date>
    <item>
      <title>Spark Heap error with large avro</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157558#M44876</link>
      <description>&lt;P&gt;I am executing the following command:&lt;/P&gt;&lt;P&gt;spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g&lt;/P&gt;&lt;P&gt;The file_1.avro file is about 1.5 GB, but it fails with files as small as 300 MB as well. &lt;/P&gt;&lt;P&gt;I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1, and I get an OOM error. Running from the spark-shell works fine.&lt;/P&gt;&lt;P&gt;Part of the huge stack trace:&lt;/P&gt;&lt;P&gt;java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
...&lt;/P&gt;&lt;P&gt;I have compiled this with Scala 2.10.6, with the following lines:&lt;/P&gt;&lt;P&gt;sqlContext.read.format("com.databricks.spark.avro").
option("header", "true").load(aPath + iAvroFile);&lt;/P&gt;&lt;P&gt;sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable + " USING com.databricks.spark.avro OPTIONS " +
"(path '" + aPath + iAvroFile + "')");&lt;/P&gt;&lt;P&gt;val counts_query = "SELECT id ID,count(id) " +
"HitCount,'" + fileDate + "' DateHour FROM " + tempTable + " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id";&lt;/P&gt;&lt;P&gt;val flight_counts = sqlContext.sql(counts_query);
flight_counts.show()  // OOM occurs here&lt;/P&gt;&lt;P&gt;I have tried many options and cannot get past this, e.g. --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4&lt;/P&gt;&lt;P&gt;Any ideas to get past this would help...&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2016 21:30:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157558#M44876</guid>
      <dc:creator>mak88</dc:creator>
      <dc:date>2016-10-31T21:30:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Heap error with large avro</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157559#M44877</link>
      <description>&lt;P&gt;I'm not sure the exact problem but have a couple of ideas.  When it works in the spark-shell, how are you starting up the session?&lt;/P&gt;</description>
      <pubDate>Tue, 01 Nov 2016 09:58:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157559#M44877</guid>
      <dc:creator>jwiden</dc:creator>
      <dc:date>2016-11-01T09:58:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Heap error with large avro</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157560#M44878</link>
      <description>&lt;P&gt;I made it past this. Unfortunately, I had added the extra options after the jar on the command line, so if you notice, the options "--master...memory 5g" were actually being passed as arguments to my jar instead of to spark-submit. Moving "Main.jar..." to the end fixed it.&lt;/P&gt;&lt;P&gt;Corrected cmd line:&lt;/P&gt;&lt;P&gt;spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro&lt;/P&gt;</description>
      <pubDate>Wed, 02 Nov 2016 01:47:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Heap-error-with-large-avro/m-p/157560#M44878</guid>
      <dc:creator>mak88</dc:creator>
      <dc:date>2016-11-02T01:47:29Z</dc:date>
    </item>
  </channel>
</rss>