Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Spark Heap error with large avro

Contributor

I am executing the following command:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g

The file_1.avro file is about 1.5 GB, but the job fails with files as small as 300 MB as well.

I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1, and I get an OOM error in both cases. Running the same logic from spark-shell works fine.

Part of the huge stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
    at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
    at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
    ...

I compiled this with Scala 2.10.6; the relevant lines are:

sqlContext.read.format("com.databricks.spark.avro").option("header", "true").load(aPath + iAvroFile);

sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable + " USING com.databricks.spark.avro OPTIONS " + "(path '" + aPath + iAvroFile + "')");

val counts_query = "SELECT id ID, count(id) HitCount, '" + fileDate + "' DateHour FROM " + tempTable + " WHERE Format LIKE CONCAT('%', 'BEST', '%') GROUP BY id";

val flight_counts = sqlContext.sql(counts_query);
flight_counts.show() // OOM happens here

I have tried many option combinations and cannot get past this, e.g.: --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4

Any ideas to get past this would help...

1 ACCEPTED SOLUTION

Contributor

I made it past this. Unfortunately, I had added the extra options to the end of the command line, and if you look closely, the options "--master...memory 5g" were actually being fed into my JAR as application arguments. I moved "Main.jar..." and its arguments to the end, and it works now.

Corrected command line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro
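For anyone hitting the same thing: spark-submit stops reading its own options at the application JAR, and everything after the JAR is handed to the application's main() as args. A minimal sketch of that split (my own illustration; splitArgs is a hypothetical helper, not a Spark API):

```scala
// Illustration only (not Spark source): model how spark-submit divides a
// command line. Options before the application JAR belong to spark-submit;
// everything after the JAR becomes the application's args array.
object SubmitOrdering {
  def splitArgs(argv: List[String]): (List[String], List[String]) = {
    val jarIdx = argv.indexWhere(_.endsWith(".jar"))
    if (jarIdx < 0) (argv, Nil)                       // no JAR found: all spark options
    else (argv.take(jarIdx), argv.drop(jarIdx + 1))   // (spark options, app args)
  }
}
```

With the original ordering, --master and the memory flags landed after Main.jar, so they arrived in the application's args array and the job ran with Spark's default memory settings, which is presumably why the heap filled up. With the corrected ordering, they reach spark-submit as intended.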


2 REPLIES

Super Collaborator

I'm not sure of the exact problem, but I have a couple of ideas. When it works in spark-shell, how are you starting up the session?

Contributor

I made it past this. Unfortunately, I had added the extra options to the end of the command line, and if you look closely, the options "--master...memory 5g" were actually being fed into my JAR as application arguments. I moved "Main.jar..." and its arguments to the end, and it works now.

Corrected command line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro