Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Spark Heap error with large avro

avatar
Contributor

I am executing the following command:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g

The file_1.avro file is about 1.5 Gb. But fails with files at 300 Mg as well.

I have tried running this on both HDP with Spark 1.4.1 and Spark 1.6.1 and I get OOM error. Running from spark-shell works fine.

Part of the huge stack trace:

java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3236) at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191) at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84) at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352) at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199) at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64) at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250) ...

I have compiled this with scala 2.10.6 with the following lines:

sqlContext.read.format("com.databricks.spark.avro"). option("header", "true").load(aPath + iAvroFile);

sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable + " USING com.databricks.spark.avro OPTIONS " + "(path '" + aPath + iAvroFile + "')");

val counts_query = "SELECT id ID,count(id) " + "HitCount,'" + fileDate + "' DateHour FROM " + tempTable + " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id";

val flight_counts = sqlContext.sql(counts_query); flight_counts.show() # OOM

I have tried many options and get cannot get past this. Ex: --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4

Any ideas would help to get past this...

1 ACCEPTED SOLUTION

avatar
Contributor

At the moment I made it past this...unfortunately I added the extra options to the end of the command line and if you notice, the options "--master...memory 5g" are actually being fed into my jar. So I just moved the "Main.jar..." to the end and it works now.

Corrrected cmd line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro

View solution in original post

2 REPLIES 2

avatar
Super Collaborator

I'm not sure the exact problem but have a couple of ideas. When it works in the spark-shell, how are you starting up the session?

avatar
Contributor

At the moment I made it past this...unfortunately I added the extra options to the end of the command line and if you notice, the options "--master...memory 5g" are actually being fed into my jar. So I just moved the "Main.jar..." to the end and it works now.

Corrrected cmd line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro