
Spark Heap error with large avro

Explorer

I am executing the following command:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g

The file_1.avro file is about 1.5 GB, but the job also fails with files as small as 300 MB.

I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1, and I get an OOM error in both cases. Running the same code from spark-shell works fine.

Part of the huge stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
    at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
    at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
    ...

I have compiled the jar with Scala 2.10.6; it contains the following lines:

sqlContext.read.format("com.databricks.spark.avro")
  .option("header", "true")
  .load(aPath + iAvroFile)

sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable +
  " USING com.databricks.spark.avro OPTIONS " +
  "(path '" + aPath + iAvroFile + "')")

val counts_query = "SELECT id ID, count(id) HitCount, '" + fileDate +
  "' DateHour FROM " + tempTable +
  " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id"

val flight_counts = sqlContext.sql(counts_query)
flight_counts.show() // OOM here
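For reference, the query above just counts rows per id whose Format column contains the substring "BEST". A toy pure-Scala sketch of the same aggregation (the Hit case class and sample rows are made up for illustration, not taken from the actual avro file):

```scala
// Toy illustration of what the SQL computes: a per-id row count over
// rows whose Format contains "BEST" (the LIKE CONCAT('%','BEST','%')
// clause). Data here is hypothetical.
case class Hit(id: String, format: String)

val hits = Seq(
  Hit("a", "xBESTx"),
  Hit("a", "BEST"),
  Hit("b", "other")
)

val hitCounts: Map[String, Int] =
  hits.filter(_.format.contains("BEST"))
      .groupBy(_.id)
      .map { case (id, rows) => id -> rows.size }
// "b" is dropped by the filter, so only "a" appears, with count 2
```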

I have tried many option combinations and cannot get past this, e.g. --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4.

Any ideas on how to get past this would help.

1 ACCEPTED SOLUTION

Explorer

I made it past this. Unfortunately, I had added the extra options to the end of the command line, so if you look closely, the options "--master ... --driver-memory 5g" were actually being fed into my jar as application arguments. I just moved "Main.jar --avro-file file_1.avro" to the end, and it works now.

Corrected command line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro
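The underlying rule: spark-submit treats the application jar as a boundary, and every token after it is passed to the application's main() rather than parsed by spark-submit. A small sketch of that rule (splitSubmitArgs is a hypothetical helper written only to illustrate the behavior, not a real Spark API):

```scala
// Hypothetical helper mimicking spark-submit's argument boundary: the
// token naming the application jar ends spark-submit's own options;
// everything after it becomes an application argument.
def splitSubmitArgs(tokens: List[String]): (List[String], List[String]) = {
  val jarIdx = tokens.indexWhere(_.endsWith(".jar"))
  (tokens.take(jarIdx + 1), tokens.drop(jarIdx + 1))
}

// The original (broken) ordering from the question:
val originalOrdering = List(
  "--packages", "com.databricks:spark-avro_2.10:2.0.1",
  "--class", "Main", "Main.jar",
  "--avro-file", "file_1.avro",
  "--master", "yarn", "--executor-memory", "5g", "--driver-memory", "5g"
)

val (sparkArgs, appArgs) = splitSubmitArgs(originalOrdering)
// appArgs holds "--master", "--executor-memory", "5g", etc.: Spark never
// saw the 5g settings, ran with its default memory, and hit the heap OOM.
```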


REPLIES

Expert Contributor

I'm not sure the exact problem but have a couple of ideas. When it works in the spark-shell, how are you starting up the session?

