
Spark Heap error with large avro


Explorer

I am executing the following command:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g

The file_1.avro file is about 1.5 GB, but the job fails with files around 300 MB as well.

I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1, and I get an OOM error in both cases. Running the same code from spark-shell works fine.

Part of the huge stack trace:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
    at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
    at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
    at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
    ...

I have compiled this with Scala 2.10.6; the relevant lines are:

sqlContext.read.format("com.databricks.spark.avro")
  .option("header", "true")
  .load(aPath + iAvroFile)

sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable +
  " USING com.databricks.spark.avro OPTIONS (path '" + aPath + iAvroFile + "')")

val counts_query = "SELECT id ID, count(id) HitCount, '" + fileDate +
  "' DateHour FROM " + tempTable +
  " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id"

val flight_counts = sqlContext.sql(counts_query)
flight_counts.show()  // OOM happens here

I have tried many option combinations and cannot get past this. For example: --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4

Any ideas that would get me past this would help...
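
One quick way to tell whether flags like --driver-memory are actually reaching Spark is to print the effective settings from inside the job. A minimal diagnostic sketch, assuming a SparkContext named sc (as spark-shell provides, or as Main would build before creating sqlContext):

// Diagnostic sketch: confirm the driver heap and Spark settings actually applied.
// Assumes a SparkContext named sc.
val driverHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
println(s"Driver JVM max heap: $driverHeapMb MB")

// Dump the configuration Spark actually received; if spark.driver.memory or
// spark.executor.memory is missing here, the command-line flags never took effect.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(k + " = " + v) }

If the dump shows defaults instead of 5g, the problem is with how the flags are being passed, not with the data size.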


2 REPLIES

Re: Spark Heap error with large avro

Expert Contributor

I'm not sure of the exact problem, but I have a couple of ideas. When it works in spark-shell, how are you starting up the session?

ACCEPTED SOLUTION

Re: Spark Heap error with large avro

Explorer

At the moment I have made it past this... unfortunately I had added the extra options to the end of the command line, and if you look closely, the options "--master ... --driver-memory 5g" were actually being fed into my jar as application arguments. So I just moved "Main.jar --avro-file file_1.avro" to the end, and it works now.

Corrected command line:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro
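
This matches spark-submit's documented behavior: everything after the application JAR is passed to the application's main method, so in the original ordering Main received --master, --executor-memory, and --driver-memory as its own arguments while Spark itself ran with default settings. A minimal sketch of a hypothetical argument parser for a Main like this (illustrative only; the real Main is not shown in the thread):

object Main {
  def main(args: Array[String]): Unit = {
    // With the broken ordering, args would have been:
    //   --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g
    // so the memory flags went to this parser instead of to spark-submit.
    val avroFile = args.sliding(2).collectFirst {
      case Array("--avro-file", path) => path
    }.getOrElse(sys.error("usage: Main --avro-file <path>"))
    println(s"Processing $avroFile")
    // ... build the SparkContext/SQLContext and run the Avro queries here ...
  }
}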

