
Spark submit: memory error "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Hello,
 
I'm facing an out-of-memory issue with one of my Spark jobs.
The script is very simple: a loop of 5 iterations manipulates a few DataFrames and joins them into a final DataFrame, which is returned.
 
The structure of the joins is as follows:
Table1.select(
         F.concat(F.col(col1), F.lit('_'),F.col(col2)).alias('key'),
         col1,
         col2,
         'typecle'
      )\
      .join( table2, 'key', 'left_outer' )\
      .groupBy(col1)\
      .agg(
         F.first( 'typecle' ).alias('typecle'),
         F.count( F.when(F.col(col3).isNotNull(),True)).alias(col31),
         F.count…
         )\
      .filter('typecle != val1')\
      .select(
         '*',
         F.when(F.col('typecle')== val2, val21)\
            .when(F.col('typecle')== val3, val31)\
            …
         F.lit(id_boucl1).alias(id2)
         )\
      .drop('typecle')\
      .write\
      .insertInto(table3, overwrite=True)
 
One of these DataFrames is persisted at the very beginning of the loop, then unpersisted at the end.
The volume of data involved is around 6M rows * 6.
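 
To make the structure clearer, here is a simplified sketch of the loop (table, column and value names are placeholders, and I assume a SparkSession named spark; only the persist/unpersist placement and the per-iteration write matter here):

from pyspark.sql import functions as F

ref_df = spark.table('table2')          # placeholder for the DataFrame that gets cached

for id_boucl1 in range(1, 6):           # 5 iterations
    ref_df.persist()                    # persisted at the very beginning of the loop

    (spark.table('table1')
        .select(
            F.concat(F.col('col1'), F.lit('_'), F.col('col2')).alias('key'),
            'col1', 'col2', 'typecle')
        .join(ref_df, 'key', 'left_outer')
        .groupBy('col1')
        .agg(F.first('typecle').alias('typecle'))   # plus the count aggregations shown above
        .filter("typecle != 'val1'")
        .write
        .insertInto('table3', overwrite=True))

    ref_df.unpersist()                  # unpersisted at the end of the loop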
 
I set the Spark configuration as follows:
SparkConf().set("spark.executor.memory", "15g")\
           .set("spark.executor.cores", "3")\
           .set("spark.driver.memory", "15g")\
           .set("spark.yarn.executor.memoryOverhead", "2g")\
           .set("spark.sql.hive.verifyPartitionPath", "true")\
           .set("spark.sql.autoBroadcastJoinThreshold", "-1")
 
Generally, the first 4 iterations work well, but the last one fails with an error like the following (regardless of the order in which the iterations are executed):
 
Exception in thread "dispatcher-event-loop-20" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
        at java.lang.StringBuilder.append(StringBuilder.java:202)
        at java.io.ObjectStreamClass.getClassSignature(ObjectStreamClass.java:1550)
        at java.io.ObjectStreamClass.getMethodSignature(ObjectStreamClass.java:1567)
        at java.io.ObjectStreamClass.access$2500(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$MemberSignature.<init>(ObjectStreamClass.java:1892)
        at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1819)
        at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:253)
        at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:251)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:250)
        at java.io.ObjectStreamClass.writeNonProxy(ObjectStreamClass.java:735)
        at java.io.ObjectOutputStream.writeClassDescriptor(ObjectOutputStream.java:668)
        at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1282)
        at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1231)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1427)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:246)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:451)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:431)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:431)
 
I looked through Cloudera Manager for a parameter that would give the garbage collector more room or run it sooner, but without success.
I found a "Client Java Heap" parameter, but changing it had no impact on the result.
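 
What I had in mind is something like the standard spark.executor.extraJavaOptions / spark.driver.extraJavaOptions properties, e.g. to switch the collector or get GC logs (I have not verified that this helps, it is just the kind of knob I mean):

# executor-side GC options can go into the same SparkConf as above
conf = conf.set("spark.executor.extraJavaOptions",
                "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")
# the driver-side equivalent has to be given at spark-submit time, e.g.
#   --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC"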
 
By tuning the Spark parameters the job runs faster, but the last iteration still raises the error.
 
 
Has anyone faced this problem?
Do you know of anything I could explore to fix this issue?
 
Thank you all for your answers.