
Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace


New Contributor


Hi all,

 

So I have a long-running job that performs a large series of fast tree-aggregations (each one taking between a few milliseconds and one minute). I am running into an issue where, roughly once a day, the whole thing crashes and burns and suddenly stops accepting new jobs.
 
[Screenshot attached: screen_shot_2016-07-11_at_9.04.01_am.png]
 
I checked my logs and found the following error:
 
16/07/11 15:46:28 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 23791.0 (TID 2381963, hdp01wn0003.p3.usw2.origami42.com): java.lang.OutOfMemoryError: Metaspace
 
 
 
 
I've also seen situations where there is some issue with TorrentBroadcast, but I can't seem to find or reproduce it right now.
 
Is there any known issue with tree-aggregation causing a memory leak? I was doing the same thing with accumulators before and was not running into this error (though I also changed the number of partitions from 72 to 256; I am testing that currently).
 
I am also running on Spark 1.3.1.
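 
For context, the aggregation loop looks roughly like the sketch below. This is illustrative only (the data, the aggregation function, and the loop are placeholders, not our actual job), and note that on Spark 1.3.x treeAggregate lives in org.apache.spark.mllib.rdd.RDDFunctions rather than directly on RDD, so the call site differs slightly there.

    // Illustrative sketch of the workload shape only -- not the real job.
    // Each iteration launches a fresh tree aggregation over the same cached RDD.
    val data = sparkContext.parallelize(1 to 1000000, 256).cache()

    for (i <- 1 to 10000) {
      val total = data.treeAggregate(0L)(
        (acc, v) => acc + v,   // seqOp: fold a partition's values into a local partial sum
        (a, b) => a + b,       // combOp: merge partial sums up the aggregation tree
        2)                     // depth of the aggregation tree
      println(s"iteration $i, total = $total")
    }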
 
Thank you.

Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

Master Collaborator

If you see a "Metaspace" error then you are running Java 8, and this should only happen if you have limited the metaspace size in JVM options. You shouldn't limit it in general. I don't think this indicates a memory leak.
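
For reference, a metaspace cap would come from something like the settings below (illustrative values only); if nothing like this appears in your spark-submit arguments, spark-defaults.conf, or extraJavaOptions, the metaspace is effectively unbounded.

    // Illustrative only: options like these are what would impose a metaspace cap.
    // You generally should not set them; they are shown here just as what to look for.
    sparkConf.set("spark.driver.extraJavaOptions", "-XX:MaxMetaspaceSize=256m")
    sparkConf.set("spark.executor.extraJavaOptions", "-XX:MaxMetaspaceSize=256m")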

Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

New Contributor

Hi @srowen,

 

So I just checked our spark-submit and I don't see anything that would limit the size of the Metaspace:

 

Here are the arguments used for the spark-submit:

 

    --conf spark.executor.extraClassPath=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
    --driver-class-path /etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar \
    --num-executors 20 \
    --driver-memory 20g \
    --executor-memory 12g \
    --executor-cores 2

 

 

And here are the manual settings used to generate the Spark context:

 

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkContext = {
      val sparkConf = new SparkConf()
        .setAppName(appName.getOrElse(jobName))
        .set("spark.driver.allowMultipleContexts", "true")
      if (!sparkConf.contains("spark.master")) {
        log.info(s"setting ${EnvConfig.sparkMaster} as spark master")
        log.info("SPARK_HOME " + System.getenv("SPARK_HOME"))
        sparkConf.setMaster(EnvConfig.sparkMaster)
      }
      Option(System.getenv("SPARK_HOME")).foreach(sparkConf.setSparkHome)
      sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      sparkConf.set("spark.kryoserializer.buffer.max.mb", "2000")
      sparkConf.set("spark.scheduler.mode", "FAIR")
      sparkConf.set("spark.yarn.am.memory", "14g")
      sparkConf.set("spark.yarn.am.cores", "4")
      sparkConf.set("spark.driver.extraJavaOptions", "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode")
      sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode")
      sparkConf.set("spark.storage.memoryFraction", "0.4")
      sparkConf.set("spark.yarn.executor.memoryOverhead", "4000")
      sparkConf.set("spark.driver.maxResultSize", "3g")
      sparkConf.set("spark.locality.wait.process", "120000")
      sparkConf.set("spark.driver.maxResultSize", "1536m") // note: this overrides the "3g" value set above

      new SparkContext(sparkConf)
    }

Is there anything here that might be imposing that limit?

Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

Master Collaborator

Check your Spark defaults conf file, if there is one, or look at the Environment tab and see what the final arguments to your executors actually are.
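
If it's easier than digging through the UI, a couple of lines in the driver will also print the resolved configuration (a sketch, assuming you can add this to your driver code and that sparkContext is your live context):

    // Print the resolved Spark properties that affect JVM options and memory,
    // so you can see what actually reaches the executors.
    sparkContext.getConf.getAll
      .filter { case (k, _) => k.contains("extraJavaOptions") || k.contains("memory") }
      .foreach { case (k, v) => println(s"$k = $v") }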

 

You might run "java -XX:+PrintFlagsFinal -version | grep Metaspace" to see your JVM's defaults. I had assumed it's unlimited by default (a very, very large value), but maybe that's not true on all JVMs. Run this with the GC settings you're using to see if, somehow, those cause the metaspace defaults to change.

 

For me, on OS X Java 8 (Oracle):

 

 

    uintx InitialBootClassLoaderMetaspaceSize       = 4194304                             {product}
    uintx MaxMetaspaceExpansion                     = 5451776                             {product}
    uintx MaxMetaspaceFreeRatio                     = 70                                  {product}
    uintx MaxMetaspaceSize                          = 18446744073709547520                    {product}
    uintx MetaspaceSize                             = 21807104                            {pd product}
    uintx MinMetaspaceExpansion                     = 339968                              {product}
    uintx MinMetaspaceFreeRatio                     = 40                                  {product}
     bool UseLargePagesInMetaspace                  = false                               {product}

 

Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

New Contributor

Thank you @srowen, I will try that (hopefully) this afternoon and get back to you ASAP.

 

One other question: I'd say that about half of the time my jobs are dying due to this driver issue, but the other half of the time they're dying because my executors fail with a

 

java.io.IOException: Failed to connect to <address>

error. My team thinks this might have something to do with TorrentBroadcast vs. HttpBroadcast. Have you seen anything like this?
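
One experiment we're considering, in case it helps pin this down, is forcing the older HTTP broadcast implementation to see whether the failures follow TorrentBroadcast (a sketch; spark.broadcast.factory is the Spark 1.x setting, and HttpBroadcastFactory was removed in later releases):

    // Experiment only: switch from the default TorrentBroadcastFactory to HTTP broadcast
    // to see whether the "Failed to connect" errors disappear with it.
    sparkConf.set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")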

 

Thank you again.

Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

New Contributor

Hi @srowen,

 

So I checked the executor and driver JVM settings and they appear to be the same as yours:

 

 

 

Metaspace
    uintx InitialBootClassLoaderMetaspaceSize       = 4194304         {product}
    uintx MaxMetaspaceExpansion                     = 5451776         {product}
    uintx MaxMetaspaceFreeRatio                     = 70              {product}
    uintx MaxMetaspaceSize                          = 18446744073709547520 {product}
    uintx MetaspaceSize                             = 21807104        {pd product}
    uintx MinMetaspaceExpansion                     = 339968          {product}
    uintx MinMetaspaceFreeRatio                     = 40              {product}
     bool UseLargePagesInMetaspace                  = false           {product}

 

I also checked the /etc/spark/conf/spark-defaults.conf in my executors and didn't find anything that would affect memory.

 

It's worth noting that in the sparkContextGenerator class we set the maxResultSize and memoryOverhead, so I'm not sure if either of those might be the source of this problem.


Re: Long-running tree-aggregation causes java.lang.OutOfMemoryError: Metaspace

Master Collaborator

So it can't be that you're actually hitting a limit, then; you'd have to leak through petabytes of memory. It's possible this means something like "couldn't allocate more memory in the metaspace", which would just mean that you ran out of host OS memory. AFAIK the metaspace now holds class definitions, and sure, a Spark program defines lots of classes, but I'm not sure it's so excessive that it runs out of memory. It might just be that you're generally running out of memory and this is where it manifested. You may need a smaller heap size; I don't think the JVM would GC before allocating and failing in the metaspace.
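
If it is host memory pressure, the per-executor footprint is the first thing I'd shrink, along these lines (illustrative numbers only, not a recommendation tuned to your workload):

    // Illustrative only: reduce the executor heap so that heap + overhead + metaspace
    // stay comfortably inside what the host / YARN container can actually provide.
    sparkConf.set("spark.executor.memory", "10g")
    sparkConf.set("spark.yarn.executor.memoryOverhead", "4096")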

 

maxResultSize just limits how much data comes back to the driver; it's not a cause. memoryOverhead is a YARN setting that exists to accommodate the difference between the heap size and total JVM memory -- which includes the metaspace -- but if that were the limit, you'd find that YARN had killed your job.
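
For concreteness, with the settings quoted earlier the per-executor container request works out roughly like this (approximate; exact rounding depends on the version):

    // Rough container math for the settings above, per executor.
    val executorHeapMb     = 12 * 1024   // --executor-memory 12g
    val memoryOverheadMb   = 4000        // spark.yarn.executor.memoryOverhead
    val containerRequestMb = executorHeapMb + memoryOverheadMb   // ~16288 MB requested from YARN
    // Metaspace, thread stacks and other off-heap usage must fit within that 4000 MB overhead;
    // if they don't, YARN kills the container, which is a different failure mode from this OOM.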

 

The other "couldn't connect" error could be lots of things, possibly related to your JVM failing for memory.