Spark not working when I'm using a big dataset

Explorer

Hi,

I'm trying to run a Spark application on YARN on a single-node instance with 32G of RAM. It's working well for a small dataset, but for a bigger table it's failing with this error:

 

Application application_1442094222971_0008 failed 2 times due to AM Container for appattempt_1442094222971_0008_000002 exited with exitCode: 11
For more detailed output, check application tracking page:http://chd.moneyball.guru:8088/proxy/application_1442094222971_0008/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1442094222971_0008_02_000001
Exit code: 11
Stack trace: ExitCodeException exitCode=11:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
 
 
Container exited with a non-zero exit code 11
Failing this attempt. Failing the application.
 
-----------------------------
 
Here is the stdout of the container:
[2015-09-12T20:53:28.368-04:00] [DataProcessing] [WARN] [] [org.apache.spark.Logging$class] [tid:Driver Heartbeater] [userID:yarn] Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@2c3b1696,BlockManagerId(2, chd.moneyball.guru, 60663))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
[2015-09-12T20:53:32.314-04:00] [DataProcessing] [WARN] [] [org.apache.spark.Logging$class] [tid:Executor task launch worker-0] [userID:yarn] Error sending message [message = GetLocations(rdd_4_1839)] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
    at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:221)
    at org.apache.spark.storage.BlockManagerMaster.getLocations(BlockManagerMaster.scala:70)
    at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:591)
    at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:578)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:622)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[2015-09-12T20:53:42.123-04:00] [DataProcessing] [ERROR] [] [org.apache.spark.Logging$class] [tid:sparkExecutor-akka.actor.default-dispatcher-3] [userID:yarn] Driver Disassociated [akka.tcp://sparkExecutor@chd.moneyball.guru:38443] -> [akka.tcp://sparkDriver@chd.moneyball.guru:43977] disassociated! Shutting down.
 
Any help?
 
Thanks

Master Collaborator
This basically says "the executor stopped for some reason." You'd have to dig into the application via YARN, click through to its entry in the history server, and browse those logs to see if you can find exceptions in the executor log. It sounds like the executor stopped responding. As a guess, you might be out of memory and stuck in GC thrashing.

Explorer

Thanks for your reply.

 

You're right. I saw this in the executor logs:

 

Exception in thread "qtp1529675476-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded

 

What can I do to fix this?

I'm using Spark on YARN and Spark memory allocation is dynamic.

Also, my Hive table is around 70G. Does that mean I need 70G of memory for Spark to process it?

Master Collaborator
In general it means the executors need more memory, but it's a fairly complex question. Maybe you need smaller tasks so that peak memory usage is lower, maybe you should cache less or use a lower max cache level, maybe the executors simply need more memory, or, at the margins, maybe better GC settings would help.

Usually the place to start is deciding whether your computation is inherently going to scale badly and run out of memory in a certain stage.
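
For reference, here is a minimal sketch of the kind of knobs involved (the memory values, the input path, and the partition count are placeholders, not recommendations for your workload):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Placeholder settings for Spark 1.x on YARN; tune for your own cluster.
val conf = new SparkConf()
  .setAppName("DataProcessing")
  .set("spark.executor.memory", "4g")            // more heap per executor (placeholder value)
  .set("spark.storage.memoryFraction", "0.4")    // leave more room for execution vs. caching (Spark 1.x setting)
val sc = new SparkContext(conf)

// Persist with a level that can spill to disk instead of holding everything
// deserialized in memory, and split the data into more, smaller tasks.
val rdd = sc.textFile("hdfs:///path/to/table")   // hypothetical path
  .repartition(400)                              // placeholder partition count
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

Whether more executor memory, fewer cached blocks, or more partitions is the right lever depends on which stage actually blows up.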

New Contributor

In your Spark UI, do you see it working with a large number of partitions (i.e., a large number of tasks)? It could be that you are loading all 70G into memory at once if you have a small number of partitions.

It could also be that you have one huge partition with 99% of the data and lots of small ones. When Spark processes that huge partition, it will load it all into memory. This can happen if you are mapping to a tuple, e.g. (x, y), and the key (x) is the same for 99% of the data.

Have a look at your Spark UI to see the size of the tasks you are running. It's likely that you will see either a small number of tasks, or one huge task and a lot of small ones.
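
If you'd rather check from the code side, here is a rough sketch (the input path and the choice of key are assumptions; adapt them to however you actually build the keyed RDD):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SkewCheck"))

// Build a keyed RDD; keying on the first comma-separated column is just an example.
val pairRdd = sc.textFile("hdfs:///path/to/table")   // hypothetical path
  .map(line => (line.split(",")(0), line))

// Ten most frequent keys: if one key dominates, a shuffle on it will
// produce one huge partition.
val topKeys = pairRdd
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .map(_.swap)
  .sortByKey(ascending = false)
  .take(10)
topKeys.foreach(println)

// Records per partition: a very uneven distribution here means a few
// tasks are doing almost all of the work.
val perPartition = pairRdd
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
perPartition.sortBy(-_._2).take(10).foreach(println)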