Spark not working when I'm using a big dataset
Labels:
- Apache Hadoop
- Apache Spark
- Apache YARN
Created on 09-12-2015 06:26 PM - edited 09-16-2022 02:40 AM
Hi,
I'm trying to run a Spark application on YARN on a single-node instance with 32G of RAM. It works well for a small dataset, but for a bigger table it fails with this error:
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
[2015-09-12T20:53:32.314-04:00] [DataProcessing] [WARN] [] [org.apache.spark.Logging$class] [tid:Executor task launch worker-0] [userID:yarn] Error sending message [message = GetLocations(rdd_4_1839)] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:221)
at org.apache.spark.storage.BlockManagerMaster.getLocations(BlockManagerMaster.scala:70)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:591)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:578)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:622)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-09-12T20:53:42.123-04:00] [DataProcessing] [ERROR] [] [org.apache.spark.Logging$class] [tid:sparkExecutor-akka.actor.default-dispatcher-3] [userID:yarn] Driver Disassociated [akka.tcp://sparkExecutor@chd.moneyball.guru:38443] -> [akka.tcp://sparkDriver@chd.moneyball.guru:43977] disassociated! Shutting down.
Created 09-13-2015 12:36 AM
You'll have to dig into the application via YARN, and click through to its entry in the history server, to browse those logs and see if you can find exceptions in the executor log. It sounds like the executor stopped responding. As a guess, you might be out of memory and stuck in GC thrashing.
Created 09-13-2015 11:59 AM
Thanks for your reply.
You are right. I saw this in the executor logs:
Exception in thread "qtp1529675476-45" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
What can I do to fix this?
I'm using Spark on YARN, and Spark memory allocation is dynamic.
Also, my Hive table is around 70G. Does that mean I need 70G of memory for Spark to process it?
Created 09-13-2015 12:04 PM
That's a complex question. Maybe you need smaller tasks so that peak memory usage is lower. Maybe cache less, or use a lower max cache level. Or use more executor memory. At the margins, maybe better GC settings. Usually the place to start is deciding whether your computation is inherently going to scale badly and run out of memory in a certain stage.
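For reference, here is roughly where each of those knobs lives in code. This is a minimal sketch assuming a Spark 1.x-era standalone app; the app name, input path, and all values are placeholders to illustrate the settings, not tuned recommendations.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("TuningSketch")
      .set("spark.executor.memory", "8g")       // more heap per executor
      .set("spark.default.parallelism", "400")  // smaller tasks => lower peak memory per task
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails")      // make GC behavior visible before tuning it
    val sc = new SparkContext(conf)

    // Cache less aggressively: serialized blocks that may spill to disk,
    // instead of the default deserialized MEMORY_ONLY level.
    val data = sc.textFile("hdfs:///path/to/table")  // placeholder path
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(data.count())
    sc.stop()
  }
}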
Created 09-14-2015 09:53 AM
In your Spark UI, do you see it working with a large number of partitions (a large number of tasks)? It could be that you are loading all 70G into memory at once if you have a small number of partitions.
It could also be that you have one huge partition with 99% of the data and lots of small ones. When Spark processes that huge partition, it will load it all into memory. This can happen if you are mapping to a tuple, e.g. (x, y), and the key (x) is the same for 99% of the data.
Have a look at your Spark UI to see the size of the tasks you are running. You will likely see either a small number of tasks overall, or one huge task and a lot of small ones.
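To make those two fixes concrete, here is a hedged sketch: raising the partition count, and "salting" a dominant key so its records spread across partitions. The input format, paths, and the 100-bucket salt are all hypothetical and would need tuning for real data.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SkewSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SkewSketch"))

    // Hypothetical input: tab-separated lines keyed by the first field,
    // where one key covers 99% of the data.
    val pairs = sc.textFile("hdfs:///path/to/input")  // placeholder path
      .map(line => (line.split('\t')(0), 1L))

    // Fix 1: more, smaller partitions, so no single task loads too much at once.
    val repartitioned = pairs.repartition(400)

    // Fix 2: salt the key so a dominant key spreads over many partitions;
    // aggregate per salted key first, then strip the salt and combine again.
    val salted  = repartitioned.map { case (k, v) => ((k, Random.nextInt(100)), v) }
    val partial = salted.reduceByKey(_ + _)
    val totals  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)

    totals.saveAsTextFile("hdfs:///path/to/output")   // placeholder path
    sc.stop()
  }
}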
