
Spark fails during persist()

Rising Star

Hi experts!

I'm running the following simple test script on my Spark cluster (yarn-client mode):

import org.apache.spark.storage.StorageLevel
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...");
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK_SER)
result.count()

 

The RDD is much larger than the available memory, but I specified the disk option.

After some time I start observing warnings like this:

15/07/08 23:20:38 WARN TaskSetManager: Lost task 33.1 in stage 0.0 (TID 104, host4): ExecutorLostFailure (executor 15 lost)

and finally I got:

 

15/07/08 23:14:41 INFO BlockManagerMasterActor: Registering block manager SomeHost2:16768 with 2.8 GB RAM, BlockManagerId(58, SomeHost2, 16768)
15/07/08 23:14:43 WARN TaskSetManager: Lost task 41.2 in stage 0.0 (TID 208, scaj43bda03.us.oracle.com): java.io.IOException: Failed to connect to SomeHost2/192.168.42.92:37305
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: SomeHost2/192.168.42.92:37305
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more

Honestly, I don't know where to start debugging...

I will appreciate any advice!

Thanks!

3 REPLIES

Rising Star

I found something in the YARN logs:

15/07/08 23:24:28 WARN spark.CacheManager: Persisting partition rdd_4_174 to disk instead.
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 171.0 in stage 0.0 (TID 236)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 WARN storage.BlockManager: Putting block rdd_4_174 failed
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255696059) called with curMem=418412, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_170 stored as bytes in memory (estimated size 243.9 MB, free 1875.5 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_170
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255621319) called with curMem=256114471, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_171 stored as bytes in memory (estimated size 243.8 MB, free 1631.7 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_171
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 171.0 in stage 0.0 (TID 236)

But I still have no idea why the executor started killing tasks...

Explorer
It's possible that you are overwhelming the CPU on the hosts by using StorageLevel.MEMORY_AND_DISK_SER, as this is a CPU-intensive storage strategy (see https://spark.apache.org/docs/1.3.0/programming-guide.html#rdd-persistence). Are you able to use deserialized objects instead? Using StorageLevel.MEMORY_AND_DISK will be less CPU intensive.
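
For illustration, a minimal sketch of the same test using the deserialized storage level (the dataset path is the truncated one from the question):

import org.apache.spark.storage.StorageLevel
// Same pipeline, but cached as deserialized Java objects; this uses more
// memory per partition, yet avoids the serialization CPU cost on store and access.
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...")
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK)
result.count()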

Rising Star

Actually, the problem was very aggressive caching overfilling the spark.yarn.executor.memoryOverhead buffer and, as a consequence, an OOM error.

I just increased it and everything works now.
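
For reference, a minimal sketch of how this overhead could be raised; the 2048 MB value and the app name are illustrative assumptions, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}
// spark.yarn.executor.memoryOverhead must be set before the SparkContext starts;
// from spark-shell or spark-submit the equivalent is:
//   --conf spark.yarn.executor.memoryOverhead=2048
val conf = new SparkConf()
  .setAppName("persist-test")                         // hypothetical app name
  .set("spark.yarn.executor.memoryOverhead", "2048")  // MB; tune to your executors
val sc = new SparkContext(conf)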