
Spark fails during persist()

Solved

Rising Star

Hi dear experts!

 

I am running the following simple test script on my Spark cluster (yarn-client mode):

import org.apache.spark.storage.StorageLevel

// Read a large TPC-DS table, reduce the partition count, and cache it,
// letting partitions that do not fit in memory spill to disk.
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...")
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK_SER)
result.count()

 

The RDD is much larger than the available memory, but I specified the disk option.

After some time I start observing warnings like this:

15/07/08 23:20:38 WARN TaskSetManager: Lost task 33.1 in stage 0.0 (TID 104, host4): ExecutorLostFailure (executor 15 lost)

and finally I got:

 

15/07/08 23:14:41 INFO BlockManagerMasterActor: Registering block manager SomeHost2:16768 with 2.8 GB RAM, BlockManagerId(58, SomeHost2, 16768)
15/07/08 23:14:43 WARN TaskSetManager: Lost task 41.2 in stage 0.0 (TID 208, scaj43bda03.us.oracle.com): java.io.IOException: Failed to connect to SomeHost2/192.168.42.92:37305
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: SomeHost2/192.168.42.92:37305
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more

Honestly, I don't know where to start debugging...

I will appreciate any advice!

Thanks!

2 ACCEPTED SOLUTIONS

Re: Spark fails during persist()

New Contributor
It's possible that you are overwhelming the CPU on the hosts by using StorageLevel.MEMORY_AND_DISK_SER, as this is a CPU-intensive storage strategy: https://spark.apache.org/docs/1.3.0/programming-guide.html#rdd-persistence Are you able to use deserialized objects instead? Using StorageLevel.MEMORY_AND_DISK will be less CPU intensive.
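For example, with the same input RDD as in your script, this would be only a one-line change (just a sketch, not tested against your data):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps partitions as deserialized Java objects, so no CPU is
// spent on serialization, but each cached partition takes more memory.
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK)
result.count()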



Re: Spark fails during persist()

Rising Star

Actually, the problem was very aggressive caching that overfilled the spark.yarn.executor.memoryOverhead buffer and, as a consequence, caused an OOM error.

I just increased it and everything works now.
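For reference, the overhead can be raised when launching the shell, e.g. (the 2048 MB value below is only an illustration, not a recommendation):

spark-shell --master yarn-client --conf spark.yarn.executor.memoryOverhead=2048

The value is in MB and is added on top of each executor's heap when YARN sizes the container, so it has to fit alongside the executor memory you already request.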


3 REPLIES

Re: Spark fails during persist()

Rising Star

I found something in the YARN logs:

15/07/08 23:24:28 WARN spark.CacheManager: Persisting partition rdd_4_174 to disk instead.
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 171.0 in stage 0.0 (TID 236)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 WARN storage.BlockManager: Putting block rdd_4_174 failed
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255696059) called with curMem=418412, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_170 stored as bytes in memory (estimated size 243.9 MB, free 1875.5 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_170
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255621319) called with curMem=256114471, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_171 stored as bytes in memory (estimated size 243.8 MB, free 1631.7 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_171
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 171.0 in stage 0.0 (TID 236)

But I still have no idea why the executor started killing tasks...

