Created on 07-08-2015 08:31 PM - edited 09-16-2022 02:33 AM
Hi dear experts!
I'm running the following simple test script on my Spark cluster (yarn-client mode):
import org.apache.spark.storage.StorageLevel
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...")
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK_SER)
result.count()
The RDD is much larger than the available memory, but I specified the disk option.
After some time I start observing warnings like this:
15/07/08 23:20:38 WARN TaskSetManager: Lost task 33.1 in stage 0.0 (TID 104, host4): ExecutorLostFailure (executor 15 lost)
and finally I got:
15/07/08 23:14:41 INFO BlockManagerMasterActor: Registering block manager SomeHost2:16768 with 2.8 GB RAM, BlockManagerId(58, SomeHost2, 16768)
15/07/08 23:14:43 WARN TaskSetManager: Lost task 41.2 in stage 0.0 (TID 208, scaj43bda03.us.oracle.com): java.io.IOException: Failed to connect to SomeHost2/192.168.42.92:37305
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: SomeHost2/192.168.42.92:37305
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	... 1 more
Honestly, I don't know where to start my debugging...
I'd appreciate any advice!
Thanks!
Created 07-15-2015 06:32 PM
Actually, the problem was very aggressive caching that overfilled the spark.yarn.executor.memoryOverhead buffer and, as a consequence, caused an OOM error.
I just increased it and everything works now.
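In case it helps anyone else, here is a minimal sketch of raising the overhead when the context is created. The 2048 MB value and the app name are illustrative assumptions, not the exact values from my cluster; in spark-shell the equivalent is launching with --conf spark.yarn.executor.memoryOverhead=2048.

import org.apache.spark.{SparkConf, SparkContext}

// spark.yarn.executor.memoryOverhead is the off-heap headroom (in MB, Spark 1.x)
// that YARN grants each executor container on top of spark.executor.memory.
// 2048 is only an example value; tune it to your workload.
val conf = new SparkConf()
  .setAppName("coalesce-persist-test") // hypothetical app name
  .set("spark.yarn.executor.memoryOverhead", "2048")
val sc = new SparkContext(conf)

With aggressive serialized caching, off-heap memory grows beyond the default overhead, YARN kills the container, and that surfaces as the ExecutorLostFailure above.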
Created 07-08-2015 08:44 PM
I found something in the YARN logs:
15/07/08 23:24:28 WARN spark.CacheManager: Persisting partition rdd_4_174 to disk instead.
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 171.0 in stage 0.0 (TID 236)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 WARN storage.BlockManager: Putting block rdd_4_174 failed
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255696059) called with curMem=418412, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_170 stored as bytes in memory (estimated size 243.9 MB, free 1875.5 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_170
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255621319) called with curMem=256114471, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_171 stored as bytes in memory (estimated size 243.8 MB, free 1631.7 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_171
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 171.0 in stage 0.0 (TID 236)
But I still have no idea why the executor started killing tasks...