
Spark fails during persist()

Rising Star

Hi dear experts!

I'm running the following simple test script on my Spark cluster (yarn-client mode):

import org.apache.spark.storage.StorageLevel

// Read the TPC-DS input, reduce it to 600 partitions, and cache it
// serialized in memory, spilling to disk when memory runs out.
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...")
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK_SER)
result.count()

 

The RDD is much larger than memory, but I specified the disk option.

After some time I start observing warnings like this:

15/07/08 23:20:38 WARN TaskSetManager: Lost task 33.1 in stage 0.0 (TID 104, host4): ExecutorLostFailure (executor 15 lost)

and finally I got:

 

15/07/08 23:14:41 INFO BlockManagerMasterActor: Registering block manager SomeHost2:16768 with 2.8 GB RAM, BlockManagerId(58, SomeHost2, 16768)
15/07/08 23:14:43 WARN TaskSetManager: Lost task 41.2 in stage 0.0 (TID 208, scaj43bda03.us.oracle.com): java.io.IOException: Failed to connect to SomeHost2/192.168.42.92:37305
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: SomeHost2/192.168.42.92:37305
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more

Honestly, I don't know where to start debugging...

I will appreciate any advice!

Thanks!


3 REPLIES

Rising Star

I found something in the YARN logs:

15/07/08 23:24:28 WARN spark.CacheManager: Persisting partition rdd_4_174 to disk instead.
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 171.0 in stage 0.0 (TID 236)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:29 INFO executor.Executor: Executor is trying to kill task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 WARN storage.BlockManager: Putting block rdd_4_174 failed
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 174.0 in stage 0.0 (TID 239)
15/07/08 23:24:29 INFO executor.Executor: Executor killed task 173.0 in stage 0.0 (TID 238)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255696059) called with curMem=418412, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_170 stored as bytes in memory (estimated size 243.9 MB, free 1875.5 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_170
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 170.0 in stage 0.0 (TID 235)
15/07/08 23:24:30 INFO storage.MemoryStore: ensureFreeSpace(255621319) called with curMem=256114471, maxMem=2222739947
15/07/08 23:24:30 INFO storage.MemoryStore: Block rdd_4_171 stored as bytes in memory (estimated size 243.8 MB, free 1631.7 MB)
15/07/08 23:24:30 INFO storage.BlockManagerMaster: Updated info of block rdd_4_171
15/07/08 23:24:30 INFO executor.Executor: Executor killed task 171.0 in stage 0.0 (TID 236)

But I still have no idea why the executor started killing tasks...

Explorer (Accepted solution)

It's possible that you are overwhelming the CPU on the hosts by using StorageLevel.MEMORY_AND_DISK_SER, as this is a CPU-intensive storage strategy (see https://spark.apache.org/docs/1.3.0/programming-guide.html#rdd-persistence). Are you able to use deserialized objects instead? Using StorageLevel.MEMORY_AND_DISK will be less CPU-intensive.
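
For illustration, the same test with the deserialized storage level swapped in might look like the sketch below (the truncated input path is the one from the question):

import org.apache.spark.storage.StorageLevel

// Same test as above, but storing partitions as deserialized Java objects.
// This avoids the serialization CPU cost at the price of a larger
// in-memory footprint per partition.
val input = sc.textFile("/user/hive/warehouse/tpc_ds_3T/...")
val result = input.coalesce(600).persist(StorageLevel.MEMORY_AND_DISK)
result.count()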

Rising Star (Accepted solution)

Actually, the problem was very aggressive caching overfilling the spark.yarn.executor.memoryOverhead buffer and, as a consequence, an OOM error.

I just increased it and everything works now.
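
For anyone hitting the same thing: the overhead is set per application, for example in spark-defaults.conf or on the command line when starting the shell. The 4096 MB value below is only an illustrative guess, not the value used here:

# In spark-defaults.conf (value is in MB):
spark.yarn.executor.memoryOverhead  4096

# Or when launching the shell / application:
spark-shell --master yarn-client --conf spark.yarn.executor.memoryOverhead=4096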