Number of tasks on executors becomes negative after executor failures


Summary:

I am running Spark 1.5 on CDH 5.5.1. Under extreme load, I intermittently get the connection failure exception below, and afterwards the Spark UI shows a negative number of active tasks on the executors.

Exception:

TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 76ms
INFO : org.apache.spark.network.client.TransportClientFactory - Found inactive connection to xxxx/xxx.xxx.xxx.xxxx, creating a new one.
ERROR: org.apache.spark.network.shuffle.RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to xxxx/xxx.xxx.xxx.xxxx
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: xxxx/xxx.xxx.xxx.xxxx
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more

Related Defects:

https://issues.apache.org/jira/browse/SPARK-2319

https://issues.apache.org/jira/browse/SPARK-9591

Attached screenshot: SparkNegativeActiveProcess.jpg