Massive errors on Spark shuffle and connection reset by peer

Explorer

I am experiencing massive shuffle errors and "connection reset by peer" IO exceptions when running a map/reduce word count on a big dataset. The same job works with a small dataset. I looked around on this forum as well as other places but could not find an answer to this problem. Hopefully, anyone who has a solution can shed some light. I am using hadoop-2.8.0 and spark-2.1.1 on an Ubuntu-based cluster with 152 vcores and 848 GB of memory across 12 nodes. YARN Scheduler Metrics reports memory:1024, vCores:1 as the minimum allocation and memory:102400, vCores:32 as the maximum allocation.

This is how I submit the job:

spark-submit --master yarn --deploy-mode cluster \
        --queue long \
        --num-executors 4 \
        --driver-cores 6 \
        --executor-cores 4 \
        --driver-memory 6g \
        --executor-memory 4g \
        1gram_count.py
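
For context, the script itself is not included in this post; a minimal PySpark word count consistent with the traceback below (line 14 is wordcount = words.reduceByKey(add).collect()) might look like the following sketch. The input path and tokenization are assumptions, not the actual 1gram_count.py.

# Sketch of a word-count job matching the stack trace below; not the actual 1gram_count.py.
# The input path and the whitespace tokenization are assumptions for illustration.
import sys
from operator import add

from pyspark import SparkContext

def main(sc, filename):
    lines = sc.textFile(filename)                         # read the (large) input file
    words = lines.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1))             # emit (word, 1) pairs
    wordcount = words.reduceByKey(add).collect()          # shuffle + collect: the step that fails here
    for word, count in wordcount[:20]:
        print("%s\t%d" % (word, count))

if __name__ == "__main__":
    filename = sys.argv[1] if len(sys.argv) > 1 else "hdfs:///data/1grams.txt"  # assumed path
    sc = SparkContext(appName="1gram_count")
    main(sc, filename)
    sc.stop()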

Below are the output error and a brief excerpt from the job error log.

OUTPUT ERROR:

Traceback (most recent call last):
  File "1gram_count.py", line 43, in <module>
    main(sc, filename)
  File "1gram_count.py", line 14, in main
    wordcount = words.reduceByKey(add).collect()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 808, in collect
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1 (collect at 1gram_count.py:14) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163)
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
	at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
	at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
	at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146)
	at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more


	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1262)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

ERROR LOG:

...
17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 22.0 in stage 1.0 (TID 165, rnode4.mycluster.com, executor 5): FetchFailed(BlockManagerId(6, rnode6.mycluster.com, 43712, None), shuffleId=0, mapId=129, reduceId=22, message=
org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more
)
17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 22.0 in stage 1.0 (TID 165) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
...
17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 28.0 in stage 1.0 (TID 171, tnode4.mycluster.com, executor 3): FetchFailed(BlockManagerId(5, rnode4.mycluster.com, 43171, None), shuffleId=0, mapId=40, reduceId=28, message=
org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163)
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
	at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
	at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
	at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146)
	at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more
)
17/05/16 09:54:51 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 28.0 in stage 1.0 (TID 171) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/05/16 09:54:51 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 1 (71/143, false)
...
17/05/16 09:55:22 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) finished in 9.098 s
17/05/16 09:55:22 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/05/16 09:55:22 INFO scheduler.DAGScheduler: running: Set()
17/05/16 09:55:22 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
17/05/16 09:55:22 INFO scheduler.DAGScheduler: failed: Set(ShuffleMapStage 0, ResultStage 1)
17/05/16 09:55:22 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) because some of its tasks had failed: 0, 2, 3, 5, 7, 8, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 33, 35, 36, 37, 40, 42, 44, 45, 46, 47, 48, 49, 54, 56, 57, 59, 61, 63, 67, 69, 70, 72, 75, 78, 81, 84, 85, 87, 88, 89, 91, 92, 97, 98, 99, 104, 106, 107, 114, 116, 117, 120, 125, 127, 130, 133, 136, 137, 139, 140, 141, 142
...
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) and ResultStage 1 (collect at 1gram_count.py:14) due to fetch failure
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
17/05/16 09:55:39 INFO scheduler.TaskSetManager: Task 0.0 in stage 1.2 (TID 417) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, cage.mycluster.com, 43545, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 8 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 8 (133/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, gardel.mycluster.com, 56871, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 7 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 7 (118/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, rnode6.mycluster.com, 43712, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 6 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 6 (93/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting failed stages
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14), which has no missing parents
17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 9.1 KB, free 3.0 GB)
17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 5.8 KB, free 3.0 GB)
17/05/16 09:55:39 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 13.15.17.113:38097 (size: 5.8 KB, free: 3.0 GB)
17/05/16 09:55:39 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:996
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14)
17/05/16 09:55:39 INFO cluster.YarnClusterScheduler: Adding task set 0.4 with 50 tasks
17/05/16 09:55:41 INFO storage.BlockManagerMasterEndpoint: Registering block manager gardel.mycluster.com:56871 with 2004.6 MB RAM, BlockManagerId(7, gardel.mycluster.com, 56871, None)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gardel.mycluster.com:56871 (size: 29.0 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerMasterEndpoint: Registering block manager tnode5.mycluster.com:50922 with 2004.6 MB RAM, BlockManagerId(4, tnode5.mycluster.com, 50922, None)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on tnode5.mycluster.com:50922 (size: 29.0 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.4 (TID 450, rnode1.mycluster.com, executor 1, partition 7, NODE_LOCAL, 6061 bytes)
...
17/05/16 09:56:08 ERROR yarn.ApplicationMaster: User application exited with status 1
17/05/16 09:56:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...
17/05/16 09:56:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/05/16 09:56:08 INFO server.ServerConnector: Stopped Spark@295a0399{HTTP/1.1}{0.0.0.0:0}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4650135{/stages/stage/kill,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7fa661e4{/jobs/job/kill,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@12cbc3d7{/api,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7ed7eff8{/,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@19854e4e{/static,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@d795813{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e534778{/executors/threadDump,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5555f52c{/executors/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7dfe96ab{/executors,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6966dbd{/environment/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@28097bac{/environment,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@722feb6d{/storage/rdd/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5854bbf{/storage/rdd,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@332fc652{/storage/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@574bd394{/storage,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@71506d9a{/stages/pool/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4daa72b3{/stages/pool,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a71edaa{/stages/stage/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@e89d2da{/stages/stage,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@70e609be{/stages/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f7ca7b3{/stages,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@77d95d4d{/jobs/job/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76b92f6f{/jobs/job,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7330862d{/jobs/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76828613{/jobs,null,UNAVAILABLE,@Spark}
...
17/05/16 09:56:08 INFO ui.SparkUI: Stopped Spark web UI at http://13.15.17.113:35816
...
17/05/16 09:56:08 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,3,ResultTask,FetchFailed(BlockManagerId(2, tnode3.mycluster.com, 44720, None),0,6,2,org.apache.spark.shuffle.FetchFailedException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:247)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
),org.apache.spark.scheduler.TaskInfo@4a03abd0,null)
17/05/16 09:56:08 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/05/16 09:56:08 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
17/05/16 09:56:08 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/05/16 09:56:08 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/05/16 09:56:08 ERROR server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
	at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
	at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
...
17/05/16 09:56:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/05/16 09:56:08 INFO memory.MemoryStore: MemoryStore cleared
17/05/16 09:56:08 INFO storage.BlockManager: BlockManager stopped
17/05/16 09:56:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/05/16 09:56:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/05/16 09:56:08 INFO spark.SparkContext: Successfully stopped SparkContext
17/05/16 09:56:09 INFO util.ShutdownHookManager: Shutdown hook called
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f/pyspark-5a8a2155-4e7a-4099-aaf6-d96bed36c671
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs2/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-a16dd72a-70b4-4da4-96c3-7c655f9eb23
1 ACCEPTED SOLUTION

Explorer

OK, I think I have found the answer to this problem: it is caused by a Netty version mismatch. I found one more piece of evidence as I dug into each node's log. Here is the node's error log (which did not show up in the YARN log):

17/05/17 11:54:07 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=790775716012, chunkIndex=11}, buffer=FileSegmentManagedBuffer{file=/hdfs2/yarn/local/usercache/chiachi/appcache/application_1495046487864_0001/blockmgr-7f6bd1f5-a0c7-43e5-a9a5-3bd2a9d8abbd/19/shuffle_0_54_0.data, offset=5963, length=1246}} to /10.16.0.9:44851; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:651)
	at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:266)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
	at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
	at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:133)
	at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
	at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
	... 34 more

Why was there a Netty version discrepancy? I had upgraded both Hadoop and Spark, but I left HDFS running (with the old Netty version). After restarting HDFS and YARN, this error no longer occurred.
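
For anyone hitting the same symptom, one quick way to confirm this kind of mismatch is to list the Netty jars that Hadoop and Spark actually ship on each node and compare their versions. A rough sketch follows; the install directories are assumptions about the layout (only /opt/spark appears in the traceback above) and should be adjusted to the actual cluster.

# Sketch: list the Netty jars bundled under assumed Hadoop and Spark install
# directories so the versions can be compared across components and nodes.
# The paths below are assumptions, not confirmed locations on this cluster.
import os

roots = {
    "hadoop": "/opt/hadoop",   # assumed HADOOP_HOME; adjust to your layout
    "spark": "/opt/spark",     # assumed SPARK_HOME; /opt/spark matches the traceback paths
}

for name, root in sorted(roots.items()):
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for fn in filenames:
            if fn.startswith("netty") and fn.endswith(".jar"):
                found.add(fn)
    print(name)
    for jar in sorted(found):
        print("    " + jar)

If the two components report different Netty versions, that would line up with the NoSuchMethodError on io.netty.channel.DefaultFileRegion shown above.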


REPLIES

Explorer

Hello,

Could you please help me with the steps to upgrade Netty on HDFS/Hadoop and Spark? I am facing the same issue and do not have the steps to upgrade the Netty version.

Thanks,
Sukrut

Community Manager

@BigData-suk, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.



Regards,

Vidya Sargur,
Community Manager

