<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Massive errors on spark shuffle and connection reset by peer in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/330304#M230636</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you please share the steps to upgrade Netty for HDFS/Hadoop &amp;amp; Spark? I am facing the same issue and do not have the steps to upgrade the Netty version.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Sukrut&lt;/P&gt;</description>
    <pubDate>Tue, 16 Nov 2021 11:48:14 GMT</pubDate>
    <dc:creator>BigData-suk</dc:creator>
    <dc:date>2021-11-16T11:48:14Z</dc:date>
    <item>
      <title>Massive errors on spark shuffle and connection reset by peer</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/176025#M138282</link>
      <description>&lt;P&gt;I am experiencing massive shuffle errors and "connection reset by peer" IOExceptions when running a map/reduce word count on a large dataset.  The same job works on a small dataset.  I have searched this forum and elsewhere but could not find an answer to this problem.  Hopefully someone who has solved it can shed some light.  I am using hadoop-2.8.0 and spark-2.1.1 on an Ubuntu-based cluster with 152 vcores and 848GB of memory across 12 nodes.  YARN Scheduler Metrics reports a minimum allocation of memory:1024, vCores:1 and a maximum allocation of memory:102400, vCores:32.&lt;/P&gt;&lt;P&gt;Here is how I submit the job.&lt;/P&gt;&lt;PRE&gt;spark-submit --master yarn --deploy-mode cluster \
        --queue long \
        --num-executors 4 \
        --driver-cores 6 \
        --executor-cores 4 \
        --driver-memory 6g \
        --executor-memory 4g \
        1gram_count.py
&lt;/PRE&gt;&lt;P&gt;Below are the output error and a brief excerpt from the job error log.&lt;/P&gt;&lt;P&gt;OUTPUT ERROR:&lt;/P&gt;&lt;PRE&gt;Traceback (most recent call last):
  File "1gram_count.py", line 43, in &amp;lt;module&amp;gt;
    main(sc, filename)
  File "1gram_count.py", line 14, in main
    wordcount = words.reduceByKey(add).collect()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 808, in collect
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1 (collect at 1gram_count.py:14) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163)
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
	at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
	at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
	at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146)
	at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more


	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1262)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
&lt;/PRE&gt;&lt;P&gt;ERROR LOG:&lt;/P&gt;&lt;PRE&gt;...
17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 22.0 in stage 1.0 (TID 165, rnode4.mycluster.com, executor 5): FetchFailed(BlockManagerId(6, rnode6.mycluster.com, 43712, None), shuffleId=0, mapId=129, reduceId=22, message=
org.apache.spark.shuffle.FetchFailedException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more
)
17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 22.0 in stage 1.0 (TID 165) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
...
17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 28.0 in stage 1.0 (TID 171, tnode4.mycluster.com, executor 3): FetchFailed(BlockManagerId(5, rnode4.mycluster.com, 43171, None), shuffleId=0, mapId=40, reduceId=28, message=
org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163)
	at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
	at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621)
	at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
	at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
	at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146)
	at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492)
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	... 24 more
)
17/05/16 09:54:51 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 28.0 in stage 1.0 (TID 171) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/05/16 09:54:51 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 1 (71/143, false)
...
17/05/16 09:55:22 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) finished in 9.098 s
17/05/16 09:55:22 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/05/16 09:55:22 INFO scheduler.DAGScheduler: running: Set()
17/05/16 09:55:22 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
17/05/16 09:55:22 INFO scheduler.DAGScheduler: failed: Set(ShuffleMapStage 0, ResultStage 1)
17/05/16 09:55:22 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) because some of its tasks had failed: 0, 2, 3, 5, 7, 8, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 33, 35, 36, 37, 40, 42, 44, 45, 46, 47, 48, 49, 54, 56, 57, 59, 61, 63, 67, 69, 70, 72, 75, 78, 81, 84, 85, 87, 88, 89, 91, 92, 97, 98, 99, 104, 106, 107, 114, 116, 117, 120, 125, 127, 130, 133, 136, 137, 139, 140, 141, 142
...
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) and ResultStage 1 (collect at 1gram_count.py:14) due to fetch failure
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
17/05/16 09:55:39 INFO scheduler.TaskSetManager: Task 0.0 in stage 1.2 (TID 417) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, cage.mycluster.com, 43545, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 8 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 8 (133/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, gardel.mycluster.com, 56871, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 7 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 7 (118/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 80)
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, rnode6.mycluster.com, 43712, None)
17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 6 (epoch 80)
17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 6 (93/143, false)
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting failed stages
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14), which has no missing parents
17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 9.1 KB, free 3.0 GB)
17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 5.8 KB, free 3.0 GB)
17/05/16 09:55:39 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 13.15.17.113:38097 (size: 5.8 KB, free: 3.0 GB)
17/05/16 09:55:39 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:996
17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14)
17/05/16 09:55:39 INFO cluster.YarnClusterScheduler: Adding task set 0.4 with 50 tasks
17/05/16 09:55:41 INFO storage.BlockManagerMasterEndpoint: Registering block manager gardel.mycluster.com:56871 with 2004.6 MB RAM, BlockManagerId(7, gardel.mycluster.com, 56871, None)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gardel.mycluster.com:56871 (size: 29.0 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerMasterEndpoint: Registering block manager tnode5.mycluster.com:50922 with 2004.6 MB RAM, BlockManagerId(4, tnode5.mycluster.com, 50922, None)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on tnode5.mycluster.com:50922 (size: 29.0 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB)
17/05/16 09:55:42 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.4 (TID 450, rnode1.mycluster.com, executor 1, partition 7, NODE_LOCAL, 6061 bytes)
...
17/05/16 09:56:08 ERROR yarn.ApplicationMaster: User application exited with status 1
17/05/16 09:56:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...
17/05/16 09:56:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/05/16 09:56:08 INFO server.ServerConnector: Stopped Spark@295a0399{HTTP/1.1}{0.0.0.0:0}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4650135{/stages/stage/kill,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7fa661e4{/jobs/job/kill,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@12cbc3d7{/api,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7ed7eff8{/,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@19854e4e{/static,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@d795813{/executors/threadDump/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e534778{/executors/threadDump,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5555f52c{/executors/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7dfe96ab{/executors,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6966dbd{/environment/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@28097bac{/environment,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@722feb6d{/storage/rdd/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5854bbf{/storage/rdd,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@332fc652{/storage/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@574bd394{/storage,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@71506d9a{/stages/pool/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4daa72b3{/stages/pool,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a71edaa{/stages/stage/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@e89d2da{/stages/stage,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@70e609be{/stages/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f7ca7b3{/stages,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@77d95d4d{/jobs/job/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76b92f6f{/jobs/job,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7330862d{/jobs/json,null,UNAVAILABLE,@Spark}
17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76828613{/jobs,null,UNAVAILABLE,@Spark}
...
17/05/16 09:56:08 INFO ui.SparkUI: Stopped Spark web UI at &lt;A href="http://13.15.17.113:35816" target="_blank"&gt;http://13.15.17.113:35816&lt;/A&gt;
...
17/05/16 09:56:08 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,3,ResultTask,FetchFailed(BlockManagerId(2, tnode3.mycluster.com, 44720, None),0,6,2,org.apache.spark.shuffle.FetchFailedException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
	at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
	at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: java.io.IOException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:247)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
),org.apache.spark.scheduler.TaskInfo@4a03abd0,null)
17/05/16 09:56:08 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
17/05/16 09:56:08 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
17/05/16 09:56:08 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
17/05/16 09:56:08 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
17/05/16 09:56:08 ERROR server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
	at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
	at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
...
17/05/16 09:56:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/05/16 09:56:08 INFO memory.MemoryStore: MemoryStore cleared
17/05/16 09:56:08 INFO storage.BlockManager: BlockManager stopped
17/05/16 09:56:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/05/16 09:56:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/05/16 09:56:08 INFO spark.SparkContext: Successfully stopped SparkContext
17/05/16 09:56:09 INFO util.ShutdownHookManager: Shutdown hook called
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f/pyspark-5a8a2155-4e7a-4099-aaf6-d96bed36c671
17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs2/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-a16dd72a-70b4-4da4-96c3-7c655f9eb23
&lt;/PRE&gt;</description>
      <pubDate>Thu, 18 May 2017 19:42:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/176025#M138282</guid>
      <dc:creator>jhung0077</dc:creator>
      <dc:date>2017-05-18T19:42:56Z</dc:date>
    </item>
    <item>
      <title>Re: Massive errors on spark shuffle and conneciton reset by peer</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/176026#M138283</link>
      <description>&lt;P&gt;OK, I think I have found the answer to this problem: it is due to a Netty version mismatch.  I found one more piece of evidence as I dug into each node's logs.  Here is the node's error log (which did not show up in the YARN log).&lt;/P&gt;&lt;PRE&gt;17/05/17 11:54:07 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=790775716012, chunkIndex=11}, buffer=FileSegmentManagedBuffer{file=/hdfs2/yarn/local/usercache/chiachi/appcache/application_1495046487864_0001/blockmgr-7f6bd1f5-a0c7-43e5-a9a5-3bd2a9d8abbd/19/shuffle_0_54_0.data, offset=5963, length=1246}} to /10.16.0.9:44851; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.&amp;lt;init&amp;gt;(Ljava/io/File;JJ)V
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:651)
	at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:266)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706)
	at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895)
	at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240)
	at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
	at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.&amp;lt;init&amp;gt;(Ljava/io/File;JJ)V
	at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:133)
	at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
	at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89)
	... 34 more&lt;/PRE&gt;&lt;P&gt;Why was there a Netty version discrepancy?  I upgraded both Hadoop and Spark but left HDFS running (with the old Netty version).  After restarting HDFS and YARN, this error no longer occurred.&lt;/P&gt;</description>
      <pubDate>Sat, 20 May 2017 05:48:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/176026#M138283</guid>
      <dc:creator>jhung0077</dc:creator>
      <dc:date>2017-05-20T05:48:37Z</dc:date>
    </item>
    <item>
      <title>Re: Massive errors on spark shuffle and connection reset by peer</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/330304#M230636</link>
      <description>&lt;P&gt;Hello&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you please share the steps to upgrade Netty on HDFS/Hadoop &amp;amp; Spark? I am facing the same issue and don't have the steps to upgrade the Netty version.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Sukrut&lt;/P&gt;</description>
      <pubDate>Tue, 16 Nov 2021 11:48:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/330304#M230636</guid>
      <dc:creator>BigData-suk</dc:creator>
      <dc:date>2021-11-16T11:48:14Z</dc:date>
    </item>
    <item>
      <title>Re: Massive errors on spark shuffle and connection reset by peer</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/330389#M230661</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/90704"&gt;@BigData-suk&lt;/a&gt;,&amp;nbsp;as this is an older post, you would have a better chance of receiving a resolution by&lt;A href="https://community.cloudera.com/t5/forums/postpage/board-id/Questions" target="_blank"&gt; starting a new thread&lt;/A&gt;. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Nov 2021 06:28:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Massive-errors-on-spark-shuffle-and-conneciton-reset-by-peer/m-p/330389#M230661</guid>
      <dc:creator>VidyaSargur</dc:creator>
      <dc:date>2021-11-17T06:28:11Z</dc:date>
    </item>
  </channel>
</rss>