Created 05-18-2017 12:42 PM
I am seeing massive shuffle errors and "Connection reset by peer" IOExceptions while running a map/reduce word count on a big dataset. The same job works on a small dataset. I have looked around on this forum as well as other places but could not find an answer to this problem; hopefully anyone who has solved it can shed some light. I am using hadoop-2.8.0 and spark-2.1.1 on an Ubuntu-based cluster with 152 vcores and 848 GB of memory across 12 nodes. YARN Scheduler Metrics reports memory:1024, vCores:1 as the minimum allocation and memory:102400, vCores:32 as the maximum allocation.
Following is how I submit the job.
spark-submit --master yarn --deploy-mode cluster \
  --queue long \
  --num-executors 4 \
  --driver-cores 6 \
  --executor-cores 4 \
  --driver-memory 6g \
  --executor-memory 4g \
  1gram_count.py
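For context, the traceback below points at words.reduceByKey(add).collect() inside main(sc, filename) at line 14 of 1gram_count.py. The script is essentially a standard PySpark word count along these lines (a minimal sketch reconstructed from the traceback, not the exact file):

from operator import add
import sys
from pyspark import SparkContext

def main(sc, filename):
    # Split each line into words and emit (word, 1) pairs.
    words = sc.textFile(filename) \
              .flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1))
    # The call that fails in the traceback: reduceByKey forces a full shuffle,
    # and collect() pulls every (word, count) pair back to the driver.
    wordcount = words.reduceByKey(add).collect()
    for word, count in wordcount:
        print(word, count)

if __name__ == "__main__":
    filename = sys.argv[1]
    sc = SparkContext(appName="1gram_count")
    main(sc, filename)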
Below are the output error and a brief excerpt from the job error log.
OUTPUT ERROR:
Traceback (most recent call last): File "1gram_count.py", line 43, in <module> main(sc, filename) File "1gram_count.py", line 14, in main wordcount = words.reduceByKey(add).collect() File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 808, in collect File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: ResultStage 1 (collect at 1gram_count.py:14) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=1317168709388, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:40448: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163) at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621) at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706) at 
io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240) ... 24 more at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1262) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.collect(RDD.scala:935) at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745)
ERROR LOG:
... 17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 22.0 in stage 1.0 (TID 165, rnode4.mycluster.com, executor 5): FetchFailed(BlockManagerId(6, rnode6.mycluster.com, 43712, None), shuffleId=0, mapId=129, reduceId=22, message= org.apache.spark.shuffle.FetchFailedException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676) at 
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240) ... 24 more ) 17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 22.0 in stage 1.0 (TID 165) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed. ... 17/05/16 09:54:51 WARN scheduler.TaskSetManager: Lost task 28.0 in stage 1.0 (TID 171, tnode4.mycluster.com, executor 3): FetchFailed(BlockManagerId(5, rnode4.mycluster.com, 43171, None), shuffleId=0, mapId=40, reduceId=28, message= org.apache.spark.shuffle.FetchFailedException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: java.io.IOException: Failed to send request StreamChunkId{streamId=607660513078, chunkIndex=6} to rnode4.mycluster.com/13.15.17.96:43171: java.io.IOException: Connection reset by peer at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:163) at org.apache.spark.network.client.TransportClient$1.operationComplete(TransportClient.java:147) 
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) at io.netty.util.concurrent.DefaultPromise.notifyLateListener(DefaultPromise.java:621) at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:138) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93) at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28) at org.apache.spark.network.client.TransportClient.fetchChunk(TransportClient.java:146) at org.apache.spark.network.shuffle.OneForOneBlockFetcher$1.onSuccess(OneForOneBlockFetcher.java:103) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:176) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:492) at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:270) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:707) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:315) at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:676) at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1059) at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:669) at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117) at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:688) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:718) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240) ... 24 more ) 17/05/16 09:54:51 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 1 (epoch 1) 17/05/16 09:54:51 INFO scheduler.TaskSetManager: Task 28.0 in stage 1.0 (TID 171) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed. 17/05/16 09:54:51 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 1 (71/143, false) ... 17/05/16 09:55:22 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) finished in 9.098 s 17/05/16 09:55:22 INFO scheduler.DAGScheduler: looking for newly runnable stages 17/05/16 09:55:22 INFO scheduler.DAGScheduler: running: Set() 17/05/16 09:55:22 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1) 17/05/16 09:55:22 INFO scheduler.DAGScheduler: failed: Set(ShuffleMapStage 0, ResultStage 1) 17/05/16 09:55:22 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) because some of its tasks had failed: 0, 2, 3, 5, 7, 8, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 33, 35, 36, 37, 40, 42, 44, 45, 46, 47, 48, 49, 54, 56, 57, 59, 61, 63, 67, 69, 70, 72, 75, 78, 81, 84, 85, 87, 88, 89, 91, 92, 97, 98, 99, 104, 106, 107, 114, 116, 117, 120, 125, 127, 130, 133, 136, 137, 139, 140, 141, 142 ... 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 0 (reduceByKey at 1gram_count.py:14) and ResultStage 1 (collect at 1gram_count.py:14) due to fetch failure 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 80) 17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster. 17/05/16 09:55:39 INFO scheduler.TaskSetManager: Task 0.0 in stage 1.2 (TID 417) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed. 
17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(8, cage.mycluster.com, 43545, None) 17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 8 (epoch 80) 17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 8 (133/143, false) 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 7 (epoch 80) 17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster. 17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(7, gardel.mycluster.com, 56871, None) 17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 7 successfully in removeExecutor 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 7 (epoch 80) 17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 7 (118/143, false) 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 80) 17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster. 17/05/16 09:55:39 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, rnode6.mycluster.com, 43712, None) 17/05/16 09:55:39 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 6 (epoch 80) 17/05/16 09:55:39 INFO scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 6 (93/143, false) 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Resubmitting failed stages 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14), which has no missing parents 17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 9.1 KB, free 3.0 GB) 17/05/16 09:55:39 INFO memory.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 5.8 KB, free 3.0 GB) 17/05/16 09:55:39 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 13.15.17.113:38097 (size: 5.8 KB, free: 3.0 GB) 17/05/16 09:55:39 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:996 17/05/16 09:55:39 INFO scheduler.DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (PairwiseRDD[3] at reduceByKey at 1gram_count.py:14) 17/05/16 09:55:39 INFO cluster.YarnClusterScheduler: Adding task set 0.4 with 50 tasks 17/05/16 09:55:41 INFO storage.BlockManagerMasterEndpoint: Registering block manager gardel.mycluster.com:56871 with 2004.6 MB RAM, BlockManagerId(7, gardel.mycluster.com, 56871, None) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on gardel.mycluster.com:56871 (size: 29.0 KB, free: 2004.6 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.6 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, free: 2004.6 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on gardel.mycluster.com:56871 (size: 3.7 KB, 
free: 2004.6 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:41 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on gardel.mycluster.com:56871 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:42 INFO storage.BlockManagerMasterEndpoint: Registering block manager tnode5.mycluster.com:50922 with 2004.6 MB RAM, BlockManagerId(4, tnode5.mycluster.com, 50922, None) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on tnode5.mycluster.com:50922 (size: 29.0 KB, free: 2004.6 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.6 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on tnode5.mycluster.com:50922 (size: 3.7 KB, free: 2004.6 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:42 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on tnode5.mycluster.com:50922 (size: 5.8 KB, free: 2004.5 MB) 17/05/16 09:55:42 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.4 (TID 450, rnode1.mycluster.com, executor 1, partition 7, NODE_LOCAL, 6061 bytes) ... 17/05/16 09:56:08 ERROR yarn.ApplicationMaster: User application exited with status 1 17/05/16 09:56:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1) ... 
17/05/16 09:56:08 INFO spark.SparkContext: Invoking stop() from shutdown hook 17/05/16 09:56:08 INFO server.ServerConnector: Stopped Spark@295a0399{HTTP/1.1}{0.0.0.0:0} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4650135{/stages/stage/kill,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7fa661e4{/jobs/job/kill,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@12cbc3d7{/api,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7ed7eff8{/,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@19854e4e{/static,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@d795813{/executors/threadDump/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7e534778{/executors/threadDump,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5555f52c{/executors/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7dfe96ab{/executors,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@6966dbd{/environment/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@28097bac{/environment,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@722feb6d{/storage/rdd/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5854bbf{/storage/rdd,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@332fc652{/storage/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@574bd394{/storage,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@71506d9a{/stages/pool/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4daa72b3{/stages/pool,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7a71edaa{/stages/stage/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@e89d2da{/stages/stage,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@70e609be{/stages/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@5f7ca7b3{/stages,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@77d95d4d{/jobs/job/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76b92f6f{/jobs/job,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@7330862d{/jobs/json,null,UNAVAILABLE,@Spark} 17/05/16 09:56:08 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@76828613{/jobs,null,UNAVAILABLE,@Spark} ... 
17/05/16 09:56:08 INFO ui.SparkUI: Stopped Spark web UI at http://13.15.17.113:35816 ... 17/05/16 09:56:08 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,3,ResultTask,FetchFailed(BlockManagerId(2, tnode3.mycluster.com, 44720, None),0,6,2,org.apache.spark.shuffle.FetchFailedException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:361) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:336) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504) at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Caused by: java.io.IOException: Connection from tnode3.mycluster.com/13.15.17.107:44720 closed at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128) at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:247) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769) at 
io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) ),org.apache.spark.scheduler.TaskInfo@4a03abd0,null) 17/05/16 09:56:08 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s). 17/05/16 09:56:08 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors 17/05/16 09:56:08 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 17/05/16 09:56:08 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices (serviceOption=None, services=List(), started=false) 17/05/16 09:56:08 ERROR server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) ... 17/05/16 09:56:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 17/05/16 09:56:08 INFO memory.MemoryStore: MemoryStore cleared 17/05/16 09:56:08 INFO storage.BlockManager: BlockManager stopped 17/05/16 09:56:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 17/05/16 09:56:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 17/05/16 09:56:08 INFO spark.SparkContext: Successfully stopped SparkContext 17/05/16 09:56:09 INFO util.ShutdownHookManager: Shutdown hook called 17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f 17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-11b09437-1b50-4953-a071-56ee1530b00f/pyspark-5a8a2155-4e7a-4099-aaf6-d96bed36c671 17/05/16 09:56:09 INFO util.ShutdownHookManager: Deleting directory /hdfs2/yarn/local/usercache/jacky/appcache/application_1494630332824_0048/spark-a16dd72a-70b4-4da4-96c3-7c655f9eb23
Created 05-19-2017 10:48 PM
OK, I think I have found the answer to this problem: it was due to a Netty version mismatch. I found one more piece of evidence as I dug into each node's logs. Here is the node's error log (which did not show up in the YARN log):
17/05/17 11:54:07 ERROR server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=790775716012, chunkIndex=11}, buffer=FileSegmentManagedBuffer{file=/hdfs2/yarn/local/usercache/chiachi/appcache/application_1495046487864_0001/blockmgr-7f6bd1f5-a0c7-43e5-a9a5-3bd2a9d8abbd/19/shuffle_0_54_0.data, offset=5963, length=1246}} to /10.16.0.9:44851; closing connection io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:107) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:651) at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:658) at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:716) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:706) at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:741) at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:895) at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:240) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:133) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89) ... 34 more
Why was there a Netty version discrepancy? I had upgraded both Hadoop and Spark, but I left HDFS running (with the old Netty version). After restarting HDFS and YARN, the error no longer occurred.
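For anyone who wants to confirm a similar mismatch, one quick sanity check is to list the Netty jars shipped with each installation on every node and compare their versions. A minimal sketch follows; the install locations are assumptions for illustration (the /opt/spark path matches the traceback above, and HADOOP_HOME falls back to a hypothetical /opt/hadoop), so adjust them to your layout:

import os

# Assumed install locations for illustration only; adjust to your layout.
search_roots = {
    "spark": "/opt/spark/jars",
    "hadoop": os.path.join(os.environ.get("HADOOP_HOME", "/opt/hadoop"), "share", "hadoop"),
}

for name, root in search_roots.items():
    print(name)
    for dirpath, _, filenames in os.walk(root):
        for filename in sorted(filenames):
            # Netty jars are typically named netty-3.x.y.Final.jar or netty-all-4.x.y.Final.jar.
            if filename.startswith("netty") and filename.endswith(".jar"):
                print("  " + os.path.join(dirpath, filename))

Jars on disk only tell part of the story, though: in my case the old Netty classes were still loaded by the long-running HDFS daemons, so restarting HDFS and YARN after the upgrade was what actually cleared the error.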
Created 11-16-2021 03:48 AM
Hello
Could you please share the steps to upgrade Netty on HDFS/Hadoop and Spark? I am facing the same issue and do not have the steps to upgrade the Netty version.
Thanks
Sukrut
Created 11-16-2021 10:28 PM
@BigData-suk, as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.
Regards,
Vidya Sargur,