<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question External Shuffle service connection idle for more than 120seconds while there are outstanding requests. in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/External-Shuffle-service-connection-idle-for-more-than/m-p/228044#M79907</link>
    <description>&lt;P&gt;I am running a spark job on yarn. The job runs properly on the Amazon EMR. (1 Master and 2 slaves with m4.xlarge)&lt;/P&gt;&lt;P&gt;I have set up similar infra using HDP 2.6 distribution on AWS ec2 machines. While running a spark job it gets stuck in between&lt;/P&gt;&lt;P&gt;and the following error is thrown by the container.&lt;/P&gt;&lt;PRE&gt;18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.210.150.150:44343)
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Got the output locations
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 4.822611 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 8.430244 ms
18/06/25 07:17:31 ERROR server.TransportChannelHandler: Connection to ip-10-210-150-180.********/10.210.150.180:7447 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
18/06/25 07:17:31 ERROR client.TransportResponseHandler: Still have 307 requests outstanding when connection from ip-10-210-150-180.********/10.210.150.180:7447 is closed
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 197 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 166 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)&lt;/PRE&gt;&lt;P&gt;As per the error, the connection b/n shuffle service and the container is idle for more than 120s. This happens mainly during shuffle and I have tried increasing the timeout to larger value but with no luck.&lt;/P&gt;&lt;P&gt;I am currently running spark on yarn cluster with the following configurations.&lt;/P&gt;&lt;P&gt; spark-defaults.conf on the master machine.&lt;/P&gt;&lt;PRE&gt;spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=ppv-qa12-tenant8-spark-cluster-master.periscope-solutions.local:18080
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.maxResultSize=0
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.memory=5g
spark.driver.memory=1g
spark.executor.cores=4
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;yarn-site.xml of slave machines&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;&amp;lt;configuration&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.application.classpath&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;/usr/hdp/current/spark2-client/aux/*,/etc/hadoop/conf,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;spark2_shuffle&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services.mapreduce_shuffle.class&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;org.apache.hadoop.mapred.ShuffleHandler&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services.spark2_shuffle.class&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;org.apache.spark.network.yarn.YarnShuffleService&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.container-manager.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.localizer.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;20&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.vmem-pmem-ratio&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;5&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.hostname&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;************&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.resource-tracker.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.scheduler.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.increment-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;32&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.increment-allocation-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;1&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.maximum-allocation-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;128&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.minimum-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;32&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.timeline-service.enabled&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.resource.cpu-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;8&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.resource.memory-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;11520&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.scheduler.maximum-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;11520&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.hostname&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;*************&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
&amp;lt;/configuration&amp;gt;








&lt;/PRE&gt;</description>
    <pubDate>Tue, 26 Jun 2018 13:13:57 GMT</pubDate>
    <dc:creator>prasannasaraf18</dc:creator>
    <dc:date>2018-06-26T13:13:57Z</dc:date>
    <item>
      <title>External Shuffle service connection idle for more than 120seconds while there are outstanding requests.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/External-Shuffle-service-connection-idle-for-more-than/m-p/228044#M79907</link>
      <description>&lt;P&gt;I am running a spark job on yarn. The job runs properly on the Amazon EMR. (1 Master and 2 slaves with m4.xlarge)&lt;/P&gt;&lt;P&gt;I have set up similar infra using HDP 2.6 distribution on AWS ec2 machines. While running a spark job it gets stuck in between&lt;/P&gt;&lt;P&gt;and the following error is thrown by the container.&lt;/P&gt;&lt;PRE&gt;18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.210.150.150:44343)
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Got the output locations
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 4.822611 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 8.430244 ms
18/06/25 07:17:31 ERROR server.TransportChannelHandler: Connection to ip-10-210-150-180.********/10.210.150.180:7447 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
18/06/25 07:17:31 ERROR client.TransportResponseHandler: Still have 307 requests outstanding when connection from ip-10-210-150-180.********/10.210.150.180:7447 is closed
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 197 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 166 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)&lt;/PRE&gt;&lt;P&gt;As per the error, the connection b/n shuffle service and the container is idle for more than 120s. This happens mainly during shuffle and I have tried increasing the timeout to larger value but with no luck.&lt;/P&gt;&lt;P&gt;I am currently running spark on yarn cluster with the following configurations.&lt;/P&gt;&lt;P&gt; spark-defaults.conf on the master machine.&lt;/P&gt;&lt;PRE&gt;spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=ppv-qa12-tenant8-spark-cluster-master.periscope-solutions.local:18080
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.maxResultSize=0
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.memory=5g
spark.driver.memory=1g
spark.executor.cores=4
&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;yarn-site.xml of slave machines&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;&amp;lt;configuration&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.application.classpath&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;/usr/hdp/current/spark2-client/aux/*,/etc/hadoop/conf,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;spark2_shuffle&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services.mapreduce_shuffle.class&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;org.apache.hadoop.mapred.ShuffleHandler&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services.spark2_shuffle.class&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;org.apache.spark.network.yarn.YarnShuffleService&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.container-manager.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.localizer.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;20&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.vmem-pmem-ratio&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;5&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.hostname&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;************&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.resource-tracker.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.scheduler.client.thread-count&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;64&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.increment-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;32&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.increment-allocation-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;1&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.maximum-allocation-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;128&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.scheduler.minimum-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;32&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.timeline-service.enabled&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.resource.cpu-vcores&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;8&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.resource.memory-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;11520&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.scheduler.maximum-allocation-mb&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;11520&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;yarn.nodemanager.hostname&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;*************&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
&amp;lt;/configuration&amp;gt;








&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Jun 2018 13:13:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/External-Shuffle-service-connection-idle-for-more-than/m-p/228044#M79907</guid>
      <dc:creator>prasannasaraf18</dc:creator>
      <dc:date>2018-06-26T13:13:57Z</dc:date>
    </item>
    <item>
      <title>Re: External Shuffle service connection idle for more than 120seconds while there are outstanding requests.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/External-Shuffle-service-connection-idle-for-more-than/m-p/228045#M79908</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Short Answer: &lt;/STRONG&gt;&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Turn off scatter gather&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Long Version:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;The data transfer b/n container and shuffle service happens through RPC Calls(ChunkFetchRequest, ChunkFetchSuccess and ChunkFetchFailure)&lt;/P&gt;&lt;P&gt;On further debugging with trace level logs, we found that RPC calls were indeed happening b/n the container and the shuffle service and after some time the RPC call's were abruptly suppressed(meaning no more RPC calls were logged) from both shuffle service and container.&lt;/P&gt;&lt;P&gt;On looking into kernel and system activity logs we found the following&lt;/P&gt;&lt;PRE&gt;xen_netfront: xennet: skb rides the rocket: 19 slots&lt;/PRE&gt;&lt;P&gt;That means that our ec2 machines were having network packet loss. &lt;/P&gt;&lt;P&gt;More info on this log can be found in the following thread&lt;/P&gt;&lt;P&gt;&lt;A href="http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html"&gt; http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;So we tried turning off the scatter-gather using the following command.&lt;/P&gt;&lt;PRE&gt;sudo ethtool -K eth0 sg off&lt;/PRE&gt;&lt;P&gt;The error was gone after that.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Jul 2018 01:28:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/External-Shuffle-service-connection-idle-for-more-than/m-p/228045#M79908</guid>
      <dc:creator>prasannasaraf18</dc:creator>
      <dc:date>2018-07-06T01:28:01Z</dc:date>
    </item>
  </channel>
</rss>

