<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster. in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94989"&gt;@Meepoljd&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry to inform you, but Spark3 installation is not supported in HDP. To use Spark3, you need to use a CDP/CDE cluster.&lt;/P&gt;</description>
    <pubDate>Wed, 03 Apr 2024 06:03:55 GMT</pubDate>
    <dc:creator>RangaReddy</dc:creator>
    <dc:date>2024-04-03T06:03:55Z</dc:date>
    <item>
      <title>After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385037#M245598</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hi everyone, our Hadoop cluster has recently encountered a strange issue. After starting the cluster, Spark jobs run normally, but after running for more than a week, job submissions start to time out. Specifically, the jobs all reach the ACCEPTED state but fail after a two-minute timeout. The error log shows:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/134.1.1.3:45866 remote=hdp144019.bigdata.com/134.1.1.19:45454]&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;I took an app id and checked the log of the NodeManager that attempted to start the AM, and found that the AM was never started at all. At the time the job submission timed out, the NodeManager log reported the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;2024-03-15 00:04:17,668 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1607)) - couldn't find application application_1709910180593_38018 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process.&lt;/LI-CODE&gt;&lt;P&gt;This is accompanied by IPC-related errors:&lt;/P&gt;&lt;LI-CODE lang="java"&gt;2024-03-15 00:04:17,082 WARN  ipc.Server (Server.java:processResponse(1523)) - IPC Server handler 11 on 45454, call Call#31 Retry#0 org.apache.hadoop.yarn.api.ContainerManagementProtocolPB.startContainers from 134.1.1.8:32766: output error
2024-03-15 00:04:17,083 INFO  ipc.Server (Server.java:run(2695)) - IPC Server handler 11 on 45454 caught an exception
java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459)
	at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3250)
	at org.apache.hadoop.ipc.Server.access$1700(Server.java:137)
	at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1473)
	at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1543)
	at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2593)
	at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1615)
	at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:940)
	at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:774)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Since I couldn't see any issues from the logs, I'm now unsure how to troubleshoot further. Does anyone have any suggestions?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Spark Version: 3.3.2&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Hadoop Version: HDP-3.1.5.0-152&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Mar 2024 10:35:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385037#M245598</guid>
      <dc:creator>Meepoljd</dc:creator>
      <dc:date>2024-03-15T10:35:14Z</dc:date>
    </item>
    <item>
      <title>Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94989"&gt;@Meepoljd&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry to inform you, but Spark3 installation is not supported in HDP. To use Spark3, you need to use a CDP/CDE cluster.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Apr 2024 06:03:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2024-04-03T06:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385923#M245884</link>
      <description>&lt;DIV class="flex-1 overflow-hidden"&gt;&lt;DIV class="react-scroll-to-bottom--css-eruaw-79elbk h-full"&gt;&lt;DIV class="react-scroll-to-bottom--css-eruaw-1n7m0yu"&gt;&lt;DIV class="flex flex-col text-sm pb-9"&gt;&lt;DIV class="w-full text-token-text-primary"&gt;&lt;DIV class="px-4 py-2 justify-center text-base md:gap-6 m-auto"&gt;&lt;DIV class="flex flex-1 text-base mx-auto gap-3 md:px-5 lg:px-1 xl:px-5 md:max-w-3xl lg:max-w-[40rem] xl:max-w-[48rem] group final-completion"&gt;&lt;DIV class="relative flex w-full flex-col agent-turn"&gt;&lt;DIV class="flex-col gap-1 md:gap-3"&gt;&lt;DIV class="flex flex-grow flex-col max-w-full"&gt;&lt;DIV class="min-h-[20px] text-message flex flex-col items-start gap-3 whitespace-pre-wrap break-words [.text-message+&amp;amp;]:mt-5 overflow-x-auto"&gt;&lt;DIV class="markdown prose w-full break-words dark:prose-invert light"&gt;&lt;P&gt;I think I've found the reason for the problem. It's not related to the Spark version. I used the Java process analysis tool Arthas to investigate and found that the AM startup process was blocked at the creation of the Timeline client.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;And this problem might be due to our TimeLine service using an embedded HBase service. When we configured the HBase used by the TimeLine service to our production environment's HBase, the problem disappeared.&lt;/SPAN&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class="relative flex h-full flex-1 flex-col"&gt;&lt;DIV class="absolute bottom-full left-0 right-0"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class="flex w-full items-center"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 03 Apr 2024 07:17:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385923#M245884</guid>
      <dc:creator>Meepoljd</dc:creator>
      <dc:date>2024-04-03T07:17:59Z</dc:date>
    </item>
  </channel>
</rss>