<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster. in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94989"&gt;@Meepoljd&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry to inform you, but Spark3 installation is not supported in HDP. To use Spark3, you need to use a CDP/CDE cluster.&lt;/P&gt;</description>
    <pubDate>Wed, 03 Apr 2024 06:03:55 GMT</pubDate>
    <dc:creator>RangaReddy</dc:creator>
    <dc:date>2024-04-03T06:03:55Z</dc:date>
    <item>
      <title>After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385037#M245598</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hi everyone, our Hadoop cluster has recently encountered a strange issue. After starting the cluster, Spark jobs run normally, but after running for more than a week, job submissions start to time out. Specifically, the jobs all reach the ACCEPTED state but fail after a two-minute timeout. The error log shows:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/134.1.1.3:45866 remote=hdp144019.bigdata.com/134.1.1.19:45454]&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;I took an app id and checked the log of the NodeManager that attempted to start the AM, and found that the AM was never started at all. At the time the job submission timed out, the NodeManager log reported the following:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="java"&gt;2024-03-15 00:04:17,668 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1607)) - couldn't find application application_1709910180593_38018 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process.&lt;/LI-CODE&gt;&lt;P&gt;This is accompanied by IPC-related errors:&lt;/P&gt;&lt;LI-CODE lang="java"&gt;2024-03-15 00:04:17,082 WARN  ipc.Server (Server.java:processResponse(1523)) - IPC Server handler 11 on 45454, call Call#31 Retry#0 org.apache.hadoop.yarn.api.ContainerManagementProtocolPB.startContainers from 134.1.1.8:32766: output error
2024-03-15 00:04:17,083 INFO  ipc.Server (Server.java:run(2695)) - IPC Server handler 11 on 45454 caught an exception
java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459)
	at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3250)
	at org.apache.hadoop.ipc.Server.access$1700(Server.java:137)
	at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1473)
	at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1543)
	at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2593)
	at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1615)
	at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:940)
	at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:774)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Since I couldn't see any issues from the logs, I'm now unsure how to troubleshoot further. Does anyone have any suggestions?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Spark Version: 3.3.2&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Hadoop Version: HDP-3.1.5.0-152&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 15 Mar 2024 10:35:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385037#M245598</guid>
      <dc:creator>Meepoljd</dc:creator>
      <dc:date>2024-03-15T10:35:14Z</dc:date>
    </item>
    <item>
      <title>Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/94989"&gt;@Meepoljd&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry to inform you, but Spark3 installation is not supported in HDP. To use Spark3, you need to use a CDP/CDE cluster.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Apr 2024 06:03:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385912#M245879</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2024-04-03T06:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: After running for a period of time, job submissions timeout in the Spark on YARN cluster.</title>
      <link>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385923#M245884</link>
      <description>&lt;DIV class="flex-1 overflow-hidden"&gt;&lt;DIV class="react-scroll-to-bottom--css-eruaw-79elbk h-full"&gt;&lt;DIV class="react-scroll-to-bottom--css-eruaw-1n7m0yu"&gt;&lt;DIV class="flex flex-col text-sm pb-9"&gt;&lt;DIV class="w-full text-token-text-primary"&gt;&lt;DIV class="px-4 py-2 justify-center text-base md:gap-6 m-auto"&gt;&lt;DIV class="flex flex-1 text-base mx-auto gap-3 md:px-5 lg:px-1 xl:px-5 md:max-w-3xl lg:max-w-[40rem] xl:max-w-[48rem] group final-completion"&gt;&lt;DIV class="relative flex w-full flex-col agent-turn"&gt;&lt;DIV class="flex-col gap-1 md:gap-3"&gt;&lt;DIV class="flex flex-grow flex-col max-w-full"&gt;&lt;DIV class="min-h-[20px] text-message flex flex-col items-start gap-3 whitespace-pre-wrap break-words [.text-message+&amp;amp;]:mt-5 overflow-x-auto"&gt;&lt;DIV class="markdown prose w-full break-words dark:prose-invert light"&gt;&lt;P&gt;I think I've found the reason for the problem. It's not related to the Spark version. I used the Java process analysis tool Arthas to investigate and found that the AM startup process was blocked at the creation of the Timeline client.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;And this problem might be due to our TimeLine service using an embedded HBase service. When we configured the HBase used by the TimeLine service to our production environment's HBase, the problem disappeared.&lt;/SPAN&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class="relative flex h-full flex-1 flex-col"&gt;&lt;DIV class="absolute bottom-full left-0 right-0"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class="flex w-full items-center"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 03 Apr 2024 07:17:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/After-running-for-a-period-of-time-job-submissions-timeout/m-p/385923#M245884</guid>
      <dc:creator>Meepoljd</dc:creator>
      <dc:date>2024-04-03T07:17:59Z</dc:date>
    </item>
  </channel>
</rss>