
After running for a period of time, job submissions time out in the Spark on YARN cluster.

Rising Star

Hi everyone, our Hadoop cluster has recently run into a strange issue. Right after the cluster starts, Spark jobs run normally, but once it has been running for more than a week, job submissions start to time out. Specifically, the jobs all stay in the ACCEPTED state and fail with a timeout after 2 minutes. The error log shows:

Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/134.1.1.3:45866 remote=hdp144019.bigdata.com/134.1.1.19:45454]

I picked one application ID and checked the log of the NodeManager that should have started the AM, and found that the AM was never started at all. Around the time the job submission timed out, the NodeManager reported the following:

2024-03-15 00:04:17,668 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:handle(1607)) - couldn't find application application_1709910180593_38018 while processing FINISH_APPS event. The ResourceManager allocated resources for this application to the NodeManager but no active containers were found to process.

This was accompanied by IPC-related errors:

2024-03-15 00:04:17,082 WARN  ipc.Server (Server.java:processResponse(1523)) - IPC Server handler 11 on 45454, call Call#31 Retry#0 org.apache.hadoop.yarn.api.ContainerManagementProtocolPB.startContainers from 134.1.1.8:32766: output error
2024-03-15 00:04:17,083 INFO  ipc.Server (Server.java:run(2695)) - IPC Server handler 11 on 45454 caught an exception
java.nio.channels.ClosedChannelException
	at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:459)
	at org.apache.hadoop.ipc.Server.channelWrite(Server.java:3250)
	at org.apache.hadoop.ipc.Server.access$1700(Server.java:137)
	at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1473)
	at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1543)
	at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2593)
	at org.apache.hadoop.ipc.Server$Connection.access$300(Server.java:1615)
	at org.apache.hadoop.ipc.Server$RpcCall.doResponse(Server.java:940)
	at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:774)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:885)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

Since I can't see what's wrong from the logs, I'm not sure how to troubleshoot further. Does anyone have any suggestions?

Spark Version: 3.3.2

Hadoop Version: HDP-3.1.5.0-152

1 ACCEPTED SOLUTION

Rising Star

I think I've found the cause of the problem, and it is not related to the Spark version. I used the Java process analysis tool Arthas to investigate, and found that the AM startup process was blocked while creating the Timeline client.
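
In case it helps anyone else, the Arthas session roughly looked like the sketch below. The PID and thread ID are placeholders rather than real values from our cluster, and you attach to whichever JVM is hanging during the AM launch:

# attach Arthas to the target JVM (pick the process when prompted, or pass its PID)
java -jar arthas-boot.jar <pid-of-target-jvm>

# inside the Arthas console:
thread               # list all threads with their states
thread -b            # show the thread holding a lock that blocks other threads
thread <thread-id>   # print the full stack of the suspicious thread

It was the stack printed by "thread <thread-id>" that showed the startup thread stuck inside the Timeline client creation.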

The problem seems to be caused by our Timeline Service using the embedded HBase service. Once we pointed the Timeline Service at our production environment's HBase instead, the problem disappeared.
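
For reference, Timeline Service v2 can be pointed at an external HBase through yarn-site.xml. A minimal sketch is below; the property name comes from the Hadoop 3.x ATSv2 documentation, the path is only an example, and on an HDP cluster you would normally make this change through Ambari rather than by editing the file directly:

<!-- yarn-site.xml: have ATSv2 read its HBase connection settings from an external hbase-site.xml -->
<property>
  <name>yarn.timeline-service.hbase.configuration.file</name>
  <value>file:/etc/hbase/conf/hbase-site.xml</value>
</property>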

 
 


2 REPLIES

Master Collaborator

Hi @Meepoljd 

Sorry to inform you, but Spark 3 installation is not supported on HDP. To use Spark 3, you need to use a CDP/CDE cluster.
