Active namenode crashes because of channel exceptions in journal node

Explorer

We set up namenode HA with manual failover and 3 journal nodes.

The 3 journal nodes are laid out so that each namenode server runs one journal node, and the third journal node runs on a slave node.
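
For context, the quorum-journal portion of our hdfs-site.xml looks roughly like the sketch below. The nameservice name, hostnames, and directory path are placeholders, not our real values:

<!-- Quorum Journal Manager layout; hostnames and paths below are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- one journal node on each namenode host, the third on a slave node -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://namenode1:8485;namenode2:8485;slave1:8485/mycluster</value>
</property>
<property>
  <!-- local directory where each journal node stores its edits -->
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/dfs/jn</value>
</property>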

 

I often see the active namenode crashing when it attempts to start a new log segment. It times out waiting on multiple journal nodes, which causes it to crash.

 

The journal nodes report the following exceptions:

####

Journal node 1:

2017-07-06 20:51:37,735 WARN org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.XX.XXX.98:56127 Call#7626920 Retry#0: output error
2017-07-06 20:51:37,736 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2621)
at org.apache.hadoop.ipc.Server.access$1900(Server.java:134)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:989)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1054)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2141)

 

####

Journal node 2:

2017-07-06 20:51:43,961 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.XX.XXX.98:56757 Call#7626921 Retry#0: output error
2017-07-06 20:51:44,119 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2621)
at org.apache.hadoop.ipc.Server.access$1900(Server.java:134)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:989)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1054)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2141)

####

Active namenode:

2017-07-06 20:51:07,678 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 1193538431
2017-07-06 20:51:13,679 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:14,679 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:15,680 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:16,682 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:17,683 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10005 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:18,684 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:19,685 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:20,687 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:21,688 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:22,688 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:23,689 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:24,691 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:25,675 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2017-07-06 20:51:25,692 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:26,692 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:27,678 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 1193538431 failed for required journal (JournalAndStream(mgr=QJM to [192.XX.XXX.98:8485, 192.XX.XXX.99:8485, 192.XX.XXX.100:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.

 

####
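
For reference, the 20000 ms in the error above matches the QJM client-side defaults in Hadoop 2.x, which come from the dfs.qjournal.* timeout properties in hdfs-site.xml. If we wanted to raise them as a stopgap, it would look roughly like the sketch below; the 60000 ms values are just illustrative, and this would only buy time rather than fix whatever is stalling the journal nodes:

<!-- QJM client timeouts on the namenode side; defaults are 20000 ms.
     The 60000 ms values below are illustrative, not a recommendation. -->
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>60000</value>
</property>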

 

 

I have checked for GC pauses and I/O bottlenecks and couldn't find any. I have already increased the heap size and the RPC handlers on the journal nodes, with no improvement. There are no OS-level exceptions either.

 

The exception from "sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)" makes me suspicious of a Java issue with socket handling. Could this be a Java bug?

 

I am also surprised that both journal nodes (running on different servers) ran into the same exception at the same time.

 

We are on Cloudera CDH 5.5, with Hadoop/HDFS version 2.6.0.

Any inputs would be greatly appreciated. Thank you for reading through. Let me know if I can provide additional logs.

 

Thanks

2 Replies

New Contributor

Hi, I am also facing the same issue. Did you find a solution for this? Regards, Abhishek

Expert Contributor

Besides Hadoop and the JVM, would you please also check the hardware?

 

Specifically, the JN's disk volume may be slow (check the JN log for messages indicating slow writes), or the network connection may be the problem.