
Active namenode crashes because of channel exceptions in journal nodes

We set up namenode HA with manual failover and 3 journal nodes.

The 3 journal nodes are laid out so that each namenode server runs one journal node, and the third journal node runs on a slave node.
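
For reference, the relevant part of our hdfs-site.xml looks roughly like the sketch below. The hostnames and the nameservice ID are placeholders rather than our real ones; only the layout (one journal node on each namenode host plus one on a slave node) matches what we actually run.

<!-- sketch of the QJM layout; hostnames and nameservice ID are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <!-- one journal node on each namenode host, the third on a slave node -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://nn1.example.com:8485;nn2.example.com:8485;slave1.example.com:8485/mycluster</value>
</property>
<property>
  <!-- local directory where each journal node keeps its edits -->
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/hdfs/journal</value>
</property>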

 

I often see the active namenode crash when it attempts to start a new log segment. It times out waiting on multiple journal nodes, which causes it to abort.

 

The journal nodes report the following exceptions:

 

 

####

Journal node 1:

2017-07-06 20:51:37,735 WARN org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.XX.XXX.98:56127 Call#7626920 Retry#0: output error
2017-07-06 20:51:37,736 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2621)
at org.apache.hadoop.ipc.Server.access$1900(Server.java:134)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:989)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1054)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2141)

 

####

Journal node 2:

2017-07-06 20:51:43,961 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485, call org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocol.startLogSegment from 192.XX.XXX.98:56757 Call#7626921 Retry#0: output error
2017-07-06 20:51:44,119 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2621)
at org.apache.hadoop.ipc.Server.access$1900(Server.java:134)
at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:989)
at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1054)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2141)

####

Active namenode log:

2017-07-06 20:51:07,678 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 1193538431
2017-07-06 20:51:13,679 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:14,679 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:15,680 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:16,682 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9004 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:17,683 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10005 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:18,684 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:19,685 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:20,687 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:21,688 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:22,688 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:23,689 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:24,691 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:25,675 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2017-07-06 20:51:25,692 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18014 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:26,692 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19015 ms (timeout=20000 ms) for a response for startLogSegment(1193538431). Succeeded so far: [192.XX.XXX.100:8485]
2017-07-06 20:51:27,678 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 1193538431 failed for required journal (JournalAndStream(mgr=QJM to [192.XX.XXX.98:8485, 192.XX.XXX.99:8485, 192.XX.XXX.100:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.

 

####
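
For context, the 20000 ms in the namenode log matches the stock QJM client timeouts. As far as I understand, these are controlled by hdfs-site.xml properties like the ones sketched below (the values shown are, I believe, the shipped defaults, which is what our logs reflect):

<!-- QJM client timeouts on the namenode side; 20000 ms is the default -->
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>20000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>20000</value>
</property>

I assume raising these would only buy headroom and would not explain why two journal nodes stall at the same moment.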

 

 

I have checked for GC pauses and I/O bottlenecks and couldn't find any. I have already increased the heap size and the number of RPC handlers on the journal nodes, with no improvement. There are no OS-level errors either.
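
In case it matters, the GC check was based on GC logs enabled with standard JVM options, roughly as in the sketch below. This is only the idea: on a plain Apache install it would go into hadoop-env.sh (assuming the hdfs script honours HADOOP_JOURNALNODE_OPTS, which I believe it does in 2.6); on CDH we set the equivalent through the JournalNode Java options in Cloudera Manager, and the log path below is just an example.

# sketch only: enable GC logging for the journal node JVM (JDK 7/8 flags)
export HADOOP_JOURNALNODE_OPTS="$HADOOP_JOURNALNODE_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/journalnode-gc.log"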

 

The exception at "sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)" makes me suspect a Java issue with socket handling. Could this be a Java bug?

 

I am also surprised that both journal nodes (running on different servers) hit the same exception at the same time.

 

We are on Cloudera CDH 5.5, with Hadoop/HDFS version 2.6.0.

Any input would be greatly appreciated. Thank you for reading through. Let me know if I can provide additional logs.

 

Thanks
