Created 06-08-2016 02:50 PM
Hello,
I have an HA cluster (HDP-2.4.0.0-169) in a dev environment, with 4 nodes virtualized with vSphere.
The NameNodes of my cluster shut down regularly, even when there is no load on the cluster.
NameNode log:
2016-06-02 02:23:18,682 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.49:8485, 192.168.1.47:8485, 192.168.1.46:8485], stream=QuorumOutputStream starting at txid 2549780)) java.io.IOException: Timed out waiting 60000ms for a quorum of nodes to respond.
It seems to come from the JournalNodes, but I can't find a way to fix the issue. Any idea where to look?
Thanks!
Here are the detailed NameNode logs from when it shut down:
2016-06-02 02:23:12,610 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:12,719 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 54038 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:13,720 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 55039 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:14,721 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 56040 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:15,611 INFO  BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected  Total=0 Reset bookmarks? true
2016-06-02 02:23:15,611 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:15,722 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 57041 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:16,723 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 58043 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:17,725 WARN  client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 59044 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:18,611 INFO  BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected  Total=0 Reset bookmarks? true
2016-06-02 02:23:18,611 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:18,682 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.49:8485, 192.168.1.47:8485, 192.168.1.46:8485], stream=QuorumOutputStream starting at txid 2549780))
java.io.IOException: Timed out waiting 60000ms for a quorum of nodes to respond.
                at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
                at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
                at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
                at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
                at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
                at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
                at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
                at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
                at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2470)
                at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2335)
                at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:688)
                at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
                at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
                at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
                at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
                at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
                at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-06-02 02:23:18,682 WARN  client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 2549780
2016-06-02 02:23:21,611 INFO  BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected  Total=0 Reset bookmarks? true
2016-06-02 02:23:21,611 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:24,612 INFO  BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected  Total=0 Reset bookmarks? true
2016-06-02 02:23:24,612 INFO  BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:24,625 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 65944ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.47:8485
2016-06-02 02:23:24,773 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 66092ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.49:8485
2016-06-02 02:23:24,898 WARN  client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 66216ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.46:8485
2016-06-02 02:23:25,201 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-06-02 02:23:25,258 INFO  namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at d1hdpmaster01/192.168.1.45
************************************************************/
2016-06-02 02:23:25,635 WARN  timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(212)) - Unable to send metrics to collector by address:http://d1hdpmaster01:6188/ws/v1/timeline/metric
		Created 06-08-2016 02:54 PM
Took 65944ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.47:8485
You either have a serious network problem between the nodes, or the underlying disk writes are very slow.
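A quick way to test both theories (the IPs below are the JournalNodes from your log; the edits directory path is an assumption, so check dfs.journalnode.edits.dir in hdfs-site.xml first):

# From the NameNode host: round-trip latency to each JournalNode
ping -c 10 192.168.1.46
ping -c 10 192.168.1.47
ping -c 10 192.168.1.49

# On each JournalNode host: watch per-device latency while the cluster runs
iostat -x 5

# On each JournalNode host: rough sequential-write test on the disk that
# holds the edits directory (writes 512 MB, bypassing the page cache)
dd if=/dev/zero of=/hadoop/hdfs/journal/ddtest bs=1M count=512 oflag=direct

If ping shows loss or multi-second spikes, or dd reports throughput in the single-digit MB/s range, you have found your culprit.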
Created 06-08-2016 05:37 PM
How many JournalNodes do you have in the cluster? Make sure you have an odd number; a quick way to check is shown below the example error.
12/10/01 17:34:18 WARN namenode.FSEditLog: Unable to determine input streams from QJM to [10.0.1.10:8485, 10.0.1.10:8486, 10.0.1.10:8487]. Skipping. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
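To confirm how many JournalNodes the NameNode is actually writing to, you can read the shared edits URI (a standard HDFS property; the host list after qjournal:// is your quorum):

# Prints something like qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster
hdfs getconf -confKey dfs.namenode.shared.edits.dir

Note that the error quoted above lists three ports on a single host (10.0.1.10), which gives you three JournalNode processes but only one point of failure.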
Created 06-10-2016 07:25 AM
I have 3 JournalNodes in my cluster, but they don't seem to fail.
Created 06-08-2016 07:06 PM
Hello @Xavier LIEUTIER.
These log messages indicate that there was a timeout condition when the NameNode attempted to call the JournalNodes. The NameNode must successfully call a quorum of JournalNodes: at least 2 out of 3. This means that the call timed out to at least 2 out of 3 of them. This is a fatal condition for the NameNode, so by design, it aborts.
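One quick liveness check is to hit each JournalNode's built-in web endpoint from the NameNode host (8480 is the default JournalNode HTTP port; adjust if you have overridden dfs.journalnode.http-address):

# A slow or hanging response here points at that JournalNode or the
# network path to it; -m 10 caps each request at 10 seconds
curl -m 10 http://192.168.1.46:8480/jmx
curl -m 10 http://192.168.1.47:8480/jmx
curl -m 10 http://192.168.1.49:8480/jmx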
There are multiple potential reasons for this timeout condition, and reviewing the logs from both the NameNodes and the JournalNodes would likely reveal more details. There are several common causes to watch for: a long JVM garbage collection pause on the NameNode, which stalls the threads that send edits to the JournalNodes; slow disk I/O on the JournalNode hosts; and network connectivity problems between the NameNode and the JournalNodes. On a virtualized cluster like yours, CPU or I/O contention from the hypervisor can produce any of these symptoms even when the cluster itself is idle.
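If GC pauses are the suspect, enabling GC logging on the NameNode and JournalNode JVMs will confirm it, and raising the QJM write timeout can buy headroom while you track down the root cause. A minimal sketch, assuming the standard hadoop-env.sh hooks of HDP 2.4 (the 120000 ms value is illustrative only; your log already shows a 60000 ms timeout, up from the stock 20000 ms default):

# hadoop-env.sh: GC logging for the NameNode (Java 7/8 flags; the
# analogous HADOOP_JOURNALNODE_OPTS hook covers the JournalNodes)
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop/hdfs/namenode-gc.log"

# hdfs-site.xml (or the equivalent Ambari setting): how long the NameNode
# waits for a quorum of JournalNodes to acknowledge an edit batch
#   dfs.qjournal.write-txns.timeout.ms = 120000

A GC pause longer than a few seconds in any of those logs around 02:22-02:23 would line up with the "No responses yet" warnings above.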