Created 06-08-2016 02:50 PM
Hello,
I have an HA cluster (HDP-2.4.0.0-169) in a dev environment, with 4 nodes virtualized with vSphere.
The NameNodes of my cluster shut down regularly, even when there is no load on them.
Name Node log:
2016-06-02 02:23:18,682 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.49:8485, 192.168.1.47:8485, 192.168.1.46:8485], stream=QuorumOutputStream starting at txid 2549780)) java.io.IOException: Timed out waiting 60000ms for a quorum of nodes to respond.
It seems to come from the JournalNodes, but I can't find a way to fix the issue. Any idea where to look?
Thanks!
Here are the detailed NameNode logs from when it shut down:
2016-06-02 02:23:12,610 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:12,719 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 54038 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:13,720 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 55039 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:14,721 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 56040 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:15,611 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected Total=0 Reset bookmarks? true
2016-06-02 02:23:15,611 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:15,722 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 57041 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:16,723 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 58043 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:17,725 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 59044 ms (timeout=60000 ms) for a response for sendEdits. No responses yet.
2016-06-02 02:23:18,611 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected Total=0 Reset bookmarks? true
2016-06-02 02:23:18,611 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:18,682 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.1.49:8485, 192.168.1.47:8485, 192.168.1.46:8485], stream=QuorumOutputStream starting at txid 2549780))
java.io.IOException: Timed out waiting 60000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:647)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2470)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2335)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:688)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
2016-06-02 02:23:18,682 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 2549780
2016-06-02 02:23:21,611 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected Total=0 Reset bookmarks? true
2016-06-02 02:23:21,611 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:24,612 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected Total=0 Reset bookmarks? true
2016-06-02 02:23:24,612 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2016-06-02 02:23:24,625 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 65944ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.47:8485
2016-06-02 02:23:24,773 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 66092ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.49:8485
2016-06-02 02:23:24,898 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(406)) - Took 66216ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.46:8485
2016-06-02 02:23:25,201 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-06-02 02:23:25,258 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at d1hdpmaster01/192.168.1.45
************************************************************/
2016-06-02 02:23:25,635 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(212)) - Unable to send metrics to collector by address:http://d1hdpmaster01:6188/ws/v1/timeline/metric
Created 06-08-2016 02:54 PM
Took 65944ms to send a batch of 1 edits (205 bytes) to remote journal 192.168.1.47:8485
You either have a serious network problem between the nodes, or the underlying disk writes are very slow.
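To narrow it down, here are a few quick checks you can run as a rough sketch. Adjust the IPs to your actual JournalNode hosts (taken from your log above) and the directory to whatever dfs.journalnode.edits.dir points at; /hadoop/hdfs/journal below is just the usual HDP default.

# Synchronous write latency on the JournalNode edits disk (run on each JournalNode host;
# writes a small throwaway file on the same mount as dfs.journalnode.edits.dir)
dd if=/dev/zero of=/hadoop/hdfs/journal/dd-latency-test bs=512 count=1000 oflag=dsync
rm -f /hadoop/hdfs/journal/dd-latency-test

# Round-trip latency from the NameNode to each JournalNode
for jn in 192.168.1.46 192.168.1.47 192.168.1.49; do ping -c 5 "$jn"; done

# Disk utilization and wait times while the cluster is otherwise idle
iostat -x 5 3

If dd reports write rates in the KB/s range, or ping shows multi-second spikes, that points at the VM storage or the virtual network rather than HDFS itself.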
Created 06-08-2016 05:37 PM
How many JournalNodes do you have in the cluster? Make sure you have an odd number. When the NameNode cannot reach a quorum of them, it fails with a timeout like this:
12/10/01 17:34:18 WARN namenode.FSEditLog: Unable to determine input streams from QJM to [10.0.1.10:8485, 10.0.1.10:8486, 10.0.1.10:8487]. Skipping. java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
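If you want to confirm which JournalNodes the NameNode is actually writing to, one quick check (run on a NameNode host) is to read the shared edits URI from the configuration:

# Prints the configured quorum, typically in the form qjournal://jn1:8485;jn2:8485;jn3:8485/<nameservice>
hdfs getconf -confKey dfs.namenode.shared.edits.dir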
Created 06-10-2016 07:25 AM
I have 3 JournalNodes in my cluster, but they don't seem to be failing.
Created 06-08-2016 07:06 PM
Hello @Xavier LIEUTIER.
These log messages indicate that there was a timeout condition when the NameNode attempted to call the JournalNodes. The NameNode must successfully call a quorum of JournalNodes: at least 2 out of 3. This means that the call timed out to at least 2 out of 3 of them. This is a fatal condition for the NameNode, so by design, it aborts.
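Before digging into root causes, it can help to confirm what timeout the NameNode is actually working with (the 60000 ms shown in your log) and to look at the JournalNode logs around the failure time. A rough sketch, assuming the usual HDP log layout on the JournalNode hosts:

# The QJM write timeout the NameNode is configured with
hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms

# On each JournalNode host, look for warnings around 02:23 (e.g. long JVM/GC pauses);
# the log path here assumes the standard HDP location and naming
grep -i "pause" /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-*.log

Keep in mind that raising the timeout only masks slow JournalNodes or a slow network, so treat it as a workaround rather than a fix.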
There are multiple potential reasons for this timeout condition. Reviewing logs from the NameNodes and JournalNodes would likely reveal more details. There are several common causes to watch for: