Created 11-05-2015 05:07 AM
Env: HDP 2.1.10 / NameNode HA
Randomly, the Active NameNode hits an issue where the ZKFC loses its connection to the NameNode's 8020 port. The NameNode process appears to be hung at that point, and since fencing does not actually kill the process, the NameNode later wakes up and shuts itself down after it attempts to write to the JournalNodes. That write fails because the JournalNodes' epoch was incremented when the previously standby NameNode became Active.
For reference, log excerpts from the NameNodes, ZKFCs and a JournalNode are attached: sequence-of-events-1.txt
We do not see much time spent in GC at the time of the problem.
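In case it helps reproduce the check, GC pauses can be confirmed or ruled out with standard HotSpot GC logging appended via HADOOP_NAMENODE_OPTS in hadoop-env.sh; the log path below is a placeholder:

# hadoop-env.sh: enable GC logging on the NameNode JVM (log path is a placeholder)
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hadoop/hdfs/namenode-gc.log"

# After a restart, look for long stop-the-world pauses around the incident windows
grep "Total time for which application threads were stopped" /var/log/hadoop/hdfs/namenode-gc.log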
The issue was encountered three times on 28-10, and we have logs for two of the occurrences: 10:17 and 11:19. We find the following entries in the NameNode log around the time of the problem:
(1)
2015-10-28 10:15:54,718 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3004)) - DIR* completeFile: /data/jvc/prd02/raw/geact/current/.distcp.tmp.attempt_1437114942510_80591_m_000002_0 is closed by DFSClient_attempt_1437114942510_80591_m_000002_0_1495077679_1
2015-10-28 10:15:54,911 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds
2015-10-28 10:17:00,577 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128194 block(s) in 65666 millisecond(s).
2015-10-28 10:17:00,578 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 65666 milliseconds
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 30 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from <IP-A>:53951 Call#400424 Retry#0
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 2 on 8020: skipped org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from <IP-B>:52014 Call#47097 Retry#0
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 21 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from <IP-B>:52031 Call#413318 Retry#0
(2)
2015-10-28 11:18:47,007 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:18:47,906 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:18:48,387 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128182 block(s) in 44172 millisecond(s).
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 44172 milliseconds
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37934 millisecond(s).
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37934 milliseconds
2015-10-28 11:20:10,495 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from <IP-A>:58463 Call#406856 Retry#0
2015-10-28 11:20:10,496 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37906 millisecond(s).
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37906 milliseconds
2015-10-28 11:20:48,400 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0
2015-10-28 11:20:48,401 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
- The HDFS audit log shows the following operations/min counts:
(1)
10768 2015-10-28 10:11
1068 2015-10-28 10:12
26569 2015-10-28 10:13
6003 2015-10-28 10:14
8305 2015-10-28 10:15
46 2015-10-28 10:17
28498 2015-10-28 10:26
(2)
15098 2015-10-28 11:15
20219 2015-10-28 11:16
21364 2015-10-28 11:17
17884 2015-10-28 11:18
22 2015-10-28 11:19
- But can the blockmanagement.CacheReplicationMonitor scan cause this problem? (Note that the problem is not seen during every CacheReplicationMonitor scan.)
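One way to narrow this down, sketched below on the assumption that the monitor thread appears in a thread dump under the name CacheReplicationMonitor and that the standard HDP pid file location applies, is to thread-dump the NameNode while a rescan is in progress and check whether the IPC handler threads are parked waiting on the namesystem lock:

# Thread-dump the NameNode while a CacheReplicationMonitor rescan is running
# (pid file path is an assumption; adjust for your install)
NN_PID=$(cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid)
jstack -l "$NN_PID" > /tmp/nn-jstack.$(date +%s).txt

# What is the monitor thread doing, and does it hold a lock for the whole scan?
grep -A 20 "CacheReplicationMonitor" /tmp/nn-jstack.*.txt

# Are the 8020 handler threads (which also serve the ZKFC health checks) parked waiting?
grep -A 10 "IPC Server handler" /tmp/nn-jstack.*.txt | grep "parking to wait"

If the handlers are blocked for the duration of a long rescan, that would line up with the ZKFC losing its health-check responses and with the skipped getServiceStatus/transitionToStandby calls above.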
Created 11-05-2015 03:36 PM
@rmaruthiyodan@hortonworks.com
Please see the following information and check whether your nodes are healthy and the connectivity between them is stable.
'It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager.' Further reading can be found here: http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.htm...
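To make that last point concrete, a minimal sketch of a fencing configuration in hdfs-site.xml, assuming passwordless SSH from the ZKFC hosts and a placeholder key path:

<!-- hdfs-site.xml: try to fence the old active over SSH; fall back so failover still proceeds -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <!-- placeholder key path; must be readable by the ZKFC user -->
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>

sshfence SSHes to the old active and kills the process listening on the service port, while the shell(/bin/true) fallback lets failover proceed even if the hung host cannot be reached, which is safe with QJM since the old active can no longer write to the JournalNodes anyway.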