Support Questions
Find answers, ask questions, and share your expertise

zkfc loses NN connection during blockmanagement.CacheReplicationMonitor scanning

Cloudera Employee

Env: HDP 2.1.10 / NameNode HA

Randomly, the Active NameNode hits an issue where the ZKFC loses its connection to the NameNode's 8020 port. The NameNode process appears to be hung at this time, and since fencing does not actually kill the process, the NameNode later wakes up and shuts itself down after attempting to write to the JournalNodes. That write fails because the JournalNodes' epoch was incremented when the previously standby NameNode became Active.
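For context, the window in which the ZKFC declares the NameNode unresponsive is governed by the health-monitor settings in core-site.xml. The values below are the stock Hadoop defaults, shown only as a sketch of where to look (verify the defaults against your HDP version):

```xml
<!-- core-site.xml: ZKFC health-check tuning (defaults shown) -->
<property>
  <name>ha.health-monitor.check-interval.ms</name>
  <!-- How often the ZKFC issues a monitorHealth() RPC to the NN -->
  <value>1000</value>
</property>
<property>
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <!-- How long a health-check RPC may block before the NN is
       considered unresponsive and a failover is initiated -->
  <value>45000</value>
</property>
```

If the NameNode's IPC handlers are blocked for longer than the rpc-timeout, the ZKFC treats the NN as dead even though the process is still running.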

For reference, log excerpts from the NameNodes, ZKFCs, and a JournalNode are attached: sequence-of-events-1.txt

We do not see significant GC pause time around the time of the problem.

The issue was encountered three times on 2015-10-28, and we have logs for two of the occurrences (10:17 and 11:19). We find the following entries in the NameNode log around the time of the problem:

(1) 
2015-10-28 10:15:54,718 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3004)) - DIR* completeFile: /data/jvc/prd02/raw/geact/current/.distcp.tmp.at 
tempt_1437114942510_80591_m_000002_0 is closed by DFSClient_attempt_1437114942510_80591_m_000002_0_1495077679_1 
2015-10-28 10:15:54,911 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds 
2015-10-28 10:17:00,577 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128194 block(s) in 65666 millisecond(s). 
2015-10-28 10:17:00,578 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 65666 milliseconds 
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 30 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from <IP-A>:53951 Call#400424 Retry#0 
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 2 on 8020: skipped org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from <IP-B>:52014 Call#47097 Retry#0 
2015-10-28 10:17:00,578 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 21 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from <IP-B>:52031 Call#413318 Retry#0 

(2)

2015-10-28 11:18:47,007 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0 
2015-10-28 11:18:47,906 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0 
2015-10-28 11:18:48,387 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 30001 milliseconds 
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128182 block(s) in 44172 millisecond(s). 
2015-10-28 11:19:32,560 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 44172 milliseconds 
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37934 millisecond(s). 
2015-10-28 11:20:10,493 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37934 milliseconds 
2015-10-28 11:20:10,495 INFO ipc.Server (Server.java:run(1990)) - IPC Server handler 5 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from <IP-A>:58463 Call#406856 Retry#0 
2015-10-28 11:20:10,496 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0 
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(202)) - Scanned 7859 directive(s) and 128173 block(s) in 37906 millisecond(s). 
2015-10-28 11:20:48,399 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(177)) - Rescanning after 37906 milliseconds 
2015-10-28 11:20:48,400 DEBUG BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1407)) - BLOCK* neededReplications = 0 pendingReplications = 0 
2015-10-28 11:20:48,401 INFO namenode.FSNamesystem (FSNamesystem.java:listCorruptFileBlocks(6216)) - list corrupt file blocks returned: 0
- The HDFS audit log shows the following operations-per-minute counts:
(1)
10768  2015-10-28 10:11
 1068  2015-10-28 10:12
26569  2015-10-28 10:13
 6003  2015-10-28 10:14
 8305  2015-10-28 10:15
   46  2015-10-28 10:17
28498  2015-10-28 10:26
(2)
15098  2015-10-28 11:15
20219  2015-10-28 11:16
21364  2015-10-28 11:17
17884  2015-10-28 11:18
   22  2015-10-28 11:19
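The per-minute counts above can be reproduced with a short script over the audit log. The function name and sample lines below are illustrative, and the timestamp layout is assumed to be the stock log4j prefix (`YYYY-MM-DD HH:MM:SS,mmm` as the first two fields):

```python
from collections import Counter

def ops_per_minute(lines):
    """Count HDFS audit events per minute, assuming each line starts
    with a 'YYYY-MM-DD HH:MM:SS,mmm' log4j timestamp."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # Keep only 'YYYY-MM-DD HH:MM' as the bucket key
            counts[parts[0] + " " + parts[1][:5]] += 1
    return counts

sample = [
    "2015-10-28 10:15:01,001 INFO FSNamesystem.audit: allowed=true cmd=open ...",
    "2015-10-28 10:15:02,002 INFO FSNamesystem.audit: allowed=true cmd=create ...",
    "2015-10-28 10:17:30,003 INFO FSNamesystem.audit: allowed=true cmd=open ...",
]
print(ops_per_minute(sample))
```

A sharp drop in a single minute (46 ops at 10:17, 22 ops at 11:19 above) is consistent with the NameNode serving almost no requests during that window.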

- But can the blockmanagement.CacheReplicationMonitor scan cause this problem? (Note that the problem is not seen during every CacheReplicationMonitor scan.)
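As a rough illustration of how a long rescan could produce the "skipped ... Call#" lines: if the monitor holds the namesystem lock for the whole scan, RPC handlers (including the ZKFC's health check) block on that lock, and by the time a handler finally acquires it the client has already given up, so the server skips the call. The toy model below is not Hadoop's actual IPC code, just a minimal sketch of that lock-starvation pattern:

```python
import threading
import time

class ToyNameNode:
    """Toy model: one global lock shared by a long 'rescan' and RPC
    handlers, mimicking a monitor thread that starves RPC handlers."""
    def __init__(self):
        self.lock = threading.Lock()
        self.skipped = []

    def rescan(self, duration):
        # The monitor holds the lock for the entire scan
        with self.lock:
            time.sleep(duration)

    def handle_rpc(self, name, deadline):
        start = time.monotonic()
        with self.lock:  # handler must wait for the scan to finish
            if time.monotonic() - start > deadline:
                # Client has already timed out: skip the call
                self.skipped.append(name)
                return None
            return "ok"

nn = ToyNameNode()
scan = threading.Thread(target=nn.rescan, args=(0.5,))
scan.start()
time.sleep(0.1)  # let the scan grab the lock first
result = nn.handle_rpc("monitorHealth", deadline=0.2)
scan.join()
print(result, nn.skipped)
```

If the real CacheReplicationMonitor rescan holds the FSNamesystem lock for the 40-65 seconds shown in the logs, that alone could exceed the ZKFC's health-check timeout, which would explain why the ZKFC loses the NN and why only unusually long scans trigger the problem.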

1 ACCEPTED SOLUTION


@rmaruthiyodan@hortonworks.com

Please review the following information, and check whether your nodes are healthy and the connectivity between them is stable.

From the Apache HDFS HA documentation: 'It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager.' Further reading can be found here: http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.htm...

  • If NameNode remains unresponsive for long enough, ZooKeeper notices, and gives control to the HA backup NameNode. The backup NameNode increments the epoch-count in the journal nodes (as it should) and takes over control of HDFS. Eventually, the AD call returns, and the former NameNode wakes up, notices that the epoch-count in the journal nodes has inexplicably increased by one, and shuts itself down, as it should do in response to this condition. (It is designed to do this to avoid two NameNodes in a split-brain situation.)
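As a sketch, fencing methods are configured in hdfs-site.xml. The sshfence-with-shell-fallback pattern below is a common setup when using QJM (the key-file path is a placeholder, and the shell(/bin/true) fallback should only be used where QJM's epoch protection makes aggressive fencing unnecessary):

```xml
<!-- hdfs-site.xml: fencing for HA failover (sketch) -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Try to kill the old Active over SSH; fall back to a no-op
       so failover can proceed, relying on QJM epoch fencing -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <!-- Placeholder path: key the ZKFC uses for sshfence -->
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```

With sshfence configured and working, the hung NameNode would be killed during failover instead of waking up later and aborting on the JournalNode epoch mismatch.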
