Active / Standby namenodes for HDFS HA behave erratically

Contributor

Hi,

I am trying to deploy CDH4 with CM 5.3. I managed to enable HA programmatically, but unfortunately neither of the NameNodes becomes the active one. What I see most frequently is that one of them is the standby NameNode and the other does not report any state (neither active nor standby). The health test reports:

NameNode summary: aws.us-west2a.ccs-nn-1.dev.cypher (Availability: Standby, Health: Good), aws.us-west2a.ccs-nn-2.dev.cypher (Availability: Unknown, Health: Good). This health test is bad because the Service Monitor did not find an active NameNode.

Digging into the logs, I see:

2015-02-24 01:14:38,197 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022
2015-02-24 01:14:38,282 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Log not rolled. Name node is in safe mode.

Using Cloudera Manager, I manually force both NameNodes to leave safe mode and restart the service. After that I observe one of two behaviors:

1) The same as before: one NameNode is standby and the other does not report a state.

2) One NameNode is active and the other does not report a state.

The logs report:

2015-02-24 01:10:41,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2015-02-24 01:10:41,421 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 2
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /mnt/data1/dfs/nn/current
2015-02-24 01:10:41,441 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs
2015-02-24 01:10:41,463 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4feaefc5 expecting start txid #1177
2015-02-24 01:10:41,464 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/data1/dfs/nn/current/edits_0000000000000001198-0000000000000001198' to transaction ID 1177
2015-02-24 01:10:41,473 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1177, but got txid 1198.
2015-02-24 01:10:41,473 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
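
One recovery path I am considering for the edit-log gap is to re-sync the broken NameNode's metadata from the healthy one; a rough sketch (untested on my side, and the name directory path is simply the one from this cluster):

# On the NameNode that hit the gap, with its NameNode role stopped:
# move the stale metadata aside and pull a fresh fsimage from the other NameNode
mv /mnt/data1/dfs/nn /mnt/data1/dfs/nn.bak
mkdir -p /mnt/data1/dfs/nn && chown hdfs:hdfs /mnt/data1/dfs/nn
sudo -u hdfs hdfs namenode -bootstrapStandby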

Any help is appreciated.

7 REPLIES

When an NN is to be made active, the ZKFC will write to the znode under /hadoop-ha. If this cannot be done for whatever reason, the NN cannot be made active. This is likely an issue with ZooKeeper and the ZKFC. Check the ZKFC logs (on the same host as the NN); they should provide a clue. If there's nothing there, check whether the ZooKeeper hosts are responding correctly (is the service up?).
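
For example, something along these lines (substitute your own ZooKeeper host and nameservice id):

# Is the ZooKeeper server up and answering?
echo ruok | nc <zk-host> 2181          # a healthy server replies "imok"

# Inspect the HA znodes the ZKFC writes (zookeeper-client wraps zkCli.sh in CDH)
zookeeper-client -server <zk-host>:2181 ls /hadoop-ha
zookeeper-client -server <zk-host>:2181 get /hadoop-ha/<nameservice>/ActiveBreadCrumb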


Regards,
Gautam Gopalakrishnan

Contributor

Thanks, Gautam.

Your suggestion was indeed very useful. I went to the ZooKeeper logs and found:

2015-02-24 18:01:05,616 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x34bb91dede303c9 type:create cxid:0x1 zxid:0xe00002ce3 txntype:-1 reqpath:n/a Error Path:/hadoop-ha/mynameservice/ActiveStandbyElectorLock Error:KeeperErrorCode = NodeExists for /hadoop-ha/mynameservice/ActiveStandbyElectorLock

I thus tried this:

1) Stop HDFS.

2) Delete the znode /hadoop-ha/mynameservice/ActiveStandbyElectorLock from ZooKeeper.

3) Restart HDFS.

But I still got the same error. It seems that some other process, or an early stage of the HDFS startup, recreates that znode. Any input on this?
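
I am also wondering whether, instead of deleting znodes by hand, I should simply re-initialize the HA state in ZooKeeper with the ZKFC itself, something like this (with HDFS stopped; I have not tried it yet):

# Recreate the /hadoop-ha znodes from scratch (run on a NameNode host while HDFS is stopped)
sudo -u hdfs hdfs zkfc -formatZK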

Contributor

As I mentioned in the previous post, the error still appears, but after about 5 minutes I now have an active NameNode and the following error:

NameNode summary: aws.us-west2a.ccs-nn-1.dev.cypher (Availability: Active, Health: Good), aws.us-west2a.ccs-nn-2.dev.cypher (Availability: Unknown, Health: Good). This health test is concerning because the Service Monitor did not find a standby NameNode.

ZooKeeper does not log anything interesting. I am sure it is related to the ZooKeeper lock again: one machine gets its role first (either standby or active), and the other one does not know what to do.

Contributor

I have another piece of information that may help:

HBase behaves exactly the same way: there are two Masters, but neither is active:

2015-02-24 18:22:33,721 INFO org.apache.zookeeper.server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14bb91dd33c03d7 type:create cxid:0x8 zxid:0xe00002e28 txntype:-1 reqpath:n/a Error Path:/hbase/online-snapshot/acquired Error:KeeperErrorCode = NodeExists for /hbase/online-snapshot/acquired
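
To see which Master (if any) actually holds the lock, I am checking ZooKeeper directly (this assumes the default zookeeper.znode.parent of /hbase; <zk-host> is one of my ZooKeeper servers):

# The data of /hbase/master identifies the currently active HBase Master
zookeeper-client -server <zk-host>:2181 get /hbase/master
zookeeper-client -server <zk-host>:2181 ls /hbase/backup-masters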

MapReduce, on the other hand, works fine: it has an active JobTracker and the other one is in standby.

Contributor

After doing more research, I have found that the NameNode that does not report a state might actually be the active one. The logs of its Failover Controller show:

2015-02-24 22:50:06,386 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2015-02-24 22:50:06,391 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2015-02-24 22:50:06,398 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2015-02-24 22:50:06,399 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mynameservice/ActiveBreadCrumb to indicate that the local node is the most recent active...
2015-02-24 22:50:06,403 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022 active...
2015-02-24 22:50:09,679 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022 to active state

I am assuming this could be the case: if that NameNode is in fact active, then the problem may be in the Service Monitor. This is just a hypothesis, though. Does it make sense?
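
To test that hypothesis I plan to ask each NameNode for its own HA state, bypassing the Service Monitor entirely (nn1 and nn2 are the NameNode IDs configured for my nameservice; yours may be named differently):

# Query the HA state directly from each NameNode
sudo -u hdfs hdfs haadmin -getServiceState nn1
sudo -u hdfs hdfs haadmin -getServiceState nn2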

Contributor

After checking the individual web pages of the HDFS NameNodes (port 50070) and the HBase Masters (port 60010), I have discovered that the services are actually working fine. There is an active NameNode and a standby NameNode, and the same for HBase: one active Master and one standby Master.
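
For what it is worth, the same state the web UI shows is also exposed as JSON by the NameNode's JMX servlet, which is how I have been double-checking (I do not know whether the Service Monitor reads this endpoint):

# The FSNamesystem bean carries a "tag.HAState" attribute ("active" or "standby")
curl 'http://aws.us-west2a.ccs-nn-2.dev.cypher:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'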


So how come Cloudera Manager is not able to detect the state of these processes? What procedure does CM use: a REST call, a connection to a port, an entry in a database? I need to know how it is done so I can keep tracking the problem down.

Thanks!

New Contributor

Hi,

I am also facing the same issue. What was the resolution for it?

The HDFS, HBase, and YARN active and standby instances are working fine and show correctly in the web UI, but Cloudera Manager is not reporting any active or standby instance. That is why I am getting the error below in Cloudera Manager version 7.11 (CDP):

This health test is bad because the Service Monitor did not find an active NameNode.