Hi,
I am trying to deploy CDH4 with CM 5.3. I managed to enable HA programmatically (a rough sketch of the call I made is below the health report), but unfortunately neither of the NameNodes ever becomes the active one. What I see most frequently is that one of them is a standby NameNode and the other one does not report any state (neither active nor standby). The health test reports:
NameNode summary: aws.us-west2a.ccs-nn-1.dev.cypher (Availability: Standby, Health: Good), aws.us-west2a.ccs-nn-2.dev.cypher (Availability: Unknown, Health: Good). This health test is bad because the Service Monitor did not find an active NameNode.
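For reference, this is roughly how I enabled HA programmatically, going through the CM REST API's hdfsEnableNnHa command. It is only a sketch: the cluster/service/role names and the JournalNode hosts are placeholders for my setup, and the request-body field names are my recollection of ApiEnableNnHaArguments, so they should be checked against the API docs for CM 5.3.

```python
import json
import requests

CM_API = "http://cm-host.dev.cypher:7180/api/v9"  # check /api/version for your CM
AUTH = ("admin", "admin")

# Rough shape of the hdfsEnableNnHa command body; field names are approximate
# and the JournalNode hosts below are placeholders for my cluster.
body = {
    "activeNnName": "hdfs-NAMENODE-1",             # role name of the existing NameNode
    "standbyNnHostId": "aws.us-west2a.ccs-nn-2.dev.cypher",
    "nameservice": "nameservice1",
    "jns": [
        {"jnHostId": "aws.us-west2a.ccs-jn-%d.dev.cypher" % i,
         "jnEditsDir": "/mnt/data1/dfs/jn"}
        for i in (1, 2, 3)
    ],
}

resp = requests.post(
    CM_API + "/clusters/cluster/services/hdfs/commands/hdfsEnableNnHa",
    data=json.dumps(body),
    headers={"Content-Type": "application/json"},
    auth=AUTH)
resp.raise_for_status()
print(resp.json())  # returns a command object; I poll it until it finishes
```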
Digging into the logs, I see:
2015-02-24 01:14:38,197 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode aws.us-west2a.ccs-nn-2.dev.cypher/10.2.3.22:8022
2015-02-24 01:14:38,282 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Log not rolled. Name node is in safe mode.
Using Cloudera Manager, I manually force both NameNodes to leave safe mode and restart the service. Then I observe one of two behaviors (I also cross-check what each NameNode itself reports using the JMX sketch after the list):
1) The same as before: one NameNode is standby and the other does not report any state.
2) One NameNode is active and the other does not report any state.
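To see what each NameNode reports on its own, independently of the Service Monitor, I poll the NameNode /jmx servlet with the small script below. The port (50070) and the bean/attribute names are the stock Hadoop metrics names, which I believe are right for CDH but are worth double-checking on CDH4.

```python
import requests

# My two NameNode web UIs (default NN HTTP port 50070).
NAMENODES = ["aws.us-west2a.ccs-nn-1.dev.cypher:50070",
             "aws.us-west2a.ccs-nn-2.dev.cypher:50070"]

def jmx_bean(nn, bean):
    """Fetch a single JMX bean from the NameNode's /jmx servlet."""
    r = requests.get("http://%s/jmx" % nn, params={"qry": bean}, timeout=10)
    r.raise_for_status()
    beans = r.json().get("beans", [])
    return beans[0] if beans else {}

for nn in NAMENODES:
    ha_state = jmx_bean(nn, "Hadoop:service=NameNode,name=FSNamesystem").get("tag.HAState")
    safemode = jmx_bean(nn, "Hadoop:service=NameNode,name=NameNodeInfo").get("Safemode")
    print("%-45s HAState=%s Safemode=%s" % (nn, ha_state, safemode or "OFF"))
```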
The logs report:
2015-02-24 01:10:41,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2015-02-24 01:10:41,421 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 2
2015-02-24 01:10:41,437 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Recovering unfinalized segments in /mnt/data1/dfs/nn/current
2015-02-24 01:10:41,441 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs
2015-02-24 01:10:41,463 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@4feaefc5 expecting start txid #1177
2015-02-24 01:10:41,464 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/mnt/data1/dfs/nn/current/edits_0000000000000001198-0000000000000001198' to transaction ID 1177
2015-02-24 01:10:41,473 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: There appears to be a gap in the edit log. We expected txid 1177, but got txid 1198.
2015-02-24 01:10:41,473 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
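My guess from searching around is that the NameNode hitting the txid gap has stale metadata and needs to be re-synced from the healthy one, roughly as sketched below, but I would like to confirm that this is the right approach before wiping anything.

```python
import subprocess

# Sketch only: on the NameNode whose edits are behind (the one hitting the
# txid gap), stop its NameNode role in CM, then re-copy the namespace from
# the healthy NameNode. -force overwrites the local name directory
# (/mnt/data1/dfs/nn here), so this must only be run on the broken node.
subprocess.check_call(
    ["sudo", "-u", "hdfs", "hdfs", "namenode", "-bootstrapStandby", "-force"])
```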
Any help is appreciated.