Created 05-20-2016 11:22 PM
Both NameNodes (Active and Standby) crashed. I restarted the Active and it is serving, but we are unable to restart the Standby NN. I tried restarting it manually, but it still fails. How do I recover and restart the Standby NameNode?
Version: HDP 2.2
2016-05-20 18:53:57,954 INFO namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream 'http://usw2stdpma01.glassdoor.local:8480/getJournal?jid=dfs-nameservices&segmentTxId=14726901&storageInfo=-60%3A761966699%3A0%3ACID-d16e0895-7c12-404e-9223-952d1b19ace0' to transaction ID 13013207
2016-05-20 18:53:58,216 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(750)) - Encountered exception loading fsimage
java.io.IOException: There appears to be a gap in the edit log. We expected txid 13013207, but got txid 14726901.
    at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:212)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:140)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1032)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:748)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:538)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:597)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:764)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:748)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1441)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1507)
2016-05-20 18:53:58,322 FATAL namenode.NameNode (NameNode.java:main(1512)) - Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 13013207, but got txid 14726901.
    at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:212)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:140)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1032)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:748)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:538)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:597)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:764)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:748)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1441)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1507)
2016-05-20 18:53:58,324 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2016-05-20 18:53:58,325 INFO namenode.NameNode (StringUtils.java:run(659)) - SHUTDOWN_MSG
Created 05-21-2016 07:18 PM
Please run the commands below as the root user.
1. Put the Active NN in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter
2. Run a saveNamespace operation on the Active NN
sudo -u hdfs hdfs dfsadmin -saveNamespace
3. Leave Safemode
sudo -u hdfs hdfs dfsadmin -safemode leave
4. Log in to the Standby NN
5. Run the command below on the Standby NameNode to fetch the latest fsimage that we saved in the steps above.
sudo -u hdfs hdfs namenode -bootstrapStandby -force
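If it helps, the steps above could be wrapped in a small script along these lines. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration and are not part of HDFS. With DRY_RUN=1 (the default here) the commands are only printed, so you can review them before running anything against the cluster.

```shell
#!/bin/sh
# Sketch only. DRY_RUN, run, checkpoint_active and bootstrap_standby are
# illustrative names, not HDFS features.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Steps 1-3: run on the Active NN host.
checkpoint_active() {
  run sudo -u hdfs hdfs dfsadmin -safemode enter || return 1
  run sudo -u hdfs hdfs dfsadmin -saveNamespace
  save_rc=$?
  # Leave safemode even if saveNamespace failed, then report that failure.
  run sudo -u hdfs hdfs dfsadmin -safemode leave || return 1
  return "$save_rc"
}

# Step 5: run on the Standby NN host.
bootstrap_standby() {
  run sudo -u hdfs hdfs namenode -bootstrapStandby -force
}
```

After sourcing the file, call checkpoint_active on the Active NN and bootstrap_standby on the Standby NN, with DRY_RUN=0 once the printed commands look right.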
Created 01-26-2018 08:42 PM
We just ran into this problem. @Jeff Arnold above is correct that, since the standby namenode is down, the dfsadmin commands will fail. Instead of making the /etc/hosts file change he recommends, you can manually override -fs in the commands, as suggested here: https://issues-test.apache.org/jira/browse/HDFS-8277?focusedCommentId=14517247&page=com.atlassian.ji....
For example, the dfsadmin commands change to this:
sudo -u hdfs hdfs dfsadmin -fs hdfs://<active_namenode>:<rpc_port> -safemode enter
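The same -fs override applies to all three dfsadmin steps (safemode enter, saveNamespace, safemode leave). A rough sketch of a helper that adds it for you is below. The helper name, the DRY_RUN switch, and the host/port values in the usage note are placeholders I made up; substitute your own active NameNode host and RPC port.

```shell
#!/bin/sh
# Sketch only. dfsadmin_active, run and DRY_RUN are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: dfsadmin_active <active_namenode> <rpc_port> <dfsadmin args...>
# Sends the dfsadmin command straight to the given NameNode via -fs.
dfsadmin_active() {
  nn_host="$1"
  nn_port="$2"
  shift 2
  run sudo -u hdfs hdfs dfsadmin -fs "hdfs://${nn_host}:${nn_port}" "$@"
}
```

For example, dfsadmin_active nn1.example.com 8020 -safemode enter prints (or, with DRY_RUN=0, runs) the command shown above, with a hypothetical host and port filled in. Repeat with -saveNamespace and then -safemode leave.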
Also, if you are using Cloudera Manager, the config used by the "namenode -bootstrapStandby" command does not include the JournalNode config needed for shared edits. You will need to copy the running config from the active namenode. It will be under something like /run/cloudera-scm-agent/process/5134-hdfs-NAMENODE. Copy that to the standby namenode and point the bootstrap command at it:
sudo -i -u hdfs HADOOP_CONF_DIR=<your_copied_config> hdfs namenode -bootstrapStandby -force
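To find the directory to copy, something like the sketch below may help on the active namenode. It assumes the process directories all end in -hdfs-NAMENODE, as in the path above, and that the most recently modified one belongs to the running NameNode. Both are assumptions, so check the result before copying. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only. latest_nn_conf_dir is an illustrative helper, not a
# Cloudera Manager or HDFS tool.

# Usage: latest_nn_conf_dir [base_dir]
# Prints the most recently modified *-hdfs-NAMENODE directory under
# base_dir, or nothing if there is none.
latest_nn_conf_dir() {
  base="${1:-/run/cloudera-scm-agent/process}"
  # -d lists the directories themselves, -t sorts newest first.
  ls -1dt "$base"/*-hdfs-NAMENODE 2>/dev/null | head -n 1
}
```

Run latest_nn_conf_dir on the active namenode, copy the directory it prints to the standby (for example with scp -r or rsync), and pass the copied location as HADOOP_CONF_DIR in the bootstrap command.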