Hadoop cluster: unable to start standby NameNode

We have an HDP Hadoop cluster with two NameNode services (one active NameNode and one standby NameNode).

After an unexpected power failure, the standby NameNode failed to start with the following exception, while the active NameNode started successfully:

2024-02-02 08:47:11,497 INFO common.Storage (Storage.java:tryLock(776)) - Lock on /hadoop/hdfs/namenode/in_use.lock acquired by nodename 36146@master1.delax.com
2024-02-02 08:47:11,891 INFO namenode.FSImage (FSImage.java:loadFSImageFile(745)) - Planning to load image: FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
2024-02-02 08:47:11,897 ERROR namenode.FSImage (FSImage.java:loadFSImage(693)) - Failed to load image from FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:221)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:898)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:882)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:755)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:686)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1077)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)
2024-02-02 08:47:12,238 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(726)) - Encountered exception loading fsimage
java.io.IOException: Failed to load FSImage file, see error(s) above for more info.


From the exception above we can see `Failed to load image from FSImageFile`, which appears to be a result of the machine's unexpected shutdown.
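One way to check whether the image on the standby is really damaged is to compare it with the `.md5` file that the NameNode normally writes next to each fsimage. This is only a minimal sketch: it assumes the sidecar is named `<fsimage>.md5`, that its first field is the hex digest, and that `md5sum` is installed. The function name `check_fsimage` is hypothetical.

```shell
# check_fsimage IMAGE: compare an fsimage file with its .md5 sidecar.
# Assumption: the sidecar is IMAGE.md5 and starts with the hex digest.
check_fsimage() {
  local image="$1"
  local sidecar="${image}.md5"
  if [ ! -s "$image" ]; then
    echo "MISSING_OR_EMPTY"
    return 1
  fi
  if [ ! -f "$sidecar" ]; then
    echo "NO_MD5_FILE"
    return 2
  fi
  local expected actual
  # First whitespace-separated field of the sidecar is the recorded digest.
  expected="$(awk '{print $1; exit}' "$sidecar")"
  # Digest of the file as it currently exists on disk.
  actual="$(md5sum "$image" | awk '{print $1}')"
  if [ "$expected" = "$actual" ]; then
    echo "OK"
  else
    echo "MISMATCH"
    return 3
  fi
}
```

For the file in the log this would be called as `check_fsimage /hadoop/hdfs/namenode/current/fsimage_0000000052670667141`. A `MISMATCH` would be consistent with a file that was cut short by the power failure.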

As I understand it, one option for recovering the standby NameNode is the following procedure:


1. Put the active NN in safemode

sudo -u hdfs hdfs dfsadmin -safemode enter

2. Run a saveNamespace operation on the active NN


sudo -u hdfs hdfs dfsadmin -saveNamespace

3. Leave safemode

sudo -u hdfs hdfs dfsadmin -safemode leave

4. Log in to the standby NN

5. Run the command below on the standby NameNode to fetch the latest fsimage that was saved in the steps above.

sudo -u hdfs hdfs namenode -bootstrapStandby -force
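The five steps can be wrapped in two small functions so that safemode is always left again, even when the save fails. This is a sketch, not a tested runbook: `HDFS_CMD`, `save_namespace_on_active` and `bootstrap_standby` are hypothetical names, and the only HDFS commands used are the ones listed in the steps.

```shell
# Assumption: HDFS_CMD defaults to the invocation used in the steps above.
# It is a hypothetical override hook so the sketch can be dry-run.
HDFS_CMD="${HDFS_CMD:-sudo -u hdfs hdfs}"

# Steps 1-3: run on the active NameNode host.
save_namespace_on_active() {
  $HDFS_CMD dfsadmin -safemode enter || return 1
  local rc=0
  $HDFS_CMD dfsadmin -saveNamespace || rc=$?
  # Always try to leave safemode, even if the save failed.
  $HDFS_CMD dfsadmin -safemode leave || return 1
  return "$rc"
}

# Step 5: run on the standby NameNode host while the standby NameNode
# process is stopped.
bootstrap_standby() {
  $HDFS_CMD namenode -bootstrapStandby -force
}
```

On the active host one would call `save_namespace_on_active`, then on the standby host `bootstrap_standby`, and start the standby NameNode afterwards.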


We would be glad to receive any suggestions, or confirmation that the procedure above is good enough for our problem.

Michael-Bronson
1 ACCEPTED SOLUTION

Expert Contributor

=> If copying the fsimage to the standby still gives you issues, you can simply run step 5 of your procedure, i.e. the command below, from the standby NN.

// Bootstrap the standby NameNode. This command copies the contents of the active NameNode's metadata directories (including the namespace information and the most recent checkpoint) to the standby NameNode.

# hdfs namenode -bootstrapStandby

Note: Steps 1 to 3 only create a new fsimage. If your active NN is already up and running, I would log in directly to the standby and run the bootstrapStandby operation.
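Because bootstrapStandby replaces the standby's local metadata, it may be worth keeping a copy of the damaged directory first. The following is a minimal sketch under assumptions: `backup_nn_dir` is a hypothetical helper, and both the source directory and the backup location are placeholders to adapt.

```shell
# backup_nn_dir SRC DEST_PARENT: copy a NameNode metadata directory into a
# timestamped folder before it is overwritten. Both arguments are placeholders.
backup_nn_dir() {
  local src="$1"
  local dest_parent="$2"
  if [ ! -d "$src" ]; then
    echo "no such directory: $src" >&2
    return 1
  fi
  local dest
  dest="${dest_parent}/namenode-backup-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  # -a preserves permissions and timestamps (and ownership when run as root).
  cp -a "$src/." "$dest/" || return 1
  # Print the backup location so the caller can record it.
  echo "$dest"
}
```

With the path from the log this could be run as `backup_nn_dir /hadoop/hdfs/namenode /some/backup/location` before the bootstrap, where the second argument is an example only.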

3 REPLIES

Expert Contributor

The approach you mentioned involves further downtime.

If your active NN is up and running, you can simply copy the latest fsimage from the active NN's data directory to the standby NN's data directory and then try to start the standby NN again.
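To pick the right file on the active NN, the newest fsimage can be selected by its transaction ID suffix. This is a sketch with assumptions: `latest_fsimage` is a hypothetical helper, the active NN is assumed to use the same `/hadoop/hdfs/namenode/current` path as the standby in the log, and `standby-host` is a placeholder hostname.

```shell
# latest_fsimage DIR: print the name of the newest fsimage file in DIR.
# Prints nothing if DIR holds no fsimage files.
latest_fsimage() {
  local dir="$1"
  # Transaction IDs are zero-padded, so a plain lexical sort orders them.
  # The pattern skips the .md5 sidecars and the edits_* files.
  ls "$dir" 2>/dev/null | grep -E '^fsimage_[0-9]+$' | sort | tail -n 1
}

# Example use on the active NN (paths and hostname are assumptions):
#   dir=/hadoop/hdfs/namenode/current
#   img="$(latest_fsimage "$dir")"
#   scp "$dir/$img" "$dir/$img.md5" standby-host:"$dir"/
```

After copying, the files on the standby should have the same owner and permissions as the other files in that directory before the standby NN is started.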


Let's say I copy the fsimage from the active to the standby NameNode and we still have a problem starting the NameNode. Can I then follow the steps I already mentioned?

Michael-Bronson
