Created on 02-03-2024 02:20 PM - edited 02-03-2024 02:23 PM
We have an HDP Hadoop cluster with two name-node services (one active name-node and one standby name-node).
Because of an unexpected power failure, the standby name-node fails to start with the following exception, while the active name-node starts successfully:
2024-02-02 08:47:11,497 INFO common.Storage (Storage.java:tryLock(776)) - Lock on /hadoop/hdfs/namenode/in_use.lock acquired by nodename 36146@master1.delax.com
2024-02-02 08:47:11,891 INFO namenode.FSImage (FSImage.java:loadFSImageFile(745)) - Planning to load image: FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
2024-02-02 08:47:11,897 ERROR namenode.FSImage (FSImage.java:loadFSImage(693)) - Failed to load image from FSImageFile(file=/hadoop/hdfs/namenode/current/fsimage_0000000052670667141, cpktTxId=0000000052670667141)
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:204)
at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:221)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:898)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:882)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:755)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:686)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1077)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)
2024-02-02 08:47:12,238 WARN namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(726)) - Encountered exception loading fsimage
java.io.IOException: Failed to load FSImage file, see error(s) above for more info.
The exception above shows `Failed to load image from FSImageFile` together with `Premature EOF from inputStream`. It seems the fsimage file on the standby was left incomplete or corrupted when the machine went down in the unexpected shutdown.
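One way to confirm that the image file itself is damaged, before changing anything, is to compare it with the `.md5` file that HDFS normally writes next to each fsimage. The following is only a sketch: the function name `check_fsimage_md5` is made up for illustration, and it assumes the sidecar exists and that its first field is the hex MD5 digest.

```shell
#!/usr/bin/env bash
# Sketch: compare an fsimage file against its .md5 sidecar file.
# Assumption: the sidecar's first whitespace-separated field is the hex digest.

check_fsimage_md5() {
    local image="$1"
    local sidecar="${image}.md5"
    local expected actual

    # Without both files there is nothing to compare.
    if [ ! -f "$image" ] || [ ! -f "$sidecar" ]; then
        echo "MISSING"
        return 2
    fi

    # Digest recorded when the image was written.
    expected=$(awk 'NR==1 {print $1}' "$sidecar")
    # Digest of the file as it is on disk now (md5sum prints "<digest>  <path>").
    actual=$(md5sum "$image" | awk '{print $1}')

    if [ "$expected" = "$actual" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Example, using the path from the log above (run as a user that can read it):
# check_fsimage_md5 /hadoop/hdfs/namenode/current/fsimage_0000000052670667141
```

A `MISMATCH` or `MISSING` result on the standby would be consistent with the image having been cut short by the power loss.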
As I understand it, one option for recovering the standby name-node is the following procedure:
1. Put Active NN in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter
2. Run a saveNamespace operation on the Active NN
sudo -u hdfs hdfs dfsadmin -saveNamespace
3. Leave Safemode
sudo -u hdfs hdfs dfsadmin -safemode leave
4. Login to Standby NN
5. Run the command below on the Standby name-node to fetch the latest fsimage that was saved in the steps above.
sudo -u hdfs hdfs namenode -bootstrapStandby -force
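The five steps above could be wrapped roughly as follows. This is a sketch, not a tested recovery script: the `run` helper, the `DRY_RUN` variable and the two function names are invented for illustration, and by default the script only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-5. With DRY_RUN=1 (the default here) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1-3: run on the active name-node host.
save_namespace_on_active() {
    local rc
    run sudo -u hdfs hdfs dfsadmin -safemode enter || return 1
    run sudo -u hdfs hdfs dfsadmin -saveNamespace
    rc=$?
    # Leave safe mode even if saveNamespace failed, so HDFS is not left read-only.
    run sudo -u hdfs hdfs dfsadmin -safemode leave
    return "$rc"
}

# Step 5: run on the standby name-node host while the standby process is stopped.
bootstrap_standby() {
    run sudo -u hdfs hdfs namenode -bootstrapStandby -force
}
```

Calling `save_namespace_on_active` on the active host and then `bootstrap_standby` on the standby host with `DRY_RUN=0` would execute the commands for real. Note that HDFS rejects writes while the active NN is in safe mode, so the window between steps 1 and 3 should be kept short.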
We would be glad to receive any suggestions, or confirmation that the procedure above is good enough for our problem.
Created 02-04-2024 10:00 AM
The approach you mentioned involves further downtime.
If your active NN is up and running, you can simply copy the latest fsimage from the active NN's data directory to the standby NN's data directory and then try to start the standby NN again.
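A rough sketch of that copy, to be run on the active NN host. The helper name `latest_fsimage` and the `STANDBY_HOST` placeholder are hypothetical; the directory is the one shown in the log earlier in the thread, and the file ownership on the standby is an assumption to check against your own setup.

```shell
#!/usr/bin/env bash
# Sketch: pick the newest fsimage in a name-node "current" directory.
# fsimage names carry a zero-padded transaction ID, so a plain sort orders them.

latest_fsimage() {
    local dir="$1"
    # 'fsimage_*' does not match in-progress fsimage.ckpt_* files;
    # the .md5 sidecar files are excluded explicitly.
    find "$dir" -maxdepth 1 -type f -name 'fsimage_*' ! -name '*.md5' \
        | sort | tail -n 1
}

# Example use on the active host, with the standby name-node process stopped.
# STANDBY_HOST is a placeholder, not a value from this thread.
# NN_DIR=/hadoop/hdfs/namenode/current
# img=$(latest_fsimage "$NN_DIR")
# scp "$img" "$img.md5" "STANDBY_HOST:$NN_DIR/"
# Afterwards, on the standby, make sure the copied files are owned by the
# user that runs the name-node (assumed here to be hdfs).
```

Copying the `.md5` file along with the image keeps the pair consistent on the standby.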
Created 02-04-2024 10:43 AM
Let's say I copy the fsimage from the active to the standby name-node and the standby still fails to start. Can I then follow the steps I already mentioned?
Created 02-05-2024 05:21 AM
=> If the above steps still give you issues, you can simply execute step 5, or the command below, from the Standby NN