SYMPTOM: The Standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR:
2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE:
Edit log corruption can occur when an append operation fails due to a quota violation. This is a known bug, tracked in:
https://issues.apache.org/jira/browse/HDFS-7587 https://hortonworks.jira.com/browse/BUG-56811 https://hortonworks.jira.com/browse/EAR-1248
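If you suspect the same trigger, it can help to confirm whether the affected directory had a quota set around the time the append failed. A minimal sketch, using the path from the error above purely as an example:

    # Show name quota, space quota and current usage for the directory
    hdfs dfs -count -q /appdata/148973_perfengp
    # Output columns: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME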
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every journalnode in the cluster.
3. Back up the "current" folder of every namenode in the cluster.
4. Use the oev command to convert the binary edit log file into XML (see the oev sketch after this list).
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active namenode.
8. I got an error saying there was a gap in the edit logs.
9. Take the keytab for the service nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover (see the command sketch after this list).
11. Answer "c" when prompted about the gap.
12. I then saw other errors similar to the one I encountered at the beginning (the "file is not under construction" issue).
13. I had to run the hadoop namenode -recover command twice to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the journalnodes, the datanodes, the ZKFC controllers and finally the active namenode.
15. Some datanodes were identified as dead. After some investigation, I figured out that the information in ZooKeeper was empty, so I restarted the ZooKeeper servers, and after that the active namenode was there.
16. I started the standby namenode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed the command hadoop namenode -bootstrapStandby -force on the standby namenode.
18. The new FSImage was good and identical to the one on the active namenode.
19. I started the standby namenode successfully.
20. I launched the rest of the cluster.
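For steps 4-6, the offline edits viewer (oev) does the binary/XML conversion in both directions. A minimal sketch; the edits directory path and segment name are examples only, so use the segment that actually contains the failing txid (1585682886 in the error above) and always work on a backup copy:

    # Example path to the NameNode edits directory -- adjust to your dfs.namenode.name.dir
    cd /data/dfs/nn/current
    # Binary -> XML (the default processor is xml)
    hdfs oev -i edits_<start_txid>-<end_txid> -o edits.xml
    # Open edits.xml and delete the whole <RECORD> element whose <TXID> matches the failing transaction
    # XML -> binary
    hdfs oev -i edits.xml -o edits_<start_txid>-<end_txid> -p binary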
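For steps 9-11 and 17, the recovery and re-bootstrap commands look roughly like this on a Kerberized cluster; the keytab path, principal and realm below are examples, not values from this case:

    # On the active NameNode, authenticate as the NameNode service principal (example keytab/realm)
    kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
    # Run the edit log recovery; answer "c" (continue) when asked about the gap in the edit logs
    hadoop namenode -recover

    # On the standby NameNode, as the hdfs user, rebuild its FSImage from the active NameNode
    hadoop namenode -bootstrapStandby -force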
Also check the recovery procedure described in the linked documentation: Namenode-Recovery