SYMPTOM: The Standby NameNode crashes due to edit log corruption, complaining that OP_CLOSE cannot be applied because the file is not under construction.
ERROR:
2016-09-30T06:23:25.126-0400 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=/appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs, replication=3, mtime=1475223680193, atime=1472804384143, blockSize=134217728, blocks=[blk_1243879398_198862467], permissions=gsspe:148973_psdbpe:rwxrwxr-x, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=1585682886]
java.io.IOException: File is not under construction: /appdata/148973_perfengp/TARGET/092016/tempdb.TARGET.092016.hdfs
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:436)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:230)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:679)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1022)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:741)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:536)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:595)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:762)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:746)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1438)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1504)
ROOT CAUSE:
Edit log corruption can occur when an append operation fails due to a quota violation. This is a known bug, tracked in:
https://issues.apache.org/jira/browse/HDFS-7587 https://hortonworks.jira.com/browse/BUG-56811 https://hortonworks.jira.com/browse/EAR-1248
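If you suspect the same trigger, it can help to confirm whether the affected directory had a quota set around the time the append failed. A minimal sketch, using the path from the error above purely as an example:

    # Show name quota, space quota and current usage for the directory
    hdfs dfs -count -q /appdata/148973_perfengp
    # Output columns: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME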
RESOLUTION:
1. Stop everything.
2. Back up the "current" folder of every journalnode in the cluster.
3. Back up the "current" folder of every namenode in the cluster.
4. Use the oev command to convert the binary edit log file into XML (see the oev sketch after this list).
5. Remove the record corresponding to the TXID mentioned in the error.
6. Use the oev command to convert the XML edit log file back into binary.
7. Restart the active namenode.
8. I got an error saying there was a gap in the edit logs.
9. Take the keytab for the service nn/<host>@<REALM>.
10. Execute the command hadoop namenode -recover (see the command sketch after this list).
11. Answer "c" when prompted about the gap.
12. I then saw other errors similar to the one I encountered at the beginning (the "file is not under construction" issue).
13. I had to run the hadoop namenode -recover command twice to get rid of these errors.
14. The ZooKeeper servers were already started, so I started the journalnodes, the datanodes, the ZKFC controllers and finally the active namenode.
15. Some datanodes were identified as dead. After some investigation, I figured out that the information in ZooKeeper was empty, so I restarted the ZooKeeper servers, and after that the active namenode was there.
16. I started the standby namenode, but it raised the same errors concerning the gap in the edit logs.
17. As the hdfs user, I executed the command hadoop namenode -bootstrapStandby -force on the standby namenode.
18. The new FSImage was good and identical to the one on the active namenode.
19. I started the standby namenode successfully.
20. I launched the rest of the cluster.
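For steps 4-6, the offline edits viewer (oev) does the binary/XML conversion in both directions. A minimal sketch; the edits directory path and segment name are examples only, so use the segment that actually contains the failing txid (1585682886 in the error above) and always work on a backup copy:

    # Example path to the NameNode edits directory -- adjust to your dfs.namenode.name.dir
    cd /data/dfs/nn/current
    # Binary -> XML (the default processor is xml)
    hdfs oev -i edits_<start_txid>-<end_txid> -o edits.xml
    # Open edits.xml and delete the whole <RECORD> element whose <TXID> matches the failing transaction
    # XML -> binary
    hdfs oev -i edits.xml -o edits_<start_txid>-<end_txid> -p binary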
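For steps 9-11 and 17, the recovery and re-bootstrap commands look roughly like this on a Kerberized cluster; the keytab path, principal and realm below are examples, not values from this case:

    # On the active NameNode, authenticate as the NameNode service principal (example keytab/realm)
    kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
    # Run the edit log recovery; answer "c" (continue) when asked about the gap in the edit logs
    hadoop namenode -recover

    # On the standby NameNode, as the hdfs user, rebuild its FSImage from the active NameNode
    hadoop namenode -bootstrapStandby -force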
Also check the recovery procedure described in the linked documentation: Namenode-Recovery