Created on 10-27-2015 10:37 PM - edited 10-27-2015 11:28 PM
We are running a CDH 5.4.7 cluster, and after an automatic failover both NameNodes now refuse to start.
Output:
Failed to start namenode.
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:6339)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1149)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:677)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:663)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
2015-10-28 01:07:56,579 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
It looks similar to https://issues.apache.org/jira/browse/HDFS-8384
But the release notes say it is supposed to be fixed in 5.3.8: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_rn_fixed_in_538...
We are also not able to run hadoop namenode -recover; it fails with the same exception:
15/10/28 01:33:39 INFO namenode.FSImage: Save namespace
15/10/28 01:33:43 ERROR namenode.FSImage: Unable to save image for /data/1/dfs/nn
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:447)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7264)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Saver.serializeFilesUCSection(FSImageFormatPBINode.java:508)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInodes(FSImageFormatProtobuf.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInternal(FSImageFormatProtobuf.java:474)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.save(FSImageFormatProtobuf.java:410)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:958)
        at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1009)
        at java.lang.Thread.run(Thread.java:745)
Is there any workaround?
Created 10-27-2015 10:52 PM
I've looked at the code provided in hadoop-hdfs-2.6.0-cdh5.4.7.jar
  synchronized long getNumUnderConstructionBlocks() {
    assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
      + "acquired before counting under construction blocks";
    long numUCBlocks = 0;
    for (Lease lease : sortedLeases) {
      for (String path : lease.getPaths()) {
        final INodeFile cons;
        try {
          cons = this.fsnamesystem.getFSDirectory().getINode(path).asFile();
          Preconditions.checkState(cons.isUnderConstruction());
        } catch (UnresolvedLinkException e) {
          throw new AssertionError("Lease files should reside on this FS");
        }
        BlockInfo[] blocks = cons.getBlocks();
        if(blocks == null)
          continue;
        for(BlockInfo b : blocks) {
          if(!b.isComplete())
            numUCBlocks++;
        }
      }
    }
    LOG.info("Number of blocks under construction: " + numUCBlocks);
    return numUCBlocks;
  }
It looks like the patch from HDFS-8384 was not applied to CDH 5.4.7. The commit for the patch is here:
https://github.com/apache/hadoop/commit/8928729c80af0a154524e06fb13ed9b191986a78
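To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It is not the Hadoop code: LeaseCountSketch, FileEntry, countStrict and countTolerant are hypothetical names made up for illustration. countStrict mirrors the method quoted above, where one leased file that is not under construction aborts the whole count. countTolerant reflects my reading of the HDFS-8384 change, which, as far as I can tell, skips such a file with a warning instead of throwing; please check the linked commit for the real change.

```java
import java.util.ArrayList;
import java.util.List;

final class LeaseCountSketch {

    // Hypothetical stand-in for INodeFile: only what this sketch needs.
    static final class FileEntry {
        final String path;
        final boolean underConstruction;
        final int incompleteBlocks;

        FileEntry(String path, boolean underConstruction, int incompleteBlocks) {
            this.path = path;
            this.underConstruction = underConstruction;
            this.incompleteBlocks = incompleteBlocks;
        }
    }

    // Mirrors the quoted 5.4.7 method: any leased file that is not under
    // construction throws, so the caller (NameNode startup) fails.
    static long countStrict(List<FileEntry> leasedFiles) {
        long numUCBlocks = 0;
        for (FileEntry f : leasedFiles) {
            if (!f.underConstruction) {
                throw new IllegalStateException();
            }
            numUCBlocks += f.incompleteBlocks;
        }
        return numUCBlocks;
    }

    // Tolerant variant (assumed shape of the fix): record the inconsistent
    // file in 'skipped' and keep counting the remaining files.
    static long countTolerant(List<FileEntry> leasedFiles, List<String> skipped) {
        long numUCBlocks = 0;
        for (FileEntry f : leasedFiles) {
            if (!f.underConstruction) {
                skipped.add(f.path);
                continue;
            }
            numUCBlocks += f.incompleteBlocks;
        }
        return numUCBlocks;
    }

    public static void main(String[] args) {
        List<FileEntry> files = new ArrayList<>();
        files.add(new FileEntry("/tmp/open-file", true, 1));
        files.add(new FileEntry("/tmp/stale-lease", false, 0)); // leased, but already closed

        List<String> skipped = new ArrayList<>();
        System.out.println("tolerant count: " + countTolerant(files, skipped));
        System.out.println("skipped paths:  " + skipped);

        try {
            countStrict(files);
        } catch (IllegalStateException e) {
            System.out.println("strict count failed: " + e);
        }
    }
}
```

With the tolerant variant the offending path is reported rather than fatal, so it can be dealt with once the NameNode is running.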
Created 10-29-2015 08:12 AM
We had to manually patch the jar to get the NameNode running again.
Then we were able to remove the problematic file.
Here is the chain of events:
- The secondary NameNode tried to do a checkpoint but failed due to nodes under construction:
ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to save image for /data/1/dfs/nn
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:447)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7235)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Saver.serializeFilesUCSection(FSImageFormatPBINode.java:508)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInodes(FSImageFormatProtobuf.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInternal(FSImageFormatProtobuf.java:474)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.save(FSImageFormatProtobuf.java:410)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:958)
        at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1009)
        at java.lang.Thread.run(Thread.java:745)
- Cloudera Manager did warn us with an email, which we thought was a system problem (disk related).
- A bit after that we did a failover, and then both NameNodes refused to start.
- After looking around, we found that it could be related to HDFS-8384.
- Since we thought, based on the release notes, that the HDFS-8384 patch had already been applied to CDH 5.4.7, we looked elsewhere for the cause of the problem.
- We decided to take a look at the source code of hadoop-hdfs-2.6.0-cdh5.4.7.jar and realized that the patch was not applied.
- We manually compiled the patch (just the method that was causing the problem), repackaged the jar, and were able to restart the NameNode, find the faulty file, and get back on our feet.
Shall I open a JIRA to mention that HDFS-8384 is not applied to CDH 5.4.7?
What can cause an INode to be under construction?
Thanks
Created 10-29-2015 09:52 AM
HDFS-8384 is fixed in CDH 5.3.8 per the release notes but is not in CDH 5.4.7. It should be available in CDH 5.4.8 when it is released.
David Wilder, Community Manager
Created 10-29-2015 10:09 AM
Are patches not applied systematically between releases?
Created 10-29-2015 10:12 AM
They are. 5.3.8 (Oct 20th) happened after 5.4.7 (Sep 18th). The next release of 5.4 after the 5.3.8 release will have the fix.
Created 10-29-2015 10:17 AM
Thanks, that explains why the patch was not applied!
Is there any explanation (or a link where I can find the info) of what can cause a file to be under construction?
Created 10-29-2015 10:53 AM
I don't know of any such documentation.
Created 10-30-2015 09:58 AM
5.4.8 has been released.
David Wilder, Community Manager