Unable to start namenode after failover

Explorer

We are running a CDH 5.4.7 cluster, and after an automatic failover both NameNodes now refuse to start.

 

Output:

Failed to start namenode.
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:6339)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1149)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:677)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:663)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
2015-10-28 01:07:56,579 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

 

It looks similar to https://issues.apache.org/jira/browse/HDFS-8384

 

But we can see that it is supposed to be fixed in 5.3.8: http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_rn_fixed_in_538...

We are not able to run hadoop namenode -recover either, which fails with the same stack trace:

 

15/10/28 01:33:39 INFO namenode.FSImage: Save namespace
15/10/28 01:33:43 ERROR namenode.FSImage: Unable to save image for /data/1/dfs/nn
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:447)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7264)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Saver.serializeFilesUCSection(FSImageFormatPBINode.java:508)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInodes(FSImageFormatProtobuf.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInternal(FSImageFormatProtobuf.java:474)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.save(FSImageFormatProtobuf.java:410)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:958)
        at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1009)
        at java.lang.Thread.run(Thread.java:745)

Is there any workaround?

1 ACCEPTED SOLUTION

Expert Contributor

They are. CDH 5.3.8 (Oct 20th) was released after CDH 5.4.7 (Sep 18th). The next 5.4 release after the 5.3.8 release will have the fix.

View solution in original post

8 REPLIES

Explorer

I've looked at the code provided in hadoop-hdfs-2.6.0-cdh5.4.7.jar

 

  synchronized long getNumUnderConstructionBlocks() {
    assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
      + "acquired before counting under construction blocks";
    long numUCBlocks = 0;
    for (Lease lease : sortedLeases) {
      for (String path : lease.getPaths()) {
        final INodeFile cons;
        try {
          cons = this.fsnamesystem.getFSDirectory().getINode(path).asFile();
          Preconditions.checkState(cons.isUnderConstruction());
        } catch (UnresolvedLinkException e) {
          throw new AssertionError("Lease files should reside on this FS");
        }
        BlockInfo[] blocks = cons.getBlocks();
        if(blocks == null)
          continue;
        for(BlockInfo b : blocks) {
          if(!b.isComplete())
            numUCBlocks++;
        }
      }
    }
    LOG.info("Number of blocks under construction: " + numUCBlocks);
    return numUCBlocks;
  }

And it looks like the patch from HDFS-8384 was not applied to CDH 5.4.7? The commit for the patch is here:

https://github.com/apache/hadoop/commit/8928729c80af0a154524e06fb13ed9b191986a78
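
For comparison, here is roughly what the patched method looks like after that commit (a sketch based on the HDFS-8384 change, not necessarily the exact CDH source): instead of asserting with Preconditions.checkState, it logs a warning and skips any lease whose file is no longer under construction, so the NameNode can still start.

  synchronized long getNumUnderConstructionBlocks() {
    assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
      + "acquired before counting under construction blocks";
    long numUCBlocks = 0;
    for (Lease lease : sortedLeases) {
      for (String path : lease.getPaths()) {
        final INodeFile cons;
        try {
          cons = this.fsnamesystem.getFSDirectory().getINode(path).asFile();
          // HDFS-8384 (sketch): a lease can still exist for a file that is no
          // longer under construction; warn and skip it instead of throwing
          // IllegalStateException and aborting startup.
          if (!cons.isUnderConstruction()) {
            LOG.warn("The file " + path + " is not under construction but has lease.");
            continue;
          }
        } catch (UnresolvedLinkException e) {
          throw new AssertionError("Lease files should reside on this FS");
        }
        BlockInfo[] blocks = cons.getBlocks();
        if (blocks == null)
          continue;
        for (BlockInfo b : blocks) {
          if (!b.isComplete())
            numUCBlocks++;
        }
      }
    }
    LOG.info("Number of blocks under construction: " + numUCBlocks);
    return numUCBlocks;
  }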


Explorer

We had to manually patch the jar to get the NameNode running again.

 

Then we were able to remove the problematic file.

 

Here is the chain of events:

 

- The secondary NameNode tried to do a checkpoint but failed because of inodes under construction:

ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to save image for /data/1/dfs/nn
java.lang.IllegalStateException
        at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:447)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7235)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Saver.serializeFilesUCSection(FSImageFormatPBINode.java:508)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInodes(FSImageFormatProtobuf.java:431)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.saveInternal(FSImageFormatProtobuf.java:474)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Saver.save(FSImageFormatProtobuf.java:410)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:958)
        at org.apache.hadoop.hdfs.server.namenode.FSImage$FSImageSaver.run(FSImage.java:1009)
        at java.lang.Thread.run(Thread.java:745)

- Cloudera Manager warned us with an email, which we thought indicated a system problem (disk related).

- A bit after that we did a failover; then both NameNodes refused to start.

- After looking around, we found that it could be related to HDFS-8384.

- Since we thought the HDFS-8384 patch was supposed to be applied to CDH 5.4.7 according to the release notes, we looked elsewhere for the cause of the problem.

- We decided to take a look at the source code of hadoop-hdfs-2.6.0-cdh5.4.7.jar and realized that the patch was not applied.

- We manually compiled the patch (just the method that was causing the problem), repackaged the jar, and were able to restart the NameNode, discover the faulty file, and get back on our feet.

 

Shall I open a JIRA to mention that HDFS-8384 is not applied to CDH 5.4.7?

What can cause an INode to be under construction?

 

Thanks


Community Manager

HDFS-8384 is fixed in CDH 5.3.8 per the release notes, but it is not in CDH 5.4.7. It should be available in CDH 5.4.8 when it is released.



David Wilder, Community Manager



Explorer

Are the patches not applied systematically between releases?

 

Expert Contributor

They are. CDH 5.3.8 (Oct 20th) was released after CDH 5.4.7 (Sep 18th). The next 5.4 release after the 5.3.8 release will have the fix.

Explorer

Thanks, that explains why the patch was not applied!

 

Any explanation (or a link where I can find the info) on what can cause a file to be under construction?

Expert Contributor

I don't know of any such documentation.

Community Manager

5.4.8 has been released.

 

http://community.cloudera.com/t5/Release-Announcements/Announcing-Cloudera-Enterprise-5-4-8/m-p/3361...



David Wilder, Community Manager

