
NameNodes Refuse to Start; Unable to Recover?

Contributor

Following changes to the cluster to implement Kerberos, I tried to start it back up. The message I get is:

 

Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0.  Expected transaction ID was 3290752
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:589)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:818)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:797)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 3290751; expected file to go up to 3290785
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:186)
    ... 12 more

 

Just before implementing Kerberos I took a backup of the metadata on both NameNodes (they are set up in HA): I packed the NameNode metadata directory (dfs.namenode.name.dir, "/dfs/nn/") into a tar file. I did the same again just now.
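
Roughly what that backup looked like on each NameNode (a minimal sketch; the archive name is made up, and the NameNode role should be stopped first so the fsimage and edits are not changing underneath the tar):

# archive the NameNode metadata directory (dfs.namenode.name.dir = /dfs/nn here)
tar -czf /root/nn-metadata-backup.tar.gz -C /dfs nn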

 

OK, so first I tried the hdfs fsck command, and it and its variations fail: "sudo hadoop fsck /" gives me:

 

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/03/19 21:07:45 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over servername.domain.domain/xxx.xxx.xx.xxx:8020 after 1 fail over attempts. Trying to fail over after sleeping for 641ms.
java.net.ConnectException: Call From servername.domain.domain/xxx.xxx.xx.xxx to servername.domain.domain:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
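
In hindsight the "Connection refused" part makes sense: fsck has to talk to the NameNode's RPC port, and the NameNode is exactly what won't start. A couple of quick checks one can do here (a sketch; port 8020 is taken from the stack trace above):

# non-deprecated form of the same check (still fails while the NameNode is down)
hdfs fsck /

# confirm whether anything is actually listening on the NameNode RPC port
ss -ltnp | grep ':8020'
# or, on older systems:
netstat -ltnp | grep ':8020'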

 

I then decided to try using "namenode -recover".
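
The invocation was essentially the following (a sketch; judging from the log below it picked up a default configuration rather than the cluster's hdfs-site.xml, so the second form, run as the hdfs user with an explicit --config directory, is probably what it should have been; both the config path and the user are assumptions about a typical layout):

# what was run (as root), which is why the log below falls back to /tmp/hadoop-root/dfs/name
hdfs namenode -recover

# probably the more correct form: pick up the cluster configuration and run as the hdfs user
sudo -u hdfs hdfs --config /etc/hadoop/conf namenode -recover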

 

16/03/19 20:53:05 INFO namenode.NameNode: createNameNode [-recover]
You have selected Metadata Recovery mode.  This mode is intended to recover lost metadata on a corrupt filesystem.  Metadata recovery mode often permanently deletes data from your HDFS filesystem.  Please back up your edit log and fsimage before trying this!

Are you ready to proceed? (Y/N)
 (Y or N) y
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: starting recovery...
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 INFO namenode.FSNamesystem: No KeyProvider found.
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsLock is fair:true
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Mar 19 20:53:22
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map BlocksMap
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^21 = 2097152 entries
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s), dfs.encrypt.data.transfer.algorithm=null
16/03/19 20:53:22 INFO blockmanagement.BlockManager: defaultReplication         = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplication             = 512
16/03/19 20:53:22 INFO blockmanagement.BlockManager: minReplication             = 1
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsOwner             = root (auth:KERBEROS)
16/03/19 20:53:22 INFO namenode.FSNamesystem: supergroup          = supergroup
16/03/19 20:53:22 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Determined nameservice ID: StandbyNameNode
16/03/19 20:53:22 INFO namenode.FSNamesystem: HA Enabled: true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Append Enabled: true
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map INodeMap
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^20 = 1048576 entries
16/03/19 20:53:22 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map cachedBlocks
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^18 = 262144 entries
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/03/19 20:53:22 INFO namenode.NNConf: ACLs enabled? false
16/03/19 20:53:22 INFO namenode.NNConf: XAttrs enabled? true
16/03/19 20:53:22 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/03/19 20:53:22 INFO hdfs.StateChange: STATE* Safe mode is ON.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
16/03/19 20:53:22 WARN common.Storage: Storage directory /tmp/hadoop-root/dfs/name does not exist
16/03/19 20:53:22 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: RECOVERY FAILED: caught exception
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)

 

I'm looking for advice on how to proceed here. I really do not know what to do at this point.

 

Thanks for the help.

1 ACCEPTED SOLUTION

Contributor

Agh, this should have been more obvious to me. It was a problem with the keystore and certificates associated with the SSL implementation. I also found more documentation specific to the implementation that I had not noticed originally. Back up and running now.


4 REPLIES

Contributor

I decided to try replacing the metadata with the old backup from the day prior to the Kerberos implementation, and then started the cluster. I get the same error on startup, with the NameNode transaction ID being slightly lower than expected (just different edit numbers).
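
For clarity, "replacing the metadata" was along these lines on each NameNode, with the NameNode role stopped (a sketch; the directory is from my setup and the archive name is illustrative):

# set the current (bad) metadata aside and restore the pre-Kerberos copy
mv /dfs/nn /dfs/nn.broken
tar -xzf /root/nn-metadata-backup.tar.gz -C /dfs
chown -R hdfs:hdfs /dfs/nn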

 

I went ahead and reverted this, so I'm back to using the same metadata as described in the initial post above.

 

At the same time, I can't run any hadoop/hdfs commands to recover or check the system, because they expect the NameNode port to be open so they can connect to it.

Contributor

Agh, this should have been more obvious to me. It was a problem with the keystore and certificates associated with the SSL implementation. I also found more documentation specific to the implementation that I had not noticed originally. Back up and running now.
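
For anyone who hits the same thing: a quick way to sanity-check the keystore the roles are pointing at (a sketch; the path, password, and alias are placeholders, not real values):

# list all entries in the keystore, including certificate validity dates
keytool -list -v -keystore /path/to/keystore.jks -storepass changeit

# inspect a single entry by alias
keytool -list -v -keystore /path/to/keystore.jks -storepass changeit -alias example-alias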

Community Manager

Congratulations on solving the issue, and thank you for sharing the solution. Hopefully it will be of assistance to others in similar circumstances.


Cy Jervis, Manager, Community Program

New Contributor

Hi,

 

Congratulations on solving the issue.

I faced the same issue. What did you change to solve it?

 

Thanks and regards,