Explorer
Posts: 13
Registered: ‎05-20-2015

NameNodes Refuse to Start; Unable to Recover?

After making changes to implement Kerberos, I tried to start the cluster back up. The message I get is:

 

Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0.  Expected transaction ID was 3290752
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:589)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:818)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:797)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 3290751; expected file to go up to 3290785
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:186)
    ... 12 more

 

Just before implementing Kerberos I took a backup of the metadata on both namenodes, which run in HA: I packed the NameNode metadata directory "/dfs/nn/" (dfs.namenode.name.dir) into a tar file. I did the same again just now.
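
For reference, the backup itself was nothing fancy; roughly the following on each namenode (the path is the CDH default for dfs.namenode.name.dir, and the archive name is just an example; adjust for your configuration):

# stop the NameNode role first, then archive the metadata directory
sudo tar -czf /root/nn-backup-pre-kerberos.tar.gz -C /dfs nn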

 

OK, so first I tried the hdfs fsck command, and every variation of it fails. Running "sudo hadoop fsck /" gives me:

 

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

16/03/19 21:07:45 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over servername.domain.domain/xxx.xxx.xx.xxx:8020 after 1 fail over attempts. Trying to fail over after sleeping for 641ms.
java.net.ConnectException: Call From servername.domain.domain/xxx.xxx.xx.xxx to servername.domain.domain:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
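
Side note: per the deprecation warning above, the non-deprecated form should be something like the following (and with Kerberos now on, I'd presumably also need a kinit as an hdfs principal first). It fails with the same connection error, since nothing is listening on the namenode port:

sudo -u hdfs hdfs fsck /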

 

I then decided to try "namenode -recover":

 

16/03/19 20:53:05 INFO namenode.NameNode: createNameNode [-recover]
You have selected Metadata Recovery mode.  This mode is intended to recover lost metadata on a corrupt filesystem.  Metadata recovery mode often permanently deletes data from your HDFS filesystem.  Please back up your edit log and fsimage before trying this!

Are you ready to proceed? (Y/N)
 (Y or N) y
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: starting recovery...
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 INFO namenode.FSNamesystem: No KeyProvider found.
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsLock is fair:true
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Mar 19 20:53:22
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map BlocksMap
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^21 = 2097152 entries
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s), dfs.encrypt.data.transfer.algorithm=null
16/03/19 20:53:22 INFO blockmanagement.BlockManager: defaultReplication         = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplication             = 512
16/03/19 20:53:22 INFO blockmanagement.BlockManager: minReplication             = 1
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsOwner             = root (auth:KERBEROS)
16/03/19 20:53:22 INFO namenode.FSNamesystem: supergroup          = supergroup
16/03/19 20:53:22 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Determined nameservice ID: StandbyNameNode
16/03/19 20:53:22 INFO namenode.FSNamesystem: HA Enabled: true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Append Enabled: true
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map INodeMap
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^20 = 1048576 entries
16/03/19 20:53:22 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map cachedBlocks
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^18 = 262144 entries
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/03/19 20:53:22 INFO util.GSet: VM type       = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/03/19 20:53:22 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/03/19 20:53:22 INFO namenode.NNConf: ACLs enabled? false
16/03/19 20:53:22 INFO namenode.NNConf: XAttrs enabled? true
16/03/19 20:53:22 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/03/19 20:53:22 INFO hdfs.StateChange: STATE* Safe mode is ON.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
16/03/19 20:53:22 WARN common.Storage: Storage directory /tmp/hadoop-root/dfs/name does not exist
16/03/19 20:53:22 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: RECOVERY FAILED: caught exception
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
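
One thing I notice in that output: it is reading /tmp/hadoop-root/dfs/name, which is the stock Hadoop default, not my actual metadata directory. So I suspect the recover command ran without the cluster's configuration. Presumably something like this would point it at the right config (untested guess; the config path is an assumption):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumed location of the cluster client config
sudo -u hdfs hdfs namenode -recover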

 

I'm looking for advice on how to proceed here. I really do not know what to do at this point.

 

Thanks for the help.

Explorer
Posts: 13
Registered: ‎05-20-2015

Re: NameNodes Refuse to Start; Unable to Recover?

I decided to try replacing the metadata with the old backup from the day before the Kerberos implementation, then started the cluster. I get the same error on startup, just with a slightly lower expected transaction ID (different edit numbers).
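
Roughly what I did to swap the backup in on each namenode, with the cluster stopped (paths and archive name are examples matching my earlier backup):

# set the current metadata aside and unpack the pre-Kerberos backup;
# running tar as root preserves the original hdfs ownership
sudo mv /dfs/nn /dfs/nn.broken
sudo tar -xzf /root/nn-backup-pre-kerberos.tar.gz -C /dfs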

 

I went ahead and reverted this so I'm back to using the same metadata as described in the initial post above.

 

At the same time, I can't run any hadoop/hdfs commands to recover or check the system, because every command expects the namenode port to be open so it can connect.
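
Even a basic connectivity check confirms nothing is listening on the NameNode RPC port (the hostname is a stand-in for ours):

# from a client machine
nc -vz servername.domain.domain 8020

# or locally on the namenode host
sudo ss -ltnp | grep 8020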

Explorer
Posts: 13
Registered: ‎05-20-2015

Re: NameNodes Refuse to Start; Unable to Recover?

Agh, it should have been more obvious to me. It was a problem with the keystore and certificates associated with the SSL implementation. I also found more documentation specific to the implementation that I hadn't noticed originally. Back up and running now.
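
In case it helps anyone hitting this later: the kind of check that exposed it for me was simply inspecting the keystore the HDFS roles point at (the path and password here are only examples; use whatever your SSL configuration references):

# list the certificates in the keystore used for TLS/SSL
keytool -list -v -keystore /opt/cloudera/security/jks/node.jks -storepass changeit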

Posts: 619
Kudos: 71
Solutions: 36
Registered: ‎04-06-2015

Re: NameNodes Refuse to Start; Unable to Recover?

Congratulations on solving the issue and thank you for sharing the solution. Hopefully it will be of assistance to others with similar circumstances. 

 




Cy Jervis, Community Manager - I'm not an expert but will supply relevant content from time to time. :)


New Contributor
Posts: 1
Registered: ‎11-16-2017

Re: NameNodes Refuse to Start; Unable to Recover?

Hi,

 

Congratulations on solving the issue.

I faced the same issue. What did you change to solve it?

 

Thanks and regards,
