- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
NameNodes Refuse to Start; Unable to Recover?
- Labels:
-
HDFS
Created 03-19-2016 06:25 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Following changes to the cluster to implement Kerberos, I tried to start up the cluster. The message I get is:
Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 3290752
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:589)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:818)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:797)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 3290751; expected file to go up to 3290785
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:186)
... 12 more
Just before implementing Kerberos I did take a backup of the metadata on both namenodes, both of which are in HA. I packed the dfs.data.dir "/dfs/nn/" directory into a tar file. I did the same just now.
Ok so I tried first to use hdfs fsck command, and variations of it fails: " sudo hadoop fsck / " gives me:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/03/19 21:07:45 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over servername.domain.domain/xxx.xxx.xx.xxx:8020 after 1 fail over attempts. Trying to fail over after sleeping for 641ms.
java.net.ConnectException: Call From servername.domain.domain/xxx.xxx.xx.xxx to servername.domain.domain:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
I then decide to try using "namenode -recover"
16/03/19 20:53:05 INFO namenode.NameNode: createNameNode [-recover]
You have selected Metadata Recovery mode. This mode is intended to recover lost metadata on a corrupt filesystem. Metadata recovery mode often permanently deletes data from your HDFS filesystem. Please back up your edit log and fsimage before trying this!
Are you ready to proceed? (Y/N)
(Y or N) y
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: starting recovery...
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Be ware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) con figured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 INFO namenode.FSNamesystem: No KeyProvider found.
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsLock is fair:true
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:0 0:00.000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Mar 19 20:53:22
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map BlocksMap
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^21 = 2097152 entries
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.acces s.token.lifetime=600 min(s), dfs.encrypt.data.transfer.algorithm=null
16/03/19 20:53:22 INFO blockmanagement.BlockManager: defaultReplication = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplication = 512
16/03/19 20:53:22 INFO blockmanagement.BlockManager: minReplication = 1
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: encryptDataTransfer = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsOwner = root (auth:KERBEROS)
16/03/19 20:53:22 INFO namenode.FSNamesystem: supergroup = supergroup
16/03/19 20:53:22 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Determined nameservice ID: StandbyNameNode
16/03/19 20:53:22 INFO namenode.FSNamesystem: HA Enabled: true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Append Enabled: true
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map INodeMap
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^20 = 1048576 entries
16/03/19 20:53:22 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map cachedBlocks
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^18 = 262144 entries
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry ti me is 600000 millis
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^15 = 32768 entries
16/03/19 20:53:22 INFO namenode.NNConf: ACLs enabled? false
16/03/19 20:53:22 INFO namenode.NNConf: XAttrs enabled? true
16/03/19 20:53:22 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/03/19 20:53:22 INFO hdfs.StateChange: STATE* Safe mode is ON.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
16/03/19 20:53:22 WARN common.Storage: Storage directory /tmp/hadoop-root/dfs/name does not exist
16/03/19 20:53:22 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an incon sistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: RECOVERY FAILED: caught exception
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an incon sistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an incon sistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
I'm looking for advice on how to proceed here!? I really do not know what to do at this point.
Thanks for the help.
Created 04-15-2016 05:44 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Agh, should have been more obvious to me. Was a problem with the keystore and certificates associated with SSL implementation. Additionally, found more documentation specific to implementation, that was not noticed originally. Back up and running now.
Created 03-20-2016 04:57 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I decided to try replacing the metadata with the old backup from the day prior to the kerberos implmentation. I then started the cluster. I get the same error upon startup, for namenode txid being slightly lower than expected (just different edit numbers).
I went ahead and reverted this so I'm back to using the same metadata as described in the initial post above.
At the same time, I can't seem to run any hadoop/hdfs commands to recover/check the system, because it is expecting namenode port to be open and connect to it.
Created 04-15-2016 05:44 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Agh, should have been more obvious to me. Was a problem with the keystore and certificates associated with SSL implementation. Additionally, found more documentation specific to implementation, that was not noticed originally. Back up and running now.
Created 04-15-2016 06:26 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Congratulations on solving the issue and thank you for sharing the solution. Hopefully it will be of assistance to others with similar circumstances.
Cy Jervis, Manager, Community Program
Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Created 11-16-2017 12:22 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Congratulations on solving the issue.
I faced the same issue. What did you change for solve?
Thanks and regards,
