Created 03-19-2016 06:25 PM
Following changes to the cluster to implement Kerberos, I tried to start up the cluster. The message I get is:
Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 3290752
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:139)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:684)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:589)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:818)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:797)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1493)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 3290751; expected file to go up to 3290785
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:194)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:186)
... 12 more
Just before implementing Kerberos I took a backup of the metadata on both namenodes (both are in HA). I packed the dfs.namenode.name.dir directory, "/dfs/nn/", into a tar file. I did the same just now.
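For reference, the backup on each namenode was essentially the following (the /dfs/nn path comes from my configuration and may differ on other clusters):

tar -czf /root/nn-metadata-backup-$(date +%F).tar.gz -C /dfs nn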
OK, so I first tried to use the hdfs fsck command, and it fails in every variation: "sudo hadoop fsck /" gives me:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/03/19 21:07:45 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over servername.domain.domain/xxx.xxx.xx.xxx:8020 after 1 fail over attempts. Trying to fail over after sleeping for 641ms.
java.net.ConnectException: Call From servername.domain.domain/xxx.xxx.xx.xxx to servername.domain.domain:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
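The non-deprecated form fails the same way, since nothing is listening on the namenode RPC port:

sudo hdfs fsck /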
I then decided to try "namenode -recover".
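The command I ran was along these lines (as root):

hdfs namenode -recover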
16/03/19 20:53:05 INFO namenode.NameNode: createNameNode [-recover]
You have selected Metadata Recovery mode. This mode is intended to recover lost metadata on a corrupt filesystem. Metadata recovery mode often permanently deletes data from your HDFS filesystem. Please back up your edit log and fsimage before trying this!
Are you ready to proceed? (Y/N)
(Y or N) y
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: starting recovery...
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 WARN namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
16/03/19 20:53:22 INFO namenode.FSNamesystem: No KeyProvider found.
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsLock is fair:true
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/03/19 20:53:22 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Mar 19 20:53:22
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map BlocksMap
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^21 = 2097152 entries
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=true
16/03/19 20:53:22 INFO blockmanagement.BlockManager: dfs.block.access.key.update.interval=600 min(s), dfs.block.access.token.lifetime=600 min(s), dfs.encrypt.data.transfer.algorithm=null
16/03/19 20:53:22 INFO blockmanagement.BlockManager: defaultReplication = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplication = 512
16/03/19 20:53:22 INFO blockmanagement.BlockManager: minReplication = 1
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
16/03/19 20:53:22 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/03/19 20:53:22 INFO blockmanagement.BlockManager: encryptDataTransfer = false
16/03/19 20:53:22 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
16/03/19 20:53:22 INFO namenode.FSNamesystem: fsOwner = root (auth:KERBEROS)
16/03/19 20:53:22 INFO namenode.FSNamesystem: supergroup = supergroup
16/03/19 20:53:22 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Determined nameservice ID: StandbyNameNode
16/03/19 20:53:22 INFO namenode.FSNamesystem: HA Enabled: true
16/03/19 20:53:22 INFO namenode.FSNamesystem: Append Enabled: true
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map INodeMap
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^20 = 1048576 entries
16/03/19 20:53:22 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map cachedBlocks
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^18 = 262144 entries
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/03/19 20:53:22 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
16/03/19 20:53:22 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/03/19 20:53:22 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/03/19 20:53:22 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/03/19 20:53:22 INFO util.GSet: VM type = 64-bit
16/03/19 20:53:22 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/03/19 20:53:22 INFO util.GSet: capacity = 2^15 = 32768 entries
16/03/19 20:53:22 INFO namenode.NNConf: ACLs enabled? false
16/03/19 20:53:22 INFO namenode.NNConf: XAttrs enabled? true
16/03/19 20:53:22 INFO namenode.NNConf: Maximum size of an xattr: 16384
16/03/19 20:53:22 INFO hdfs.StateChange: STATE* Safe mode is ON.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
16/03/19 20:53:22 WARN common.Storage: Storage directory /tmp/hadoop-root/dfs/name does not exist
16/03/19 20:53:22 WARN namenode.FSNamesystem: Encountered exception loading fsimage
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 INFO namenode.MetaRecoveryContext: RECOVERY FAILED: caught exception
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1387)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1477)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1561)
16/03/19 20:53:22 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:314)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:202)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
I'm looking for advice on how to proceed. I really do not know what to do at this point.
Thanks for the help.
Created 03-20-2016 04:57 AM
I decided to try replacing the metadata with the old backup from the day before the Kerberos implementation, and then started the cluster. I get the same error on startup, with the namenode txid slightly lower than expected (just different edit numbers).
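The restore was roughly as follows, with the namenode stopped (the archive name here is just illustrative):

mv /dfs/nn /dfs/nn.bad
tar -xzf /root/nn-metadata-backup.tar.gz -C /dfs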
I went ahead and reverted this so I'm back to using the same metadata as described in the initial post above.
At the same time, I can't seem to run any hadoop/hdfs commands to recover or check the system, because they all expect the namenode port to be open so they can connect to it.
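Even a simple check confirms that nothing is listening on the namenode RPC port (8020 in my case):

netstat -tlnp | grep 8020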
Created 04-15-2016 05:44 AM
Agh, it should have been more obvious to me. It was a problem with the keystore and certificates associated with the SSL implementation. Additionally, I found more documentation specific to the implementation that I had not noticed originally. Back up and running now.
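In case it helps others: inspecting the keystore with keytool is what pointed me at the certificate problem (the keystore path below is from my setup and will differ elsewhere):

keytool -list -v -keystore /opt/cloudera/security/jks/keystore.jks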
Created 04-15-2016 06:26 AM
Congratulations on solving the issue, and thank you for sharing the solution. Hopefully it will be of assistance to others in similar circumstances.
Created 11-16-2017 12:22 AM
Hi,
Congratulations on solving the issue.
I am facing the same issue. What did you change to solve it?
Thanks and regards,