Created 12-01-2017 02:11 PM
Hello,
After a mass disk operation on our test environment, we lost all the data in the /data dir, which was assigned as the storage directory for ZooKeeper, Hadoop and Falcon (that is the list we know of so far).
Since it is our test cluster, the data is not important, but I don't want to reinstall all the components. I also want to learn how to get the cluster running again from this state.
In the /data dir we only have folders, no files.
After struggling a little with the ZKFailoverController, I was able to start it with the -formatZK flag.
Now, however, I am unable to start the namenode(s); I get the exception below:
10.0.109.12:8485: Directory /hadoop/hdfs/journal/testnamespace is in an inconsistent state: Can't format the storage directory because the current directory is not empty.
I have tried:
- removing the lost+found folder on the mount root,
- changing the ownership of all folders under /data/hadoop/hdfs to hdfs:hadoop,
- changing the permissions of all folders under /data/hadoop/hdfs to 777.
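In command form, that was roughly (run as root; /data being the mount root):
# rm -rf /data/lost+found
# chown -R hdfs:hadoop /data/hadoop/hdfs
# chmod -R 777 /data/hadoop/hdfs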
PS: I have updated the ownership of the path /hadoop/hdfs/, which contains the journal folder, and this moved me one step forward:
17/12/01 14:20:26 ERROR namenode.NameNode: Failed to start namenode. java.io.IOException: Cannot remove current directory: /data/hadoop/hdfs/namenode/current
PS: I have removed the contents of /data/hadoop/hdfs/namenode/current, and now it keeps retrying port 8485 on all of the JournalNode quorum hosts:
17/12/01 16:04:35 INFO ipc.Client: Retrying connect to server: bigdata2/10.0.109.11:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
and keeps printing the line below in the hadoop-hdfs-zkfc-bigdata2.out file:
Proceed formatting /hadoop-ha/testnamespace? (Y or N) Invalid input:
Do you have any suggestions?
Or should I give up?
Created 12-01-2017 06:23 PM
Since you don't care about the data, from an HDFS perspective it is easier to reinstall your cluster. If you insist, I can lead you through the recovery steps, but if I were you I would just reinstall at this point.
Created 12-04-2017 10:55 AM
If the recovery steps will take longer than a reinstall and/or give me an unstable cluster, then it's better to reinstall.
From your answer, I gather that those are the kinds of costs you mean, right?
Created 12-04-2017 01:40 PM
So... what are the steps for a reinstall?
Is there any way to start over with only the HDP installation, while keeping the OS-level changes done as prerequisites and also the Ambari installation?
Does the command ambari-server reset work for that?
Created 12-04-2017 02:15 PM
Stop the HDFS service if it's running.
Start only the JournalNodes (as they will need to be made aware of the formatting).
On the namenode (as user hdfs)
# su - hdfs
Format the namenode
$ hadoop namenode -format
Initialize the shared edits (for the JournalNodes)
$ hdfs namenode -initializeSharedEdits -force
Format ZooKeeper (to force ZooKeeper to reinitialise)
$ hdfs zkfc -formatZK -force
Using Ambari, restart the namenode.
If you are running an HA namenode, then:
On the second namenode, sync (force sync with the first namenode)
$ hdfs namenode -bootstrapStandby -force
On every datanode, clear the data directory (this is already done in your case).
Restart the HDFS service
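If you prefer to script the restarts instead of using the Ambari UI, the Ambari REST API can stop and start the service. A rough sketch, assuming default admin credentials, Ambari listening on port 8080, and placeholders AMBARI_HOST and CLUSTER_NAME that you substitute for your environment:
Stop HDFS (state INSTALLED means stopped):
# curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop HDFS"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER_NAME/services/HDFS
Start HDFS again:
# curl -u admin:admin -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Start HDFS"},"Body":{"ServiceInfo":{"state":"STARTED"}}}' http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER_NAME/services/HDFS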
Hope that helps
Created 12-05-2017 08:09 AM
I had tried hadoop namenode -format before, but I tried it again and received the same exception:
17/12/05 09:46:25 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 2 exceptions thrown:
10.0.109.11:8485: Directory /hadoop/hdfs/journal/testnamespace is in an inconsistent state: Can't format the storage directory because the current directory is not empty.
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.checkEmptyCurrent(Storage.java:482)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:558)
at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:185)
at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:145)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
This time I additionally deleted the content of /hadoop/hdfs/journal/testnamespace, but nothing changed. The command ended with the same exception.
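For clarity, on that JournalNode host this amounted to roughly:
# rm -rf /hadoop/hdfs/journal/testnamespace/*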
Created 12-05-2017 11:03 AM
Can you delete the entry in ZooKeeper and restart?
# locate zkCli.sh
/usr/hdp/2.x.x.x/zookeeper/bin/zkCli.sh
# /usr/hdp/2.x.x.x/zookeeper/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 8] ls /hadoop-ha/
You should see something like
[zk: localhost:2181(CONNECTED) 8] ls /hadoop-ha/xxxxx
Delete the HDFS HA config entry
[zk: localhost:2181(CONNECTED) 1] rmr /hadoop-ha
Validate that there is no hadoop-ha entry
[zk: localhost:2181(CONNECTED) 2] ls /
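As a side note, if you prefer not to use the interactive shell, zkCli.sh also accepts the command on its command line; a sketch, assuming ZooKeeper is listening on localhost:2181:
# /usr/hdp/2.x.x.x/zookeeper/bin/zkCli.sh -server localhost:2181 rmr /hadoop-ha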
Then restart all the components of the HDFS service. This will create a new ZNode with the correct lock (of the failover controller).
Please let me know if that helped.
Created 12-05-2017 02:24 PM
Unfortunately, I couldn't start the HDFS services this way either. Thank you very much, though.