
Cannot start HDFS after its data was deleted externally

Expert Contributor

Hello,

After a mass disk operation on our test environment, we lost all the data in the /data dir, which was assigned as the storage directory for ZooKeeper, Hadoop and Falcon (that is the list we know of so far).

Since it was our test cluster, the data is not important, but I don't want to reinstall all the components. I also want to learn how to get the cluster running again from this state.

In the /data dir we now have only folders, no files.

After struggling a little with the ZKFailoverController, I was able to start it with the -formatZK flag.
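For reference, a minimal sketch of that step, assuming a standard HDP layout; the format command is run as the hdfs user on a NameNode host while ZooKeeper is reachable:

# su - hdfs
$ hdfs zkfc -formatZK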

Now, however, I am unable to start the NameNode(s); I get the exception below:

10.0.109.12:8485: Directory /hadoop/hdfs/journal/testnamespace is in an inconsistent state: Can't format the storage directory because the current directory is not empty.

I have tried:

- removing the lost+found folder on the mount root,

- changing the ownership of all folders under /data/hadoop/hdfs to hdfs:hadoop,

- changing the permissions of all folders under /data/hadoop/hdfs to 777.

PS: I have updated the ownership of /hadoop/hdfs/, which contains the journal folder, and that moved me one step forward:

17/12/01 14:20:26 ERROR namenode.NameNode: Failed to start namenode. java.io.IOException: Cannot remove current directory: /data/hadoop/hdfs/namenode/current

PS: I have removed the contents of /data/hadoop/hdfs/namenode/current, and now it keeps retrying port 8485 on all of the JournalNode quorum nodes:

17/12/01 16:04:35 INFO ipc.Client: Retrying connect to server: bigdata2/10.0.109.11:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)

and it keeps printing the line below in the hadoop-hdfs-zkfc-bigdata2.out file:

Proceed formatting /hadoop-ha/testnamespace? (Y or N) Invalid input:
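A couple of hedged notes on the output above: the endless retries on port 8485 suggest the JournalNode processes themselves are not running or not reachable, which can be checked on a JournalNode host with something like:

# ps -ef | grep -i journalnode
# netstat -tlnp | grep 8485

And the repeating "Invalid input:" line appears to be the formatZK confirmation prompt failing to read a Y/N answer from the console; the non-interactive form (shown with -force later in this thread) is:

$ hdfs zkfc -formatZK -force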

Do you have any suggestions?

Or should I give up?

7 REPLIES

Expert Contributor

@Sedat Kestepe

Since you don't care about the data, from an HDFS perspective it is easier to reinstall your cluster. If you insist, I can walk you through the recovery steps, but if I were you I would just reinstall at this point.

Expert Contributor

If the recovery steps will take longer than a reinstall and/or leave me with an unstable cluster, then it's better to reinstall.

From your answer, I gather those are the kinds of costs you mean, right?

Expert Contributor

So... what are the steps for a reinstall?

Is there any way to start over from just the HDP installation, while keeping the OS-level prerequisite changes and the Ambari installation?

Does the ambari-server reset command work for that?

Master Mentor

@Sedat Kestepe

Stop the HDFS service if it's running.

Start only the JournalNodes (as they will need to be made aware of the formatting).

On the namenode (as user hdfs)

# su - hdfs 

Format the namenode

$ hadoop namenode -format 

Initialize the Edits (for the journal nodes)

$ hdfs namenode -initializeSharedEdits -force 

Format Zookeeper (to force zookeeper to reinitialise)

$ hdfs zkfc -formatZK -force 

Using Ambari, restart the NameNode.

If you are running an HA NameNode, then:

On the second NameNode, sync (force sync with the first NameNode):

$ hdfs namenode -bootstrapStandby -force 

On every DataNode, clear the data directory (already done in your case).
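A hedged sketch of that step, assuming the DataNode data directory sits under /data/hadoop/hdfs/data (check dfs.datanode.data.dir in hdfs-site.xml for the actual path on your hosts), run while the DataNode is stopped:

# rm -rf /data/hadoop/hdfs/data/*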

Restart the HDFS service

Hope that helps

Expert Contributor

Hi @Geoffrey Shelton Okot,

I had tried hadoop namenode -format before, but I tried it again and received the same exception:

17/12/05 09:46:25 ERROR namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Could not format one or more JournalNodes. 2 exceptions thrown:
10.0.109.11:8485: Directory /hadoop/hdfs/journal/testnamespace is in an inconsistent state: Can't format the storage directory because the current directory is not empty.
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.checkEmptyCurrent(Storage.java:482)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:558)
    at org.apache.hadoop.hdfs.qjournal.server.JNStorage.format(JNStorage.java:185)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.format(Journal.java:217)
    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.format(JournalNodeRpcServer.java:145)
    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.format(QJournalProtocolServerSideTranslatorPB.java:145)
    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25419)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

This time I additionally deleted the contents of /hadoop/hdfs/journal/testnamespace, but nothing changed; the command ended with the same exception.
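A hedged note for anyone hitting the same error: the exception above is raised by the remote JournalNodes (10.0.109.11 here, 10.0.109.12 in the earlier attempt), so the journal storage directory likely has to be cleared on every JournalNode host in the quorum, not just on one. A sketch, using the journal path from the logs, run on each JournalNode host while its JournalNode is stopped:

# rm -rf /hadoop/hdfs/journal/testnamespace/current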

Master Mentor

@Sedat Kestepe

Can you delete the entry in ZooKeeper and restart?

# locate zkCli.sh
/usr/hdp/2.x.x.x/zookeeper/bin/zkCli.sh
# /usr/hdp/2.x.x.x/zookeeper/bin/zkCli.sh 
[zk: localhost:2181(CONNECTED) 8] ls /hadoop-ha/ 

You should see something like

[zk: localhost:2181(CONNECTED) 8] ls /hadoop-ha/xxxxx 

Delete the HDFS HA config entry:

[zk: localhost:2181(CONNECTED) 1] rmr /hadoop-ha 

Validate that there is no hadoop-ha entry:

[zk: localhost:2181(CONNECTED) 2] ls / 

Then restart all of the HDFS service components. This will create a new ZNode with the correct lock (from the failover controller).
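As a quick check afterwards, a hedged sketch; the NameNode IDs below are placeholders, so use the ones defined in your hdfs-site.xml:

$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2

One should report active and the other standby once the failover controllers have settled.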

Please let me know if that helped.

Expert Contributor

@Geoffrey Shelton Okot

Unfortunately, I couldn't start the HDFS services this way either. Thank you very much, though.