
Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data0/dfs/nn

Expert Contributor

After attempting a large "insert as select" operation, I returned this morning to find that the query had failed and that I could not issue any commands to the cluster (e.g. hdfs dfs -df -h).

 

When logging into CM, I noticed that most nodes had a health issue related to "clock offset".
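
For anyone else seeing the same symptom: a quick way to check how far a node's clock has drifted, assuming the hosts run ntpd (which is what the CM host health check queries), is something like the following.

# Show NTP peer status; the "offset" column (in ms) for the selected peer
# (marked with *) is the drift that triggers the CM "clock offset" alert.
ntpq -p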

 

At this point, I am only concerned about trying to recover the data on HDFS.  I am happy to build a new cluster (given that I am on CDH4, anyway) and migrate the data to that new cluster.

 

I tried to restart the cluster, but the start-up step failed.  Specifically, it failed to start the HDFS service and reported this error in Log Details:

 

Exception in namenode join
java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data0/dfs/nn state: NOT_FORMATTED
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:295)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:207)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:445)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)

 

Below are some more details that I have gathered about the situation.

 

  • I am running CDH4
  • There are two NameNodes in the cluster.  One reports the errors above, and the other reports:
    Unable to trigger a roll of the active NN
    java.net.ConnectException: Call From ip-10-0-0-154.ec2.internal/10.0.0.154 to ip-10-0-0-157.ec2.internal:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
  • If I log into the first NameNode (the one with the initial error) and look at its name directories, they are completely empty (see the note after this list):
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data0/dfs/nn/
.  ..
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data1/dfs/nn/
.  ..
  • If I log into the other NameNode, those directories do contain data:
ubuntu@ip-10-0-0-154:~$ sudo ls -lah  /data0/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah  /data1/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah  /data0/dfs/nn/current
total 13M
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 .
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 ..
-rw-r--r-- 1 hdfs hdfs    697 Jun  6  2015 edits_0000000000000000001-0000000000000000013
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_0000000000000000014-0000000000000000913
-rw-r--r-- 1 hdfs hdfs    549 Jun  6  2015 edits_0000000000000000914-0000000000000000923
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000924-0000000000000000937
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000938-0000000000000000951
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000952-0000000000000000965
-rw-r--r-- 1 hdfs hdfs   1.8K Jun  6  2015 edits_0000000000000000966-0000000000000000987
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000988-0000000000000001001
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001002-0000000000000001015
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001016-0000000000000001029
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001030-0000000000000001043
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001044-0000000000000001057
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001058-0000000000000001071
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001072-0000000000000001085
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001086-0000000000000001099
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001100-0000000000000001113
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001114-0000000000000001127
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001128-0000000000000001141
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001142-0000000000000001155
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001156-0000000000000001169
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_inprogress_0000000000000001170
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 08:49 fsimage_0000000000024545561
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 08:49 fsimage_0000000000024545561.md5
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 09:49 fsimage_0000000000024545645
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 09:49 fsimage_0000000000024545645.md5
-rw-r--r-- 1 hdfs hdfs      5 Jun  6  2015 seen_txid
-rw-r--r-- 1 hdfs hdfs    170 Nov  2 09:49 VERSION
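
In case it helps anyone reading along: since the second NameNode still has populated name directories, my understanding is that the usual way to re-seed the empty one is to bring the healthy NameNode up as active and then run bootstrapStandby on the empty node. I have not tried this yet, so treat it only as a sketch:

# Run on the node with the empty name dirs (ip-10-0-0-157 here), as the hdfs user,
# once the other NameNode (ip-10-0-0-154) is up and active. It pulls the latest
# fsimage from the active NameNode into /data0/dfs/nn and /data1/dfs/nn.
sudo -u hdfs hdfs namenode -bootstrapStandby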
1 ACCEPTED SOLUTION

Expert Contributor

I continued the resolution of this issue in another thread specific to the error: 

 

ls: Operation category READ is not supported in state standby

 

The solution is marked on that thread; in short, I needed to add the Failover Controller role to a node in my cluster, enable Automatic Failover, and then restart the cluster for the changes to take effect.
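
To double-check that the failover configuration took effect after the restart, the state of each NameNode can be queried from the command line. nn1 and nn2 below are placeholder service IDs; the real ones are whatever dfs.ha.namenodes.<nameservice> lists in hdfs-site.xml.

# One NameNode should report "active" and the other "standby".
sudo -u hdfs hdfs haadmin -getServiceState nn1
sudo -u hdfs hdfs haadmin -getServiceState nn2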


11 Replies

Expert Contributor

Based on this thread, it seems like the following command may be an option.  I will wait for further guidance, though.

 

./hdfs haadmin -transitionToActive <nodename>
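
If I understand the syntax correctly, the argument is the NameNode's service ID (from dfs.ha.namenodes.<nameservice> in hdfs-site.xml) rather than the hostname, and it would make sense to check the current states first. A hypothetical example, with nn1 standing in for whichever ID maps to the healthy NameNode:

# Check the current state, then manually promote the healthy NameNode.
sudo -u hdfs hdfs haadmin -getServiceState nn1
sudo -u hdfs hdfs haadmin -transitionToActive nn1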

 

 
