Created on 11-02-2017 09:30 AM - edited 11-02-2017 03:17 PM
After attempting a large "insert as select" operation, I returned this morning to find that the query had failed and that I could no longer issue any commands against the cluster (e.g. hdfs dfs -df -h).
When logging into CM, I noticed that most nodes had a health issue related to "clock offset".
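For anyone checking the same symptom: one quick way to see whether the hosts' clocks really have drifted (assuming ntpd is in use on these hosts) is something like:

# Show each NTP peer and the current offset of this host's clock
ntpq -p

# Or do a one-off comparison against a reference server
ntpdate -q pool.ntp.org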
At this point, I am only concerned about trying to recover the data on HDFS. I am happy to build a new cluster (given that I am on CDH4, anyway) and migrate the data to that new cluster.
I tried to restart the cluster but the start-up step failed. Specifically, it failed to start the HDFS service and reported this error in Log Details:
Exception in namenode join
java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data0/dfs/nn state: NOT_FORMATTED
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:295)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:207)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:445)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)
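As I understand it, the NOT_FORMATTED state means the NameNode could not find the metadata it expects in that directory. A quick sanity check (paths taken from the error above) is to look for the VERSION file that a formatted name directory normally contains under current/:

# A formatted name directory should have a current/ subdirectory with a VERSION file
sudo ls -l /data0/dfs/nn/current/VERSION
sudo cat /data0/dfs/nn/current/VERSION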
Below are some more details that I have gathered about the situation.
Unable to trigger a roll of the active NN
java.net.ConnectException: Call From ip-10-0-0-154.ec2.internal/10.0.0.154 to ip-10-0-0-157.ec2.internal:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
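To rule out a plain networking problem, I believe something like the following would show whether anything is even listening on the NameNode service RPC port (8022 here); nc is just one option:

# From ip-10-0-0-154, test whether the NN service RPC port on the other node is reachable
nc -zv ip-10-0-0-157.ec2.internal 8022

# On ip-10-0-0-157 itself, check whether a NameNode process is listening on that port
sudo netstat -tlnp | grep 8022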
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data0/dfs/nn/
.  ..
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data1/dfs/nn/
.  ..
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data1/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/current
total 13M
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 .
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 ..
-rw-r--r-- 1 hdfs hdfs    697 Jun  6  2015 edits_0000000000000000001-0000000000000000013
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_0000000000000000014-0000000000000000913
-rw-r--r-- 1 hdfs hdfs    549 Jun  6  2015 edits_0000000000000000914-0000000000000000923
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000924-0000000000000000937
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000938-0000000000000000951
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000952-0000000000000000965
-rw-r--r-- 1 hdfs hdfs   1.8K Jun  6  2015 edits_0000000000000000966-0000000000000000987
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000988-0000000000000001001
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001002-0000000000000001015
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001016-0000000000000001029
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001030-0000000000000001043
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001044-0000000000000001057
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001058-0000000000000001071
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001072-0000000000000001085
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001086-0000000000000001099
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001100-0000000000000001113
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001114-0000000000000001127
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001128-0000000000000001141
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001142-0000000000000001155
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001156-0000000000000001169
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_inprogress_0000000000000001170
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 08:49 fsimage_0000000000024545561
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 08:49 fsimage_0000000000024545561.md5
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 09:49 fsimage_0000000000024545645
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 09:49 fsimage_0000000000024545645.md5
-rw-r--r-- 1 hdfs hdfs      5 Jun  6  2015 seen_txid
-rw-r--r-- 1 hdfs hdfs    170 Nov  2 09:49 VERSION
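So the name directories on ip-10-0-0-157 are completely empty, while ip-10-0-0-154 still has recent fsimage files and edit logs. If the NameNode on -154 can be brought up as the active one, my understanding is that the empty standby can usually be re-initialized from it with bootstrapStandby (a sketch only; I have not run this yet on this cluster):

# On the node with the empty name dirs (ip-10-0-0-157), once the other NameNode is up and active
sudo -u hdfs hdfs namenode -bootstrapStandby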
Created 11-13-2017 11:31 AM
I continued the resolution of this issue in another thread specific to the error:
ls: Operation category READ is not supported in state standby
The solution is marked on that thread; in short, I needed to add the Failover Controller role to a node in my cluster, enable Automatic Failover, and then restart the cluster for it all to take effect.
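If it helps anyone, a way to confirm afterwards that the HA setup is actually healthy is to ask each NameNode for its state from the command line (nn1/nn2 below are placeholder NameNode IDs; the real ones are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

# Report the HA state (active/standby) of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2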
Created 11-03-2017 11:38 AM
Based on this thread, it seems like the following command may be an option. I will wait for further guidance, though.
./hdfs haadmin -transitionToActive <nodename>
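If I'm reading the docs correctly, the argument is the NameNode ID from dfs.ha.namenodes.<nameservice> rather than a hostname, and when automatic failover is enabled the command refuses to run without --forcemanual. A hedged sketch (nn1 is a placeholder ID):

# Manually promote nn1 to active; --forcemanual is required when ZKFC/automatic failover is configured
hdfs haadmin -transitionToActive --forcemanual nn1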