Created on 11-02-2017 09:30 AM - edited 11-02-2017 03:17 PM
After attempting a large "insert as select" operation, I returned this morning to find that the query had failed and that I could no longer issue any commands against the cluster (e.g. hdfs dfs -df -h).
When logging into CM, I noticed that most nodes had a health issue related to "clock offset".
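For anyone checking the same symptom: one quick way to see whether the hosts' clocks really have drifted (assuming ntpd is in use on these hosts) is something like:

# Show each NTP peer and the current offset of this host's clock
ntpq -p

# Or do a one-off comparison against a reference server
ntpdate -q pool.ntp.org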
At this point, I am only concerned about trying to recover the data on HDFS. I am happy to build a new cluster (given that I am on CDH4, anyway) and migrate the data to that new cluster.
I tried to restart the cluster but the start-up step failed. Specifically, it failed to start the HDFS service and reported this error in Log Details:
Exception in namenode join
java.io.IOException: Cannot start an HA namenode with name dirs that need recovery. Dir: Storage Directory /data0/dfs/nn state: NOT_FORMATTED
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:295)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:207)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:531)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:445)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:621)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:606)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1177)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1241)
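As I understand it, the NOT_FORMATTED state means the NameNode could not find the metadata it expects in that directory. A quick sanity check (paths taken from the error above) is to look for the VERSION file that a formatted name directory normally contains under current/:

# A formatted name directory should have a current/ subdirectory with a VERSION file
sudo ls -l /data0/dfs/nn/current/VERSION
sudo cat /data0/dfs/nn/current/VERSION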
Below are some more details that I have gathered about the situation.
Unable to trigger a roll of the active NN
java.net.ConnectException: Call From ip-10-0-0-154.ec2.internal/10.0.0.154 to ip-10-0-0-157.ec2.internal:8022 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
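To rule out a plain networking problem, I believe something like the following would show whether anything is even listening on the NameNode service RPC port (8022 here); nc is just one option:

# From ip-10-0-0-154, test whether the NN service RPC port on the other node is reachable
nc -zv ip-10-0-0-157.ec2.internal 8022

# On ip-10-0-0-157 itself, check whether a NameNode process is listening on that port
sudo netstat -tlnp | grep 8022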
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data0/dfs/nn/
.  ..
ubuntu@ip-10-0-0-157:~$ sudo ls -a /data1/dfs/nn/
.  ..
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data1/dfs/nn/
total 12K
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 .
drwxr-xr-x 3 root root   4.0K Jun  6  2015 ..
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 current
ubuntu@ip-10-0-0-154:~$ sudo ls -lah /data0/dfs/nn/current
total 13M
drwxr-xr-x 2 hdfs hdfs   4.0K Nov  2 09:49 .
drwx------ 3 hdfs hadoop 4.0K Nov  2 22:20 ..
-rw-r--r-- 1 hdfs hdfs    697 Jun  6  2015 edits_0000000000000000001-0000000000000000013
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_0000000000000000014-0000000000000000913
-rw-r--r-- 1 hdfs hdfs    549 Jun  6  2015 edits_0000000000000000914-0000000000000000923
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000924-0000000000000000937
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000938-0000000000000000951
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000952-0000000000000000965
-rw-r--r-- 1 hdfs hdfs   1.8K Jun  6  2015 edits_0000000000000000966-0000000000000000987
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000000988-0000000000000001001
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001002-0000000000000001015
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001016-0000000000000001029
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001030-0000000000000001043
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001044-0000000000000001057
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001058-0000000000000001071
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001072-0000000000000001085
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001086-0000000000000001099
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001100-0000000000000001113
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001114-0000000000000001127
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001128-0000000000000001141
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001142-0000000000000001155
-rw-r--r-- 1 hdfs hdfs   1.3K Jun  6  2015 edits_0000000000000001156-0000000000000001169
-rw-r--r-- 1 hdfs hdfs   1.0M Jun  6  2015 edits_inprogress_0000000000000001170
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 08:49 fsimage_0000000000024545561
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 08:49 fsimage_0000000000024545561.md5
-rw-r--r-- 1 hdfs hdfs   5.1M Nov  2 09:49 fsimage_0000000000024545645
-rw-r--r-- 1 hdfs hdfs     62 Nov  2 09:49 fsimage_0000000000024545645.md5
-rw-r--r-- 1 hdfs hdfs      5 Jun  6  2015 seen_txid
-rw-r--r-- 1 hdfs hdfs    170 Nov  2 09:49 VERSION
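So the name directories on ip-10-0-0-157 are completely empty, while ip-10-0-0-154 still has recent fsimage files and edit logs. If the NameNode on -154 can be brought up as the active one, my understanding is that the empty standby can usually be re-initialized from it with bootstrapStandby (a sketch only; I have not run this yet on this cluster):

# On the node with the empty name dirs (ip-10-0-0-157), once the other NameNode is up and active
sudo -u hdfs hdfs namenode -bootstrapStandby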
Created 11-13-2017 11:31 AM
I continued the resolution of this issue in another thread specific to the error:
ls: Operation category READ is not supported in state standby
The solution is marked on that thread; in short, I needed to add the Failover Controller role to a node in my cluster, enable Automatic Failover, and then restart the cluster for it all to take effect.
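If it helps anyone, a way to confirm afterwards that the HA setup is actually healthy is to ask each NameNode for its state from the command line (nn1/nn2 below are placeholder NameNode IDs; the real ones are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

# Report the HA state (active/standby) of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2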
Created 11-03-2017 11:38 AM
Based on this thread, it seems like the following command may be an option. I will wait for further guidance, though.
./hdfs haadmin -transitionToActive <nodename>
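If I'm reading the docs correctly, the argument is the NameNode ID from dfs.ha.namenodes.<nameservice> rather than a hostname, and when automatic failover is enabled the command refuses to run without --forcemanual. A hedged sketch (nn1 is a placeholder ID):

# Manually promote nn1 to active; --forcemanual is required when ZKFC/automatic failover is configured
hdfs haadmin -transitionToActive --forcemanual nn1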