Created 11-07-2017 08:37 AM
I currently have one namenode in a 'stopped' state due to a node failure. I am unable to access any data or services on the cluster, as this was the main namenode.
However, there is a second namenode that I am hoping can be used to recover. I have been working on the issue in this thread and currently have all HDFS instances started except for the bad namenode. This seems to have improved the node health status, but I still can't access any data.
Here is the relevant command and error:
ubuntu@ip-10-0-0-154:~/backup/data1$ hdfs dfs -ls hdfs://10.0.0.154:8020/
ls: Operation category READ is not supported in state standby
In the previous thread, I also pointed out that there was the option to enable Automatic Failover in CM. I am wondering if that is the best course of action right now. Any help is greatly appreciated.
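For reference, a quick way to see what state each namenode reports is hdfs haadmin; the nn1/nn2 service IDs below are just placeholders for whatever dfs.ha.namenodes.<nameservice> is set to in hdfs-site.xml:

# Ask each NameNode for its current HA state; expected output is "active" or "standby"
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2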
Created on 11-07-2017 08:01 PM - edited 11-07-2017 08:03 PM
The issue might be related to the JIRA below, which was opened a long time ago and is still in open status:
https://issues.apache.org/jira/browse/HDFS-3447
As an alternate way to connect to HDFS, go to hdfs-site.xml, get the value of dfs.nameservices, and try connecting using the nameservice as follows; it may help you:
hdfs://<ClusterName>-ns/<hdfs_path>
Note: I didn't get a chance to explore this... also not sure how it will behave on an older CDH version.
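If it helps, the nameservice name can also be pulled straight from the client configuration with hdfs getconf instead of grepping hdfs-site.xml by hand (the nameservice1 value below is just an example of what it might return):

# Print the configured nameservice(s), then list the HDFS root through it
hdfs getconf -confKey dfs.nameservices
hdfs dfs -ls hdfs://nameservice1/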
Created on 11-07-2017 08:24 PM - edited 11-07-2017 08:28 PM
Thank you for your response.
I followed your advice but I am getting the error shown below. This is the same error I get when I try a plain 'hdfs dfs -ls' command.
root@ip-10-0-0-154:/home/ubuntu/backup/data1# grep -B 1 -A 2 nameservices /var/run/cloudera-scm-agent/process/9908-hdfs-NAMENODE/hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>nameservice1</value>
</property>

ubuntu@ip-10-0-0-154:~/backup/data1$ hdfs dfs -ls hdfs://nameservice1/
17/11/08 04:29:50 WARN retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB after 1 fail over attempts. Trying to fail over after sleeping for 796ms.
Also, I should mention that when I go to CM, it shows that my one good namenode is in 'standby'. Would it help to try a command like this?
./hdfs haadmin -transitionToActive <nodename>
A second thing is that CM shows Automatic Failover is not enabled but there is a link to 'Enable' (see screenshot). Maybe this is another option to help the standby node get promoted to active?
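In case it is useful, this is roughly the form I understand that command takes; nn1 below is a placeholder for whichever service ID in dfs.ha.namenodes.nameservice1 maps to the good namenode, and from what I have read the command will refuse to run without --forcemanual once automatic failover is enabled:

# Look up the NameNode service IDs, then promote the healthy one and verify
hdfs getconf -confKey dfs.ha.namenodes.nameservice1
hdfs haadmin -transitionToActive nn1
hdfs haadmin -getServiceState nn1    # should now report "active"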
Created 11-07-2017 08:56 PM
I do not know how to check whether the Failover Controller daemon is running on the remaining NameNode.
Can you please tell me how to check?
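Is it just a matter of looking for the DFSZKFailoverController process on each namenode host, e.g. something like the below, or is there a place in CM to check?

# Look for the ZK Failover Controller process on a namenode host
ps -ef | grep -i DFSZKFailoverController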
Created 11-08-2017 08:51 AM
It appears I do not have any nodes with the Failover Controller role. The screenshot below shows the hdfs instances filtered by that role.
Created 11-13-2017 11:27 AM
As noted in the previous reply, I did not have any nodes with the Failover Controller role. Importantly, I also had not enabled Automatic Failover despite running in an HA configuration.
I went ahead and added the Failover Controller role to both namenodes - the good one and the bad one.
After that, I attempted to enable Automatic Failover using the link shown in the screenshot from this post. To do that, however, I needed to start ZooKeeper first.
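(For what it is worth, my understanding is that the CM 'Enable Automatic Failover' step initializes the HA election state in ZooKeeper, which is why ZooKeeper has to be up first; the rough command-line equivalent would be something like the following, although CM handled this for me.)

# Initialize the HA state znode in ZooKeeper for automatic failover (run on a NameNode host)
hdfs zkfc -formatZK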
At that point, if I recall correctly, the good namenode was still not active. I then restarted the entire cluster and automatic failover kicked in, making the good namenode the active one and leaving the bad namenode in a stopped state.
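As a sanity check, the election state should also be visible in ZooKeeper under /hadoop-ha/<nameservice> (nameservice1 in my case); something like this should list the ActiveBreadCrumb and ActiveStandbyElectorLock znodes:

# List the HA election znodes for the nameservice
zookeeper-client -server localhost:2181 ls /hadoop-ha/nameservice1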