NameNode HA Ambari Display Issue

Rising Star

Greetings!

So it seems that either my configuration is wrong or Ambari 2.2.2.1 has a refresh issue?

Basically, I'm running a cluster with a high-availability NameNode (NN) configuration.

When the NN fails (for reasons still unknown), the SNN (here, the second HA NameNode) becomes the active node as expected, and the NN goes into standby once its service is restarted.

I can confirm the failover succeeded by running hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2: from that point on, nn1 reports Standby and nn2 reports Active.
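The same check can be scripted. Below is a minimal Python sketch, assuming the hdfs CLI is on the PATH and that the NameNode IDs are nn1 and nn2 as in this post (substitute the IDs listed in dfs.ha.namenodes.<nameservice>). The helper names parse_service_state and get_service_states are hypothetical.

```python
import subprocess
from typing import Dict, Iterable


def parse_service_state(output: str) -> str:
    """Normalize the text printed by `hdfs haadmin -getServiceState` (trims whitespace, lowercases)."""
    return output.strip().lower()


def get_service_states(namenode_ids: Iterable[str] = ("nn1", "nn2")) -> Dict[str, str]:
    """Run `hdfs haadmin -getServiceState <id>` for each NameNode ID and collect the states."""
    states = {}
    for nn_id in namenode_ids:
        result = subprocess.run(
            ["hdfs", "haadmin", "-getServiceState", nn_id],
            capture_output=True,
            text=True,
            check=False,
        )
        # A non-zero exit code usually means that NameNode could not be contacted.
        if result.returncode == 0:
            states[nn_id] = parse_service_state(result.stdout)
        else:
            states[nn_id] = "unreachable"
    return states
```

After the failover described above, get_service_states() should return something like {'nn1': 'standby', 'nn2': 'active'}, which is the backend view that Ambari is supposed to mirror.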

The funky part, however, is that Ambari marks both NameNodes as Active even though the backend failed over; Ambari should report NN as Standby and SNN as Active.

So the DFS can still be written to with the usual hdfs dfs -put test.log <path>/test.log.

Now, to force Ambari to refresh the status, I run the following command:

echo N | hdfs haadmin -transitionToStandby --forcemanual nn2

After that, nn2 is marked as Standby, nn1 becomes Active, Ambari refreshes to display NN as Active and SNN as Standby, and the world is happy.....

So from a sysadmin perspective, I can write data to the filesystem, I'm happy, and I consider this an Ambari bug. From my programmer colleague's perspective, however, it causes havoc, as he can't write/read/modify the filesystem from Java/API/hdfs://url.

Is this a known issue? Expected behaviour? And last but not least, what defines the hdfs://url value? Is there an additional parameter to add to that URL to make it refresh?
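For context on the hdfs://url question: with NameNode HA, an HA-aware client is normally pointed at the logical nameservice URI, hdfs://<nameservice>, rather than at one NameNode's host:port, and it relies on the failover proxy provider configured for that nameservice to find whichever NameNode is active. The minimal Python sketch below checks a flat dict of client-side hdfs-site.xml properties for the keys that mechanism depends on. The function name and the nameservice mycluster are hypothetical; the property names are the standard Hadoop HA ones.

```python
from typing import Dict, List


def missing_ha_client_keys(props: Dict[str, str], nameservice: str) -> List[str]:
    """List the HA client properties that hdfs://<nameservice> depends on but `props` lacks."""
    missing = []
    # The nameservice must be declared in dfs.nameservices (comma-separated).
    declared = [ns.strip() for ns in props.get("dfs.nameservices", "").split(",")]
    if nameservice not in declared:
        missing.append("dfs.nameservices")
    # It must list at least two NameNode IDs, e.g. "nn1,nn2".
    ids_key = f"dfs.ha.namenodes.{nameservice}"
    nn_ids = [nn.strip() for nn in props.get(ids_key, "").split(",") if nn.strip()]
    if len(nn_ids) < 2:
        missing.append(ids_key)
    # Each NameNode ID needs an RPC address the client can try.
    for nn_id in nn_ids:
        rpc_key = f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"
        if rpc_key not in props:
            missing.append(rpc_key)
    # The proxy provider class is what lets the client move to the other NameNode.
    provider_key = f"dfs.client.failover.proxy.provider.{nameservice}"
    if provider_key not in props:
        missing.append(provider_key)
    return missing
```

If this returns an empty list, a Java client can open paths such as hdfs://mycluster/tmp/test.log without naming a specific NameNode host; fs.defaultFS in core-site.xml is commonly set to the same logical URI.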

thanks!

Eric

1 ACCEPTED SOLUTION

@Eric Periard, there is a known issue right now in the way Ambari determines HA status for the NameNodes. Ambari uses a JMX query to each NameNode. The current implementation of that query fetches more data than is strictly necessary for checking HA status, and this can cause delays in processing that query. The symptom of this is that the Ambari UI will misreport the active/standby status of the NameNodes as you described. The problem is intermittent, so a browser refresh is likely to show correct behavior. There is a fix in development now for Ambari to use a lighter-weight JMX query that won't be as prone to this problem.

This does not indicate a problem with the health of HDFS. As you noted, users are still able to read and write files. The problem is limited to the reporting of HA status displayed in Ambari.
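To illustrate what a lighter-weight JMX query can look like (this is not necessarily the exact query Ambari or its fix uses), the NameNode's /jmx endpoint accepts a qry parameter that returns a single bean instead of every registered bean. The minimal Python sketch below assumes the Hadoop 2.x default NameNode HTTP port 50070, no Kerberos/SPNEGO on the web UI, and the NameNodeStatus bean exposing its HA state in a "State" attribute; the function names are hypothetical.

```python
import json
import urllib.request

# Ask for one bean only, rather than the full /jmx dump.
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"


def parse_namenode_state(jmx_body: str) -> str:
    """Extract the HA state (e.g. 'active' or 'standby') from a NameNodeStatus JMX response."""
    beans = json.loads(jmx_body).get("beans", [])
    if not beans:
        raise ValueError("NameNodeStatus bean not found in JMX response")
    return beans[0]["State"]


def fetch_namenode_state(host: str, port: int = 50070, timeout: float = 5.0) -> str:
    """Query one NameNode's HTTP endpoint and return its reported HA state."""
    url = f"http://{host}:{port}{JMX_QUERY}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_namenode_state(resp.read().decode("utf-8"))
```

Comparing fetch_namenode_state() for both NameNode hosts against what the Ambari UI shows is one way to confirm that only the display is wrong.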

Rising Star

Good to know that I'm not going crazy then....

I have a feeling this is related?

https://issues.apache.org/jira/browse/AMBARI-15235

It's mentioned as partially fixed in 2.2.2? I'm still on 2.2.1.1.

@Eric Periard, no, you are not going crazy. 🙂

You're correct that JIRA issue AMBARI-15235 is related. That's a change that helps on the display side. AMBARI-17603 is another patch that gets more at the root cause of the problem by optimizing the JMX query.

Rising Star

Has there been any progress on that issue? I've tried so many approaches that I've resorted to writing this script, which checks the node status every minute...

[Attached image: 5724-failover-script.png]
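The script itself is only visible in the attached image, so the following is a hypothetical stand-in rather than a reconstruction of it: a minimal Python sketch of a once-a-minute check that alerts unless exactly one NameNode reports active. fetch_states is an assumed callable returning {namenode_id: state}, for example built on hdfs haadmin -getServiceState or a JMX query; the other names are hypothetical too.

```python
import time
from typing import Callable, Dict, Optional


def assess_ha(states: Dict[str, str]) -> str:
    """Classify one snapshot of NameNode states keyed by NameNode ID."""
    active = sorted(nn for nn, state in states.items() if state == "active")
    if len(active) == 1:
        return f"ok: {active[0]} is active"
    if not active:
        return "alert: no active NameNode"
    return "alert: multiple NameNodes report active: " + ", ".join(active)


def watch(
    fetch_states: Callable[[], Dict[str, str]],
    interval_s: float = 60.0,
    rounds: Optional[int] = None,
    report: Callable[[str], None] = print,
) -> None:
    """Poll every interval_s seconds; run forever when rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        try:
            message = assess_ha(fetch_states())
        except Exception as exc:  # e.g. a NameNode that cannot be reached
            message = f"alert: status check failed: {exc}"
        report(message)
        done += 1
        # Sleep only if another round will follow.
        if rounds is None or done < rounds:
            time.sleep(interval_s)
```

Passing a different report callable (a logger or mail hook instead of print) keeps the polling loop separate from how alerts are delivered.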