We are receiving transient NameNode High Availability Status alerts that report an unknown standby host and say the active NameNode cannot be determined.
In the logs, just before the error, the script reports "unable to extract JSON from JMX Response".
The issue resolves itself automatically. We have the default alert grace setting of up to 5 seconds. Please help us find the exact reason for this. There is no further information in the NameNode logs; we only see these errors in the ambari-alerts and agent files.
Ambari uses the following kind of script to make a JMX call to the NameNode MBeans in order to determine the "HAState" of the Active and Standby NameNodes.
MBean detail used to determine the HAState (an example of the logic the script uses):
# curl -s http://$NAMENODE_HOSTNAME:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem | grep 'tag.HAState'
"tag.HAState" : "active",
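The same check can be sketched in Python. This is only an illustration of the idea, not Ambari's actual alert script: the function names `extract_ha_state` and `fetch_ha_state` and the 5-second timeout are assumptions made for the example, and the response layout (a top-level "beans" list) is the usual shape of the NameNode /jmx output.

```python
import http.client
import json
import urllib.request

# Bean queried by the curl example above.
FS_NAMESYSTEM_QUERY = "Hadoop:service=NameNode,name=FSNamesystem"


def extract_ha_state(body):
    """Return the tag.HAState value from a JMX response body, or None."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Empty or truncated body: this is the "unable to extract JSON" case.
        return None
    if not isinstance(payload, dict):
        return None
    for bean in payload.get("beans") or []:
        if isinstance(bean, dict) and "tag.HAState" in bean:
            return bean["tag.HAState"]
    return None


def fetch_ha_state(host, port=50070, timeout=5.0):
    """Query one NameNode over HTTP; return None on timeout or bad response.

    The timeout value is a hypothetical default chosen for this sketch.
    """
    url = "http://%s:%d/jmx?qry=%s" % (host, port, FS_NAMESYSTEM_QUERY)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # Connection errors and socket timeouts are OSError subclasses.
        return None
    return extract_ha_state(body)
```

Calling `fetch_ha_state` against each NameNode host should return "active" for one and "standby" for the other; a `None` result corresponds to the state the alert cannot determine.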
** Possible Cause of Such Alert: **
However, the NameNode can sometimes be busy because of a long GC pause, or the JMX call can be delayed by load on the NameNode host or by network slowness. If the response is not received within the specified time, the script sees an empty response, which is not the expected JSON data. The result is an UNKNOWN state or a script failure such as "unable to extract JSON from JMX Response".
** Next Action: **
So if you are getting this alert very frequently, check the NameNode GC logs to verify whether it is experiencing long GC pauses. Also verify whether the NameNode heap is reaching its limit very frequently (if so, tune the NameNode memory),
or whether the NameNode host is overloaded (for example, high CPU or high I/O).
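One way to check heap pressure without reading GC logs is to look at the JVM metrics the NameNode publishes over the same /jmx endpoint. The sketch below assumes the bean `Hadoop:service=NameNode,name=JvmMetrics` is available and exposes `MemHeapUsedM` and `MemHeapMaxM`; the helper name `heap_usage_percent` is hypothetical.

```python
import json


def heap_usage_percent(body):
    """Return heap usage as a percentage from a JvmMetrics JMX body, or None."""
    try:
        beans = json.loads(body).get("beans") or []
    except (AttributeError, TypeError, ValueError):
        # Not JSON, or not the expected {"beans": [...]} layout.
        return None
    for bean in beans:
        if not isinstance(bean, dict):
            continue
        used = bean.get("MemHeapUsedM")
        limit = bean.get("MemHeapMaxM")
        if (isinstance(used, (int, float)) and isinstance(limit, (int, float))
                and limit > 0):
            return 100.0 * used / limit
    return None
```

The body would come from a request such as `/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics`. A value that stays near the maximum across repeated samples points toward the memory tuning suggested above.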
Also look at the following log and search for those failures: /var/log/ambari-server/ambari-alerts.log
Look at the NameNode logs for any symptoms of slowness.
# top
# free -m
# less /var/log/hadoop/hdfs/gc.log-*
# grep -i 'JvmPauseMonitor' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | grep -i WARN
Did you find any such indication on your NameNode, such as a long GC pause or memory reaching its maximum limit (or sometimes 95%+)?
Is there any indication of a long GC pause or high system load on the NameNode hosts? Check at the time the alert triggers:
- # top
- # free -m
- # less /var/log/hadoop/hdfs/gc.log-*
- # grep -i 'JvmPauseMonitor' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | grep -i WARN
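The JvmPauseMonitor grep can also be turned into a small filter that keeps only the long pauses. This is a rough sketch: it assumes the warning lines contain the text "pause of approximately <N>ms", which is how Hadoop's JvmPauseMonitor normally words them, and the 5000 ms default threshold is an assumption chosen to match the 5-second grace period mentioned in the question. The function name `long_pauses` is hypothetical.

```python
import re

# Assumed wording of the JvmPauseMonitor warning line.
PAUSE_PATTERN = re.compile(r"JvmPauseMonitor.*pause of approximately (\d+)ms")


def long_pauses(lines, threshold_ms=5000):
    """Return (pause_ms, line) pairs for WARN entries at or above the threshold."""
    pauses = []
    for line in lines:
        if "WARN" not in line:
            continue
        match = PAUSE_PATTERN.search(line)
        if match is None:
            continue
        pause_ms = int(match.group(1))
        if pause_ms >= threshold_ms:
            pauses.append((pause_ms, line.rstrip("\n")))
    return pauses
```

Passing an open NameNode log file to `long_pauses` gives the pauses whose timestamps can then be compared with the alert times in ambari-alerts.log.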