
Frequent NN Failover


Hi All,

Failovers occur frequently in my cluster (6 nodes on VMs, each with 16 cores / 64 GB RAM / 500 HDD, SAN storage). After each failover the standby NameNode slips into a stopped state and stays there until I start it manually via Ambari. During the operation, stderr shows "Getting JMX metrics from NN failed". I also get intermittent NameNode Last Checkpoint and NameNode High Availability alerts, which show the standby NN in an unknown state. Sometimes the alerts return to OK after a while; otherwise a failover occurs.
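
Since the alerts depend on reading JMX from each NameNode, the same check can be reproduced by hand to see which node stops responding. Below is a minimal sketch, not the actual Ambari alert script. It assumes the Hadoop 2.x default NameNode HTTP port (50070) and the `NameNodeStatus` JMX bean; the function names and any hostnames passed in are placeholders.

```python
import json
from urllib.request import urlopen

# Assumption: the NameNode exposes this bean on its HTTP /jmx endpoint.
JMX_QUERY = "Hadoop:service=NameNode,name=NameNodeStatus"


def jmx_url(host, port=50070):
    """Build the JMX query URL (port 50070 is the assumed Hadoop 2.x default)."""
    return f"http://{host}:{port}/jmx?qry={JMX_QUERY}"


def parse_ha_state(payload):
    """Return the 'State' field of the first bean, or 'unknown' if absent."""
    beans = payload.get("beans", [])
    if not beans:
        return "unknown"
    return beans[0].get("State", "unknown")


def fetch_ha_state(host, port=50070, timeout=5):
    """Query one NameNode; report 'unreachable' on network or JSON errors."""
    try:
        with urlopen(jmx_url(host, port), timeout=timeout) as resp:
            return parse_ha_state(json.load(resp))
    except (OSError, ValueError):
        return "unreachable"
```

Calling `fetch_ha_state` against both NameNode hosts every few seconds around the time of a failover should show whether the standby's HTTP endpoint is timing out (for example during a long pause) or whether the process has already exited.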

For checkpointing, I have tried reducing "dfs.namenode.checkpoint.txns" to a lower value such as 100000. The NameNode had been up for 2.5 days, running fine, until I ran a Hive query on the cluster, after which a failover occurred. The problem is now persistent: in less than 1 hour, failover has taken place twice, and each time I had to start the standby manually.

Kindly help.


