Frequent failover takes place in my cluster (6 node cluster on VMs 16 core/ 64 GB RAM / 500 HDD : each / Storage SAN) post which standby slips into stopped state, until manually started via Ambari. During the operation - stderr shows "Getting JMX metrics from NN failed". Also, I get intermittent Namenode Last check point alert & Namenode High Availability alerts which show the standby NN in unknown state. Sometimes the alerts goes OK after a while or else a failover occurs.
For checkpointing, I have tried reducing the "dfs.namenode.checkpoint.txns" value to little lower value like 100000. Namenode uptime was 2.5 days running all fine until I ran a hive query on the cluster after which the failover occurred, it is persistent now in less than 1 hour twice failover has taken place. And I had to manually start the standby.