Created on 07-24-2017 11:28 AM
In large clusters, restarting the NameNode or Secondary NameNode sometimes fails, and Ambari keeps retrying multiple times before finally giving up.
One quick workaround is to increase Ambari's retry timeouts from 5s to 25s (or up to 50s).
In /var/lib/ambari-server/resources/common-services/HDFS/XXX-VERSION-XXX/package/scripts/hdfs_namenode.py
change the retry count and sleep time used while waiting for the NameNode to start from 5 to 25.
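The exact decorator name and parameter names in hdfs_namenode.py vary by Ambari version, so the following is a simplified stand-in (retry, times, and sleep_time are assumptions, not the verified Ambari signature) that shows what raising the two values does: the total wait window grows from roughly times * sleep_time = 5 * 5 = 25s to 25 * 25 = 625s.

```python
import time

def retry(times=5, sleep_time=5, err_class=Exception):
    """Simplified stand-in for Ambari's retry decorator (names are assumptions)."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except err_class:
                    if attempt == times - 1:
                        raise          # out of retries: propagate the failure
                    time.sleep(sleep_time)
        return wrapper
    return decorator

calls = []

# sleep_time=0 so this demo runs instantly; in Ambari the point of the fix
# is raising times and sleep_time so a slow NameNode has time to come up.
@retry(times=3, sleep_time=0)
def flaky_check():
    calls.append(1)
    if len(calls) < 3:
        raise Exception("NameNode not active yet")
    return "active"

print(flaky_check())  # prints: active
```

With the stock values the status check gives up after about 25 seconds, which is often too short for a NameNode loading a large fsimage; raising both values simply widens that window.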
If the restart still fails, you can try the following.
One possible root cause is the Solr audit logging from Ambari Infra, which can create huge audit logs that need to be written to HDFS and slow the NameNode restart.
Restart the Ambari server so the change takes effect.
You can clear the spooled audit logs on the NN and Standby NN hosts here: /var/log/hadoop/hdfs/audit/solr/spool
Be careful: delete the logs only on the Standby NN, then do a failover and delete them from the other server. Do not delete logs while that NameNode is active.
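A cautious way to script that cleanup is sketched below. The spool path comes from the article; the helper itself (clear_spool, its dry_run flag) is hypothetical, and it must only be run on the Standby NameNode after confirming its role:

```python
import glob
import os

# Spool directory where Ambari Infra's Solr audit files accumulate (from the article).
SPOOL_DIR = "/var/log/hadoop/hdfs/audit/solr/spool"

def clear_spool(spool_dir=SPOOL_DIR, dry_run=True):
    """Delete spooled Solr audit files.

    Hypothetical helper: run ONLY on the Standby NameNode, then fail over
    and repeat on the other host -- never while that NameNode is active.
    Returns the list of files that were (or, in a dry run, would be) removed.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(spool_dir, "*"))):
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed

# Preview first, then pass dry_run=False to actually delete:
# clear_spool()                 # list what would be removed
# clear_spool(dry_run=False)    # remove the spool files
```

Defaulting to a dry run makes it harder to wipe the spool on the wrong (active) host by accident.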