In large clusters, restarting the NameNode or Secondary NameNode will sometimes fail, and Ambari will keep retrying multiple times before finally giving up.

One quick fix is to increase Ambari's retry timeouts from 5s to 25s.

In /var/lib/ambari-server/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, change the retry decorator.

From this:

@retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)

To this:

@retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)

If it still fails, you can try:

@retry(times=50, sleep_time=50, backoff_factor=2, err_class=Fail)
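To see why larger values buy the restart more time, here is a minimal sketch of what a retry decorator with these parameters typically does. This is an illustration only, not Ambari's actual implementation (which lives in its resource_management library); it assumes standard exponential-backoff semantics, where the sleep is multiplied by backoff_factor after each failed attempt.

import time

class Fail(Exception):
    """Stand-in for Ambari's Fail exception (assumption for this sketch)."""
    pass

def retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail):
    """Retry the wrapped function up to `times` attempts, sleeping
    `sleep_time` seconds after a failure and multiplying the sleep
    by `backoff_factor` before the next attempt (assumed semantics)."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            delay = sleep_time
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except err_class:
                    if attempt == times:
                        raise  # out of attempts, propagate the failure
                    time.sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator

@retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)
def wait_for_namenode():
    # placeholder for the check Ambari retries during a NameNode restart
    pass

Under these assumed semantics, raising both times and sleep_time increases the total time the restart is allowed to take very substantially, which is the point of the change above.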

One possible root cause is the Solr audit logs (from Ambari Infra) producing huge log files that need to be written to HDFS.

You can clear the NN and SNN logs here: /var/log/hadoop/hdfs/audit/solr/spool

Be careful to delete only on the Standby NameNode, then do a failover and delete from the other server. Do not delete the logs while the NameNode is active.
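Before deleting anything, it can help to confirm that the spool directory really is what is filling up. The following small Python helper sums the size of everything under the spool directory; the function name and output format are my own, and only the path comes from this article. Run it on the node that is currently the Standby NameNode.

import os

SPOOL_DIR = "/var/log/hadoop/hdfs/audit/solr/spool"  # path from this article

def spool_size_bytes(path=SPOOL_DIR):
    """Sum the size of all files under the Solr audit spool directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total

if __name__ == "__main__":
    size = spool_size_bytes()
    print("Spool size: %.1f GB" % (size / float(1024 ** 3)))

If the reported size is large, that supports clearing the spool on the standby side as described above.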
