In large clusters, restarting the NameNode or Secondary NameNode sometimes fails, and Ambari keeps retrying several times before giving up.

One thing you can do quickly is increase Ambari's retry timeouts from 5s to 25s (or up to 50s).

In /var/lib/ambari-server/resources/common-services/HDFS/XXX-VERSION-XXX/package/scripts/hdfs_namenode.py

From this:

  @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)

To this:

  @retry(times=25, sleep_time=25, backoff_factor=2, err_class=Fail)

If it still fails, you can try:

  @retry(times=50, sleep_time=50, backoff_factor=2, err_class=Fail)
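For context, these parameters control how many times Ambari retries the NameNode start check and how long it sleeps between attempts. Below is a minimal sketch of how a decorator with these parameters typically behaves; it is plain Python for illustration, not the actual Ambari implementation, and the exponential-backoff behavior is an assumption based on the backoff_factor name:

  import time
  from functools import wraps

  def retry(times=5, sleep_time=5, backoff_factor=2, err_class=Exception):
      """Retry the wrapped call up to `times` attempts, sleeping
      `sleep_time` seconds between failed attempts and multiplying
      the sleep by `backoff_factor` after each failure."""
      def decorator(func):
          @wraps(func)
          def wrapper(*args, **kwargs):
              delay = sleep_time
              for attempt in range(1, times + 1):
                  try:
                      return func(*args, **kwargs)
                  except err_class:
                      if attempt == times:
                          raise            # out of attempts, give up
                      time.sleep(delay)
                      delay *= backoff_factor
          return wrapper
      return decorator

Raising times and sleep_time therefore gives the NameNode considerably more time to finish starting up (for example, replaying its edit logs) before Ambari marks the restart as failed.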

One of the root causes of this may be Solr audit logs (from Ambari Infra), which can generate huge log files that need to be written to HDFS.

Restart the Ambari server after editing the script so the change takes effect.

You can clear the spooled audit logs of the NameNode and Secondary NameNode here: /var/log/hadoop/hdfs/audit/solr/spool

Be careful to delete only on the Standby NameNode, then do a failover to delete from the other server. Do not delete logs while the NameNode is active.
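If you want to script that cleanup safely, here is a hedged sketch that checks the HA state before touching the spool directory. The service ID nn1 is a placeholder (list yours with hdfs getconf -confKey dfs.ha.namenodes.<nameservice>), and you should confirm that removing everything in the spool directory is acceptable in your environment:

  import glob
  import os
  import subprocess

  SPOOL_DIR = "/var/log/hadoop/hdfs/audit/solr/spool"
  NN_SERVICE_ID = "nn1"   # placeholder: the local NameNode's HA service ID

  def namenode_state(service_id):
      """Return 'active' or 'standby' as reported by `hdfs haadmin`."""
      out = subprocess.check_output(
          ["hdfs", "haadmin", "-getServiceState", service_id])
      return out.decode().strip()

  if namenode_state(NN_SERVICE_ID) == "standby":
      for path in glob.glob(os.path.join(SPOOL_DIR, "*")):
          if os.path.isfile(path):
              os.remove(path)
      print("Cleared spool files on the standby NameNode.")
  else:
      print("NameNode is active here; do not delete the spool files now.")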
