This article describes why the Standby NameNode (SBNN) crashes with an OutOfMemoryError (OOM) when audit to Solr is enabled in the cluster, and how to resolve the issue.
The Active NameNode crashes and becomes the Standby NameNode, which then fails to start. Nothing further is written to its .log file; the following trace appears in the .out file:
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node"
1. If audit to Solr is not working, the spool folder fills up and the NameNode becomes busy. When the audit framework detects that an audit destination is down, it buffers the audit messages in memory. Once the memory buffer fills up, it can be configured to spool the unsent messages to disk files (in the NameNode log directory) to prevent or minimize the loss of audit messages. While audit to Solr is failing, frequent "Destination is Down" traces appear in the Active NameNode log file.
2. If audit to Solr keeps failing, the spooled audit files can also quickly fill the Active NameNode log directory, and the Active NameNode can then crash as well due to "100% utilisation of log directory mount point".
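The two symptoms described above (repeated "Destination is Down" traces and a filling log mount point) can be checked with a short script. The following is only a minimal sketch: the helper names `count_destination_down` and `usage_percent` are hypothetical, and the match is case-insensitive because the exact wording of the trace may differ between versions.

```python
import shutil

# Phrase quoted in this article; the exact log wording is an assumption.
MARKER = "destination is down"


def count_destination_down(log_path):
    """Count lines in one log file that contain the marker phrase."""
    count = 0
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(log_path, "r", errors="replace") as handle:
        for line in handle:
            if MARKER in line.lower():
                count += 1
    return count


def usage_percent(directory):
    """Return how full the filesystem holding `directory` is, in percent."""
    usage = shutil.disk_usage(directory)
    return 100.0 * usage.used / usage.total
```

For example, `count_destination_down(path_to_namenode_log)` gives a rough idea of how often the audit destination was reported down, and `usage_percent("/var/log/hadoop/hdfs")` shows how close the log mount point is to full, assuming that path is the NameNode log directory on the host.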
To bring the SBNN back, move all of its logs and the audit directory into a backup sub-directory at /var/log/hadoop/hdfs/old_logs. Also fix the underlying audit to Solr issue so that the spool files do not build up again.
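The move into the backup sub-directory can be scripted. The sketch below is one possible way to do it, not an official procedure: `archive_logs` is a hypothetical helper, it assumes the Standby NameNode process is stopped, that the caller has write access to the log directory, and that the backup sub-directory does not already hold entries with the same names.

```python
import shutil
from pathlib import Path


def archive_logs(log_dir, backup_name="old_logs"):
    """Move every entry in log_dir into log_dir/backup_name.

    Returns the names of the entries that were moved.
    """
    log_dir = Path(log_dir)
    backup_dir = log_dir / backup_name
    backup_dir.mkdir(exist_ok=True)

    moved = []
    # sorted() materializes the listing before anything is moved.
    for entry in sorted(log_dir.iterdir()):
        if entry == backup_dir:
            continue  # never move the backup directory into itself
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return moved
```

On the affected host this would be called as `archive_logs("/var/log/hadoop/hdfs")`, which moves both the log files and the audit directory into /var/log/hadoop/hdfs/old_logs. The backup stays on the same mount point, so this frees no disk space by itself; it only clears the files out of the way of the NameNode startup.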