03-27-2017 02:06 PM
Our development VM cluster (CDH 5.9) has HDFS HA enabled. Because there are only five nodes, two JournalNodes run on the same hosts as the NameNodes and the third runs on a DataNode. Last night we ran into the issue below, which took down both NNs.
FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485], stream=QuorumOutputStream starting at txid 9861544))
Since this is a dev cluster, no one uses it on weekends or at night. After an extensive search, the most likely cause appears to be a NameNode garbage collection pause.
Is there a good way to debug this and tune the heap setting? Currently the NN heap is set to 4 GB on both NameNodes. The default timeout is 20 seconds (dfs.qjournal.select-input-streams.timeout.ms).
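To check the GC-pause theory, one rough approach is to scan the NameNode log for pause warnings that are at least as long as the 20-second timeout. The sketch below is only that. It assumes the log contains Hadoop JvmPauseMonitor messages of the form "Detected pause in JVM or host machine (eg GC): pause of approximately NNNms". The function name find_long_pauses and the log path passed on the command line are hypothetical.

```python
import re
import sys

# Assumed message format from Hadoop's JvmPauseMonitor; adjust if your log differs.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine \(eg GC\): "
    r"pause of approximately (\d+)ms"
)


def find_long_pauses(lines, threshold_ms=20000):
    """Return (line_number, pause_ms) for each pause >= threshold_ms.

    threshold_ms defaults to 20000 to mirror the 20-second timeout
    mentioned above; this is an illustrative choice, not a tuned value.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match:
            pause_ms = int(match.group(1))
            if pause_ms >= threshold_ms:
                hits.append((lineno, pause_ms))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path is hypothetical): python find_pauses.py <namenode-log-file>
    with open(sys.argv[1], errors="replace") as log_file:
        for lineno, pause_ms in find_long_pauses(log_file):
            print(f"line {lineno}: pause of about {pause_ms} ms")
```

If long pauses show up shortly before the FATAL flush error, that would support the GC explanation. If none appear, the cause may lie elsewhere, for example slow disks or network on the JournalNode hosts.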
Any help would be really appreciated.