Our development cluster (CDH 5.9, running on VMs) has HDFS HA enabled. Because there are only five nodes, two of the JournalNodes run on the same hosts as the NameNodes (NN) and the third runs on a DataNode. Last night we ran into the issue below, which took down both NNs.
FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485, xxx.xx.xx.xx:8485], stream=QuorumOutputStream starting at txid 9861544))
Since this is a dev cluster, no one was using it during the weekend or at night. After an extensive search, the most likely cause appears to be an NN garbage collection pause.
Is there a good approach to debugging and tuning the heap settings? Currently the NN heap is set to 4 GB on both NameNodes. The default timeout is 20 seconds (dfs.qjournal.select-input-streams.timeout.ms).
When the NameNode flushes edits to the JournalNodes, it waits up to 20 seconds for a quorum (a majority) of them to acknowledge the write. The timeout that governs this flush is dfs.qjournal.write-txns.timeout.ms, which defaults to 20000 ms. You are seeing this error because the NN did not get the edits acknowledged by a quorum within those 20 seconds. There are several possible reasons, e.g. an NN GC or JVM pause, a JournalNode sharing its disks with other roles, network communication issues, or slow group lookups.
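To make the quorum rule concrete, here is a minimal sketch in Python. It is only an illustration of the majority-within-timeout check, not Hadoop's actual code; the function name and the latency values are made up for the example.

```python
def quorum_acked_in_time(ack_latencies_ms, timeout_ms=20000):
    """Return True if a majority of JournalNodes acknowledged within timeout_ms.

    ack_latencies_ms holds one entry per JournalNode (hypothetical values);
    None means no acknowledgement arrived at all.
    """
    total = len(ack_latencies_ms)
    majority = total // 2 + 1  # 2 of 3 JournalNodes, 3 of 5, and so on
    in_time = sum(
        1 for ms in ack_latencies_ms if ms is not None and ms <= timeout_ms
    )
    return in_time >= majority


# Three JournalNodes: one slow node is tolerated, two are not.
print(quorum_acked_in_time([15, 22000, 40]))     # True
print(quorum_acked_in_time([25000, 22000, 40]))  # False
```

With three JournalNodes, a single slow or dead JN does not fail the flush. A long pause on the NN side is different: it probably makes every acknowledgement look late at once, which would explain why a GC pause can trigger the FATAL even when all JNs are healthy.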
A good starting point is to check the NameNode logs for WARN messages in the period just before the FATAL error.
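If the log is large, a small script can pull out the WARN lines that precede each FATAL. The sketch below is in Python and assumes each log line starts with a timestamp in the form "YYYY-MM-DD HH:MM:SS,mmm LEVEL ..."; adjust the pattern if your log4j layout differs. The function names and the 120-second window are arbitrary choices for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "2016-12-05 02:14:33,123 WARN some.logger: message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+) (.*)$")


def parse_line(line):
    """Return (timestamp, level, rest_of_line), or None if the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None  # e.g. stack trace continuation lines
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    ts += timedelta(milliseconds=int(m.group(2)))
    return ts, m.group(3), m.group(4)


def warnings_before_fatal(lines, window_seconds=120):
    """For each FATAL line, collect earlier WARN lines within window_seconds of it."""
    parsed = [p for p in (parse_line(l) for l in lines) if p is not None]
    window = timedelta(seconds=window_seconds)
    results = []
    for i, (ts, level, msg) in enumerate(parsed):
        if level != "FATAL":
            continue
        preceding = [
            (t, m) for t, lv, m in parsed[:i] if lv == "WARN" and ts - t <= window
        ]
        results.append((ts, msg, preceding))
    return results
```

To use it, read the NameNode log into a list of lines, call warnings_before_fatal(lines), and print each FATAL together with its preceding warnings. Warnings from the JVM pause monitor or about slow JournalNode responses in that window would point toward a GC pause or a slow JN respectively.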