In my dev cluster there are 2 NodeManagers. They have been crashing frequently for the past few weeks because of memory issues, so as a temporary workaround I increased the heap size for the NodeManager process (from 512 MB to 6 GB as of now). Two days ago, after a crash, the process could not even start with 4 GB; it only came up after I increased the heap to 6 GB. The graph below, taken from Cloudera Manager, shows the heap usage across the nodes (jvm_heap_used_mb_across_nodemanagers metric).
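For anyone not on Cloudera Manager: the equivalent of the heap bump I made would look roughly like this in yarn-env.sh. This is just a sketch of my workaround, not a recommended setting; the 6144 mirrors my current 6 GB value, and the heap-dump flags are something I added so the next OOM leaves evidence behind.

```shell
# yarn-env.sh -- NodeManager JVM sizing (mirrors the 6 GB workaround above).
# YARN_NODEMANAGER_HEAPSIZE is in MB and feeds -Xmx for the NM process.
export YARN_NODEMANAGER_HEAPSIZE=6144

# Optional: dump the heap automatically on OOM so the next crash is analysable.
# (Path /tmp/nm-oom.hprof is just my choice; make sure the disk can hold a 6 GB dump.)
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/nm-oom.hprof"
```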
Can you help me in this regard?
A few more observations:
While analysing a heap dump from a killed NodeManager JVM, I found that DeletionService.java (a HashMap inside it) is consuming a huge amount of memory for some reason. Can you look into this?
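In case it helps anyone reproduce the analysis, this is roughly how I captured and triaged the dump. The commands are standard JDK tooling; the pid lookup and output path are just my setup, and these obviously only work against a live NodeManager JVM.

```shell
# Find the NodeManager pid (assumes one NM process per host).
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager | head -n 1)

# Capture a live-object heap dump for offline analysis in Eclipse MAT or similar.
jmap -dump:live,format=b,file=/tmp/nm-heap.hprof "$NM_PID"

# Quick class histogram to spot the dominating types
# (this is where the HashMap entries under DeletionService showed up for me).
jmap -histo:live "$NM_PID" | head -n 25
```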
I think I'm also running into this problem. I found my NodeManagers were occasionally being sent SIGKILL by Cloudera's killparent.sh script, which runs when the NM throws an OutOfMemoryError. In Cloudera Manager I don't see JVM memory usage trending up, so it's a bit of a mystery why it suddenly hits OOM when, a second before, it was well below the limit.
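If you want to confirm you're on the same kill path, the OOM hook is visible in the NM's command line. A sketch below, using a hypothetical abridged command line as the input; on a live host you would feed it `ps -o args= -p <nm_pid>` instead of the sample string.

```shell
# Sample NM command line (abridged, hypothetical) showing what to look for.
# On a live host: NM_CMDLINE=$(ps -o args= -p "$NM_PID")
NM_CMDLINE='java -Xmx6442450944 -XX:OnOutOfMemoryError=/path/to/killparent.sh org.apache.hadoop.yarn.server.nodemanager.NodeManager'

# Print the heap limit and the OOM hook, one flag per line.
echo "$NM_CMDLINE" | tr ' ' '\n' | grep -E 'OnOutOfMemoryError|Xmx'
```

If `-XX:OnOutOfMemoryError=.../killparent.sh` is present, the SIGKILL you see is that hook firing, not an external kill.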
Anyway, please share if you find anything... I will as well!