2. A bunch of NodeManagers exit at the same time
My worker nodes only have YARN and HDFS on them, and my memory is not overcommitted either.
RM is running on a node that has sufficient memory and does not have YARN on that node.
My nodes are m4.4x instances and I can see they are not being used to full capacity, yet my jobs are slow and often get stuck.
What could be the issue? My jobs are run through Oozie and I have allocated 2 GB to it. NM and SNN have 5 GB of memory. All my services are in a healthy state except for the NodeManagers, which run into unexpected exits, and I am not able to find out why that is happening.
Does the NM log have any clue at the end? Or does the NM just exit without even attempting an internal shutdown?
Do you have any log files with a name like hs_err_pid.log? If the answers to the questions above are no, yes, yes, then you might have a JVM crash, and the hs_err_pid.log file can point to what is causing it. Otherwise, the end of the NM log should have good clues as to why the NM exited.
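
If it helps, the two checks above can be sketched as a small Python script. This is only a rough sketch: the function names are made up for illustration, and the directory to search and the NM log path are assumptions that depend on how your cluster is set up, so pass in whatever applies on your nodes.

```python
from collections import deque
from pathlib import Path


def find_jvm_crash_logs(search_dir):
    """Return files under search_dir whose names look like hs_err_pid*.log.

    search_dir is an assumption: use the directory the NM process runs in,
    or wherever your JVM is configured to write its fatal error logs.
    """
    return sorted(Path(search_dir).rglob("hs_err_pid*.log"))


def tail(path, n=50):
    """Return the last n lines of a text file, e.g. the end of the NM log."""
    with open(path, errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming the file
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

For example, call find_jvm_crash_logs() on the directory you want to search and tail() on your NM log file, both run on a node where the NM exited. An empty list from the first call plus shutdown messages at the end of the NM log would suggest the NM stopped on its own rather than the JVM crashing.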