We need your help in better understanding this issue: a YARN application failed (based on the YARN application logs) with "out of memory" errors. The application was submitted to the 'default' queue (we are using the Capacity Scheduler), and today's usage for the default queue was as below:
As you can see above, the default queue was allocated only 5% of the cluster resources, and most of the time the queue was waiting for containers (as indicated in the 'Pending Containers' column). Eventually the job failed, and when we looked into the application (YARN logs) we found it had failed with 'out of memory' errors.
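For context, a 5% queue allocation under the Capacity Scheduler usually looks something like the sketch below. This is a hypothetical capacity-scheduler.xml fragment, not your actual config; the property names are the standard Capacity Scheduler ones, but the values are illustrative. The point is that `capacity` is only the guaranteed share — the queue can elastically borrow idle resources up to `maximum-capacity`, and when other queues are busy, requests beyond the 5% guarantee simply sit as pending containers.

```xml
<!-- capacity-scheduler.xml (illustrative sketch, not the actual cluster config) -->
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>5</value>   <!-- guaranteed share: 5% of cluster resources -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
  <value>100</value> <!-- elastic ceiling: how far the queue may grow when other queues are idle -->
</property>
```

If `maximum-capacity` were set close to the 5% guarantee, the queue could never borrow much even on an idle cluster, which would keep the pending-container count high.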
What confuses me is:
1. Did the job fail with 'out of memory' because it never got its requested memory? As indicated above, the pending container count never went down, which suggests the job never received the memory it asked for. But if jobs don't get their requested memory, shouldn't they just be stuck waiting for allocation rather than failing with 'out of memory'?
2. Or was the job in fact processing more data than usual and actually needed a bigger container size? (The default container size is set to 8GB.)
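If point 2 turns out to be the cause, the usual fix is to raise the per-task container request for that job rather than the queue share. A hypothetical sketch for a MapReduce job is below — the property names are the standard Hadoop ones, but the 12GB/heap values are only illustrative, and the request still has to stay under the cluster-wide cap `yarn.scheduler.maximum-allocation-mb`:

```xml
<!-- Per-job override (illustrative values, assuming a MapReduce job) -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>12288</value>  <!-- container size requested per map task, up from the 8GB default -->
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx9830m</value>  <!-- JVM heap, conventionally ~80% of the container size -->
</property>
```

Note the distinction: a container that is *granted* but whose JVM exceeds its heap (or whose process exceeds the container limit) dies with an out-of-memory style failure, whereas a container that is merely *pending* just waits — which is essentially the question being asked in point 1.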
I just wanted to get a clear picture here, hence I am looking for your advice. I greatly appreciate your time and help.