I am facing a strange issue. I have a single-node install of CDH 5.4.
I am trying to run Spark jobs. Only the first job runs, and any jobs submitted after it get stuck in the ACCEPTED state.
What could be the issue? Are there any limits that I might have accidentally set?
Does the FairScheduler take only memory into consideration when making a decision, or does it also use vcores? If several factors can be involved, this may warrant another CR so that a user can find out (possibly through an API call) exactly why an app is in the ACCEPTED state (memory, cores, disk space, queue limits, etc.).
What the FS takes into account depends on the scheduling type that you have chosen: DRF, Fair or FIFO. The default is DRF, which takes both memory and CPU into account.
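To illustrate why DRF looks at both memory and CPU, here is a minimal sketch of the dominant-share idea: for each application, take its usage of every resource as a fraction of the cluster total, and compare applications by the largest of those fractions. The function name and the 16 GB / 4 vcore cluster size are hypothetical and chosen only for illustration; this is not the scheduler's actual code.

```python
def dominant_share(used_mb, used_vcores, cluster_mb, cluster_vcores):
    """Return (share, resource) for the resource this app uses most,
    measured as a fraction of the cluster total."""
    mem_share = used_mb / cluster_mb
    cpu_share = used_vcores / cluster_vcores
    if mem_share >= cpu_share:
        return mem_share, "memory"
    return cpu_share, "vcores"


# Hypothetical single node with 16 GB and 4 vcores.
memory_heavy = dominant_share(8192, 1, 16384, 4)  # memory dominates
cpu_heavy = dominant_share(2048, 3, 16384, 4)     # vcores dominate
print(memory_heavy, cpu_heavy)
```

An application that looks small in memory terms can therefore still hold a large share of the cluster if it is vcore-heavy.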
An application that asks for more resources than the cluster can accommodate is rejected: for example, if I request 100 GB for a container and the maximum container size is only 64 GB, the request is rejected. However, if I ask for 32 GB and the maximum container size is 64 GB, but no node is large enough to handle 32 GB, the application will just sit there forever (YARN-56). If the maximum container size is 64 GB but no node can accommodate a container of that size, it will most likely just sit there too.
I am not sure what would happen if I requested a 32 GB container in a queue whose maximum resources are only 16 GB: whether it would be rejected or just sit there forever. I have not tested that case.
So you might have a misconfiguration, or you may have just run into a bug.
BTW: whatever was mentioned for memory is true for vcores also.
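The cases described above can be sketched as a small decision function. This only models the behaviour as described in this reply (reject above the maximum container size, otherwise wait if no node is big enough); it is not YARN's code, and the function name, the tuple layout for nodes and the example sizes are all assumptions. The untested queue-maximum case is deliberately left out.

```python
def classify_request(request_mb, request_vcores,
                     max_alloc_mb, max_alloc_vcores, nodes):
    """Classify a container request.

    nodes is a list of (node_mb, node_vcores) tuples giving the total
    resources each node offers to YARN (hypothetical layout).
    """
    # Larger than the configured maximum container: rejected outright.
    if request_mb > max_alloc_mb or request_vcores > max_alloc_vcores:
        return "rejected"
    # Within the maximum, but does any single node have room for it?
    fits_somewhere = any(
        request_mb <= node_mb and request_vcores <= node_vcores
        for node_mb, node_vcores in nodes
    )
    return "schedulable" if fits_somewhere else "stuck in ACCEPTED"


GB = 1024
laptop = [(16 * GB, 4)]  # hypothetical single 16 GB, 4-vcore node

print(classify_request(100 * GB, 1, 64 * GB, 8, laptop))
print(classify_request(32 * GB, 1, 64 * GB, 8, laptop))
print(classify_request(8 * GB, 1, 64 * GB, 8, laptop))
```

The same check runs on vcores as on memory, so a request can wait because of either resource.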
I have a single-node system just for doing minimal testing. I have this exact situation, but my laptop only has 16 GB to provide.
How do I configure the container memory so that a job can run (get past the ACCEPTED state)? Do I raise the container maximum to 16 GB, or do I raise it to a value the machine can never provide, like 64 GB?
You also need to check the following configuration settings (if any):
1. Dynamic Resource Pool Configuration > Resource Pools - Check whether the jobs exceed any of the maximum values set for the queue they are submitted to.
2. Dynamic Resource Pool Configuration > User Limits - Check whether the number of applications a user has submitted simultaneously exceeds the default limit (5) or the value you have specified.
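As a rough way to test the second point, the sketch below counts applications per user and flags users who already have the limit's worth of running applications while others wait in ACCEPTED. The record layout (dictionaries with `user` and `state` keys) and the sample data are assumptions made for illustration; adapt them to however you list your applications. Counting RUNNING applications is only an approximation of how the scheduler enforces the limit.

```python
from collections import Counter


def users_at_limit(apps, max_running_per_user=5):
    """Return {user: (running, waiting)} for users who have reached the
    per-user limit and still have applications waiting in ACCEPTED."""
    running = Counter(a["user"] for a in apps if a["state"] == "RUNNING")
    waiting = Counter(a["user"] for a in apps if a["state"] == "ACCEPTED")
    return {
        user: (running[user], waiting[user])
        for user in waiting
        if running[user] >= max_running_per_user
    }


# Hypothetical sample: alice has 5 running apps and 2 waiting,
# bob has 1 running and 1 waiting.
sample_apps = (
    [{"user": "alice", "state": "RUNNING"}] * 5
    + [{"user": "alice", "state": "ACCEPTED"}] * 2
    + [{"user": "bob", "state": "RUNNING"},
       {"user": "bob", "state": "ACCEPTED"}]
)

print(users_at_limit(sample_apps))
```

If a user shows up here, the waiting applications are probably held back by the user limit. A user who does not show up, like bob in the sample, is more likely waiting on memory, vcores or a pool maximum.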