Support Questions


JOB Stuck in Accepted State

New Contributor

Hi,

I am facing a strange issue. I have a single-node install of CDH 5.4.

I am trying to run Spark jobs. I see that only the first job runs, and any job submitted after the first gets stuck in the ACCEPTED state.

 

What could be the issue? Are there any limits that I might have accidentally set?

 

Thanks,

 Baahu

7 REPLIES

Mentor
Your NodeManager's offered memory resource may be too low for the amount of memory the applications/jobs are demanding. This is a common situation that leaves a job waiting in the ACCEPTED state until more resources become available.

You can raise the value of the CM -> YARN -> Configuration -> "Container Memory" field to resolve this.

This problem is also typically seen only on small installations, such as 1-3 node clusters.
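
If it helps to see where this lives outside CM: as far as I understand, the "Container Memory" field corresponds to yarn.nodemanager.resource.memory-mb in yarn-site.xml. A minimal sketch (the 8192 MB values are assumptions, size them to your nodes):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>   <!-- memory this NodeManager offers to containers -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>   <!-- largest single container the scheduler will grant -->
  </property>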

Contributor

Does FairScheduler take only memory into consideration when making a decision, or does it also use vcores? If the decision can depend on multiple factors, then this may be another CR in which the user can find out the exact reason (possibly through an API call) why an app is in the ACCEPTED state (memory, cores, disk space, queue limits, etc.).

Mentor
CPU (vcores) is considered equally, if the request asks for it.

Super Collaborator

What the FS takes into account depends on the scheduling policy you have chosen: DRF, Fair or FIFO. The default is DRF, which takes both memory and CPU into account.

 

An application that asks for more resources than the cluster can accommodate will be rejected: if I request a 100 GB container and the maximum container size is only 64 GB, the request is refused. However, if I ask for 32 GB and the maximum container size is 64 GB, but there is no node large enough to provide the 32 GB, the application will just sit there forever (YARN-56). If the maximum container size is 64 GB but no node can accommodate a container that large, it will most likely just sit there too.

I am not sure what happens if I request a 32 GB container in a queue whose maximum resources are only 16 GB, whether it gets rejected or just sits there forever. I have not tested that case.

So you might have a misconfiguration, or you might just have run into a bug.

 

 

BTW: whatever was mentioned for memory is also true for vcores.
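
For reference, a minimal fair-scheduler.xml sketch (queue name and limits are hypothetical) showing where the scheduling policy and the per-queue memory/vcore maximums are set:

  <allocations>
    <queue name="default">
      <maxResources>65536 mb, 16 vcores</maxResources>  <!-- cap on what this queue can hold -->
      <schedulingPolicy>drf</schedulingPolicy>           <!-- drf (default), fair, or fifo -->
    </queue>
  </allocations>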

 

Wilfred

Explorer

I have a single-node system just for doing minimal testing. I have this exact situation, but my laptop only has 16 GB to provide.

 

How do I set/configure the container memory so that a job can run (get past the ACCEPTED state)? Do I raise the container maximum to 16 GB, or do I raise it to a value the machine can never provide, like 64 GB?
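
For a single 16 GB machine, a minimal yarn-site.xml sketch (the 12288 MB values are assumptions, not recommendations) that keeps the container limits within what the laptop can actually provide:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>   <!-- leave a few GB for the OS and other daemons -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value>   <!-- requests above this are rejected outright -->
  </property>

The job's own memory request then has to stay at or below these values for it to leave the ACCEPTED state.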

Explorer

Good Job, Nice explanation and logic.

New Contributor

Hi,

 

You also need to check the configuration below (if any of it applies).

 

1. Dynamic Resource Pool Configuration > Resource Pools - Check whether jobs are exceeding any of the maximum values for the queue they are being submitted to.

 

2. Dynamic Resource Pool Configuration > User Limits - Check whether the number of applications the user is running simultaneously is crossing the default value (5) or the value you have specified; a sketch of the equivalent Fair Scheduler settings follows below.
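
If these limits are managed through the Fair Scheduler allocation file rather than CM, a minimal fair-scheduler.xml sketch (queue and user names are hypothetical) covering both checks might look like:

  <allocations>
    <queue name="default">
      <maxResources>16384 mb, 8 vcores</maxResources>  <!-- per-queue maximums (check 1) -->
      <maxRunningApps>10</maxRunningApps>
    </queue>
    <user name="baahu">
      <maxRunningApps>5</maxRunningApps>               <!-- per-user concurrent app limit (check 2) -->
    </user>
    <userMaxAppsDefault>5</userMaxAppsDefault>         <!-- default per-user limit -->
  </allocations>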