Explorer
Posts: 33
Registered: ‎07-27-2015

Job stuck in Accepted state under a specific pool


Hi

 

I am using YARN with a dynamic resource policy and 3 pools:

1) QA - 33.3% of cluster capacity

2) DEV - 16.7% of cluster capacity

3) Adhoc - 50% of cluster capacity
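For reference, a minimal fair-scheduler.xml sketch that would produce the three shares above via relative weights. The queue names are from this post; the weight values 2:1:3 are an assumption, chosen because 33.3% : 16.7% : 50% reduces to that ratio:

```xml
<?xml version="1.0"?>
<allocations>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
  <queue name="QA">
    <weight>2.0</weight>   <!-- 2 / (2+1+3) = 33.3% of cluster capacity -->
  </queue>
  <queue name="DEV">
    <weight>1.0</weight>   <!-- 1 / 6 = 16.7% -->
  </queue>
  <queue name="Adhoc">
    <weight>3.0</weight>   <!-- 3 / 6 = 50% -->
  </queue>
</allocations>
```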

 

It was working fine, but suddenly I am seeing strange behaviour. Resources are available, but a job gets stuck in the Accepted state when submitted to a specific queue. (I am running both MR and Spark jobs.)

As soon as I change the pool name and resubmit the job, it starts running normally with the given memory and vcore limits.

I have not set any user or application limit on any of the queues. I am using the fair scheduler with the DRF scheduling policy.

 

CDH version = 5.5.2

Please suggest. Any help is appreciated.

Explorer
Posts: 33
Registered: ‎07-27-2015

Re: Job stuck in Accepted state under a specific pool

Just to elaborate: I can see pending containers in the dynamic resource policy table.

I have no idea why the containers are in a pending state, as the RM shows that memory and vcores are free.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: Job stuck in Accepted state under a specific pool

Can you post your fair-scheduler.xml here to help identify the issue further? You can go to the YARN web UI > Cluster > Scheduler to check the resource usage per pool and see if there is anything suspicious there.

Explorer
Posts: 33
Registered: ‎07-27-2015

Re: Job stuck in Accepted state under a specific pool

Thanks @haibochen

 

Due to the above issue, I changed the config to only two queues:

1) default

2) qatest

 

Unfortunately, I am still facing the same issue. Below is the fair scheduler configuration:

 

{
  "defaultMinSharePreemptionTimeout": null,
  "defaultQueueSchedulingPolicy": "drf",
  "fairSharePreemptionTimeout": null,
  "queueMaxAMShareDefault": null,
  "queueMaxAppsDefault": null,
  "queuePlacementRules": [],
  "queues": [
    {
      "aclAdministerApps": "*",
      "aclSubmitApps": "*",
      "minSharePreemptionTimeout": null,
      "name": "root",
      "queues": [
        {
          "aclAdministerApps": "*",
          "aclSubmitApps": "*",
          "minSharePreemptionTimeout": null,
          "name": "default",
          "queues": [],
          "schedulablePropertiesList": [
            {
              "impalaMaxMemory": null,
              "impalaMaxQueuedQueries": null,
              "impalaMaxRunningQueries": null,
              "maxAMShare": null,
              "maxResources": null,
              "maxRunningApps": null,
              "minResources": null,
              "scheduleName": "default",
              "weight": 3.0
            }
          ],
          "schedulingPolicy": "drf"
        },
        {
          "aclAdministerApps": "*",
          "aclSubmitApps": "*",
          "minSharePreemptionTimeout": null,
          "name": "qatest",
          "queues": [],
          "schedulablePropertiesList": [
            {
              "impalaMaxMemory": null,
              "impalaMaxQueuedQueries": null,
              "impalaMaxRunningQueries": null,
              "maxAMShare": null,
              "maxResources": {"memory": 102400, "vcores": 98},
              "maxRunningApps": null,
              "minResources": {"memory": 34816, "vcores": 1},
              "scheduleName": "default",
              "weight": 2.0
            }
          ],
          "schedulingPolicy": "drf"
        }
      ],
      "schedulablePropertiesList": [
        {
          "impalaMaxMemory": null,
          "impalaMaxQueuedQueries": null,
          "impalaMaxRunningQueries": null,
          "maxAMShare": null,
          "maxResources": null,
          "maxRunningApps": null,
          "minResources": null,
          "scheduleName": "default",
          "weight": 1.0
        }
      ],
      "schedulingPolicy": "drf"
    }
  ],
  "userMaxAppsDefault": null,
  "users": []
}

 

 

Both queues use the DRF scheduling policy.
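To sanity-check what this config implies: with weights 3.0 (default) and 2.0 (qatest), the steady fair share splits 60/40, and qatest additionally declares minResources of 34816 MB. A rough sketch of that arithmetic (the fair-share formula shown is the standard weight-proportional split; the cluster size is taken from the later screenshots in this thread):

```python
# Sketch: how fair-scheduler weights translate into steady fair shares.
# Weights are taken from the posted config.

def fair_shares(weights):
    """Return each queue's fraction of the cluster, proportional to its weight."""
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}

shares = fair_shares({"default": 3.0, "qatest": 2.0})
print(shares)  # {'default': 0.6, 'qatest': 0.4}

# qatest also declares minResources: 34816 MB, 1 vcore. That minimum can only
# be honoured if the cluster actually has that much memory to give.
cluster_mb = 312 * 1024          # ~312 GB total, per the later screenshots
qatest_min_mb = 34816
print(qatest_min_mb <= cluster_mb)  # True
```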

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: Job stuck in Accepted state under a specific pool

You mentioned that once you changed the queue name the job was submitted with, everything went fine. What was the original queue name, and what did you change it to? Also, can you please look at the resource usage in the YARN Web UI?

Explorer
Posts: 33
Registered: ‎07-27-2015

Re: Job stuck in Accepted state under a specific pool

I changed it to lower case, say from QA to qa. Then it went well.

I had queues with limited capacity, including default. The YARN RM shows free resources, but the job does not get containers. In the dynamic resource table, the containers appear under the pending column.

 

This happens only with the first job of the queue. If that gets submitted, then the next jobs run fine under the fair scheduler policy.

It seems the queue name gets lost somewhere, and as soon as I change it, it starts working.
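One plausible explanation for the lower-case fix working, sketched below: YARN queue names are case-sensitive, so a job submitted to QA would not match a configured pool named qa and could sit waiting without a matching allocation. The queue names are from this thread; the lookup logic is only an illustration, not the scheduler's actual code:

```python
# Illustration (not actual scheduler code): queue lookup is case-sensitive,
# so "root.QA" does not resolve to a configured "root.qa".

configured_pools = {"root.default", "root.qa"}  # pools from this thread

def resolve(queue):
    """Return the fully qualified pool if it exists as configured, else None."""
    full = queue if queue.startswith("root.") else "root." + queue
    return full if full in configured_pools else None

print(resolve("qa"))   # 'root.qa'  -> job runs against the configured pool
print(resolve("QA"))   # None       -> no matching pool with an allocation
```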

Explorer
Posts: 33
Registered: ‎07-27-2015

Re: Job stuck in Accepted state under a specific pool

I took screenshots of the resource policy table and the RM resource availability. Please ignore the struck-through queue names; I am not using them. A YARN service restart should delete them. Task memory is set to 6 GB. The RM shows 302 GB + 8 GB = 310 GB used and 16 GB free. This is a fairly normal scenario, but sometimes the job gets stuck in the Accepted state if we trigger the first job to the qatest queue.

 

rap.JPG

 

rm.JPG

Explorer
Posts: 33
Registered: ‎07-27-2015

Re: Job stuck in Accepted state under a specific pool


Hi,

 

It seems I am facing the same issue again. Can someone please look into this? I am attaching screenshots again.

 

RM.JPG

 

RDP.JPG

 

Why are containers in a pending state even when memory is available? This is what is keeping the job in an undefined state. It gets submitted to the AM, but it never initializes.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: Job stuck in Accepted state under a specific pool

Sorry, I was not able to come back to the forum for a while. According to your latest screenshot (08-23), the total amount of memory in the cluster is 312 GB, unless you have changed the cluster since 08-03. Therefore, I think what you experienced on Aug 3rd is expected (resources were not available in the cluster). Even if you did have 16 GB left in the cluster, that 16 GB could be fragmented across the nodes, so no single node could run a 6 GB container.
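The fragmentation point can be illustrated with a quick check. The per-node free amounts below are hypothetical, chosen only so they sum to 16 GB:

```python
# Hypothetical illustration of the fragmentation argument: 16 GB free in
# total, but spread so that no single NodeManager can host a 6 GB container.
free_gb_per_node = [4, 3, 5, 4]   # assumed split across nodes; sums to 16 GB
container_gb = 6

total_free = sum(free_gb_per_node)
fits_somewhere = any(free >= container_gb for free in free_gb_per_node)

print(total_free)       # 16
print(fits_somewhere)   # False -> job stays in ACCEPTED despite free memory
```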

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: Job stuck in Accepted state under a specific pool

For your latest post (08-23), that does seem like a real issue. You posted the fair-scheduler.xml previously; did you ever change it since 08-03? The Spark job submitted to root.qatest is actually running (its state is RUNNING according to your screenshot).

 

Can you explain what you mean by "It gets submitted to the AM but it never initializes"? Do you mean the AM is still initializing?

 

It may also be helpful to look at the spark job log to see if there is any useful information there.