
Job stuck in Accepted state under a specific pool

Explorer

Hi

 

I am using YARN with dynamic resource pools and have three pools (a rough config sketch follows the list):

1) QA - 33.3% of cluster capacity

2) DEV - 16.7% of cluster capacity

3) Adhoc - 50% of cluster capacity
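
For reference, a minimal fair-scheduler.xml sketch that would express those shares as relative weights (the 2:1:3 weights are just my approximation of the percentages; names and values are illustrative, not my exact config):

<?xml version="1.0"?>
<allocations>
  <!-- Weights are relative: 2 + 1 + 3 = 6, so QA ~ 33.3%, DEV ~ 16.7%, Adhoc = 50% -->
  <queue name="QA">
    <weight>2.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="DEV">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="Adhoc">
    <weight>3.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>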

 

It was working fine, but suddenly I am seeing strange behaviour. Resources are available, but a job gets stuck in the ACCEPTED state when submitted to a specific queue. (I am running both MR and Spark jobs.)

As soon as I change the pool name and resubmit the job, it starts running normally within the given memory and vcore limits.

I have not set any user or application limits on any of the queues. I am using the Fair Scheduler with the DRF scheduling policy.

 

CDH version = 5.5.2

Please suggest. Any help is appreciated.

14 REPLIES

Re: Job stuck in Accepted state under a specific pool

Explorer

Just to elaborate: I can see pending containers in the dynamic resource pools table.

I have no idea why the containers are in the pending state, as the RM shows that memory and vcores are free.

Re: Job stuck in Accepted state under a specific pool

Rising Star

Can you post your fair-scheduler.xml here to help identify the issue? You can also go to the YARN web UI - Cluster - Scheduler to check the resource usage per pool and see if there is anything suspicious there.
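
If it is easier than the web UI, the same per-pool numbers can also be pulled from the ResourceManager REST API or the yarn CLI; a quick sketch (the RM host/port below is a placeholder):

# Per-queue resource usage, the same data as the web UI Scheduler page
curl "http://<rm-host>:8088/ws/v1/cluster/scheduler"

# Applications currently sitting in the ACCEPTED state, with the queue they were submitted to
yarn application -list -appStates ACCEPTED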

Re: Job stuck in Accepted state under a specific pool

Explorer

Thanks @haibochen

 

Due to the above issue, I changed the config to only two queues:

1) default

2) qatest

 

Unfortunately, I am still facing the same issue. Below is the fair scheduler configuration (as JSON):

 

{"defaultMinSharePreemptionTimeout":null,"defaultQueueSchedulingPolicy":"drf","fairSharePreemptionTimeout":null,"queueMaxAMShareDefault":null,"queueMaxAppsDefault":null,"queuePlacementRules":[],"queues":[{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"root","queues":[{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"default","queues":[],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":3.0}],"schedulingPolicy":"drf"},{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"qatest","queues":[],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":{"memory":102400,"vcores":98},"maxRunningApps":null,"minResources":{"memory":34816,"vcores":1},"scheduleName":"default","weight":2.0}],"schedulingPolicy":"drf"}],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":1.0}],"schedulingPolicy":"drf"}],"userMaxAppsDefault":null,"users":[]}

 

 

Both queues use the DRF scheduling policy.
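
If I am reading that JSON correctly, it should translate to roughly this fair-scheduler.xml (my approximation, written out only for readability; the JSON above is the actual config):

<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <aclSubmitApps>*</aclSubmitApps>
    <aclAdministerApps>*</aclAdministerApps>
    <queue name="default">
      <weight>3.0</weight>
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
    <queue name="qatest">
      <weight>2.0</weight>
      <minResources>34816 mb, 1 vcores</minResources>
      <maxResources>102400 mb, 98 vcores</maxResources>
      <schedulingPolicy>drf</schedulingPolicy>
    </queue>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>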

Re: Job stuck in Accepted state under a specific pool

Rising Star

You mentioned that once you changed the queue name the job was submitted with, all went fine. What was the original queue name, and what did you change it to? Also, can you please look at the resource usage in the YARN web UI?

Re: Job stuck in Accepted state under a specific pool

Explorer

I changed it to lower case, say from QA to qa, and then it went well.

I had queues with limited capacity, including default. The YARN RM shows free resources, but the job does not get containers. If I look at the dynamic resource pools table, the containers show up under the pending column.

 

This happens only with the first job in the queue. If that one gets submitted, the next jobs run fine under the fair scheduler policy.

It seems the queue name gets lost somewhere, and as soon as I change it, the job starts working.
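
For reference, this is how the queue gets passed at submission time; the pool name seems to be taken literally, so QA and qa end up as different pools (jar and class names below are placeholders):

# MapReduce: pass the exact pool name (assumes the driver uses ToolRunner/GenericOptionsParser for -D options)
hadoop jar my-job.jar MyDriver -Dmapreduce.job.queuename=qa <other args>

# Spark on YARN: same idea with --queue
spark-submit --master yarn --queue qa --class MyApp my-app.jar <other args>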

Re: Job stuck in Accepted state under a specific pool

Explorer

I took screenshots of the resource pools table and the RM resource availability. Please ignore the struck-through queue names; I am not using them, and a YARN service restart should delete them. Task memory is set to 6 GB. The RM shows 302 GB + 8 GB = 310 GB used and 16 GB free. This is an acceptable scenario, but sometimes a job gets stuck in the ACCEPTED state if we trigger the first job on the qatest queue.

 

rap.JPG

 

rm.JPG

Re: Job stuck in Accepted state under a specific pool

Explorer

Hi,

 

It seems I am facing the same issue again. Can someone please look into this? I am attaching screenshots again.

 

RM.JPG

 

RDP.JPG

 

Why are the containers in the pending state even though memory is available? This is what keeps the job in an undefined state: it gets submitted to the AM, but it never initializes.

Re: Job stuck in Accepted state under a specific pool

Rising Star

For your latest post (08-23), that does seem like a real issue. You posted the fair-scheduler.xml previously; did you ever change it since 08-03? The Spark job submitted to root.qatest is actually running (the state is RUNNING according to your screenshot).

 

Can you explain what you mean by "it gets submitted to the AM, but it never initializes"? Do you mean the AM is still initializing?

 

It may also be helpful to look at the Spark job log to see if there is any useful information there.
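
For example, something along these lines, substituting the real application ID from the RM UI:

# Current status of the stuck application (queue, state, tracking URL, diagnostics)
yarn application -status application_XXXXXXXXXXXXX_XXXX

# Aggregated container logs, including the AM log (available once log aggregation has collected them,
# typically after the application finishes)
yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX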


Re: Job stuck in Accepted state under a specific pool

Explorer
Actually, no. I didn't make any config changes. The YARN pool allocation is the same as in the fair-scheduler configuration above.

"The spark job submitted to root.qatest is running actually (State is RUNNING according to your screenshot)." => It shows running but it will always waiting for the task container. AM container get assigned to the job but task container never get assigned. If i see pool usage, containers will be pending.

"It may also be helpful to look at the spark job log to see if there is any useful information there." => Same job is running fine on default queue.

After long monitoring I see no pattern, except that it never happens with the "default" pool, and when there are more pools (I tried with 3-4) it happens more frequently.

I can't see anything wrong in the logs either. I am kind of running out of ideas.