Created on 08-02-2016 03:25 AM - edited 08-02-2016 03:26 AM
Hi
I am using YARN with a dynamic resource pool policy and 3 pools:
1) QA - 33.3% of cluster capacity
2) DEV - 16.7% of cluster capacity
3) Adhoc - 50% of cluster capacity
It was working fine, but suddenly I see strange behaviour: resources are available, but a job submitted to a specific queue gets stuck in the ACCEPTED state. (I am running both MR and Spark jobs.)
As soon as I change the pool name and resubmit the job, it starts running normally with the given memory and vcore limits.
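For reference, this is roughly how the jobs get pointed at a pool (the queue, jar, and class names below are just placeholders, not the real ones):

# Spark on YARN: the target pool is passed with --queue
spark-submit --master yarn-cluster --queue root.QA \
  --class com.example.MyApp myapp.jar

# MapReduce: the pool is set via mapreduce.job.queuename
# (the -D form is picked up when the driver uses ToolRunner/GenericOptionsParser)
hadoop jar myjob.jar com.example.MyDriver \
  -Dmapreduce.job.queuename=root.QA /input /output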
I have not set any user or application limit on any of the queues. I am using the Fair Scheduler with the DRF scheduling policy.
CDH version = 5.5.2
Please suggest. Any help is appreciated.
Created 08-02-2016 03:31 AM
Just to elaborate, I can see pending containers in the dynamic resource pool table.
I have no idea why the containers are in the pending state, as the RM shows that memory and vcores are free.
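In case it helps, the same thing can be seen from the command line; this lists the applications sitting in the ACCEPTED state together with the queue they were submitted to:

# Applications stuck in the ACCEPTED state, with their queues
yarn application -list -appStates ACCEPTED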
Created 08-02-2016 08:14 AM
Can you post your fair-scheduler.xml here to further identify the issue? You can go to the YARN web UI > Cluster > Scheduler to check the resource usage per pool and see if there is anything suspicious there.
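If it is easier than screenshots, the same per-pool usage can also be pulled from the ResourceManager REST API (the host below is a placeholder; 8088 is the default RM web port):

# Per-queue used/max/fair-share resources as JSON
curl 'http://<rm-host>:8088/ws/v1/cluster/scheduler'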
Created 08-02-2016 08:43 AM
Thanks @haibochen
Due to the above issue, I changed the config to only 2 queues:
1) default
2) qatest
Unfortunately I am still facing the same issue. Below is the fair scheduler configuration:
{"defaultMinSharePreemptionTimeout":null,"defaultQueueSchedulingPolicy":"drf","fairSharePreemptionTimeout":null,"queueMaxAMShareDefault":null,"queueMaxAppsDefault":null,"queuePlacementRules":[],"queues":[{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"root","queues":[{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"default","queues":[],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":3.0}],"schedulingPolicy":"drf"},{"aclAdministerApps":"*","aclSubmitApps":"*","minSharePreemptionTimeout":null,"name":"qatest","queues":[],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":{"memory":102400,"vcores":98},"maxRunningApps":null,"minResources":{"memory":34816,"vcores":1},"scheduleName":"default","weight":2.0}],"schedulingPolicy":"drf"}],"schedulablePropertiesList":[{"impalaMaxMemory":null,"impalaMaxQueuedQueries":null,"impalaMaxRunningQueries":null,"maxAMShare":null,"maxResources":null,"maxRunningApps":null,"minResources":null,"scheduleName":"default","weight":1.0}],"schedulingPolicy":"drf"}],"userMaxAppsDefault":null,"users":[]}
Both queues use the DRF scheduling policy.
Created 08-02-2016 09:00 AM
You mentioned that once you changed the queue name that the job was submitted with, all went fine. What was the original queue name, and what did you change it to? Also, can you please look at the resource usage in the YARN web UI?
Created 08-02-2016 10:12 AM
I changed it to lower case, say from QA to qa, and then it went well.
I had queues with limited capacity, including default. The YARN RM shows free resources, but the job does not get containers. If I look at the dynamic resource pool table, the containers show up under the pending column.
This happens only with the first job submitted to the queue. Once that one gets going, subsequent jobs run fine under the fair scheduler policy.
It seems the queue name gets lost somewhere, and as soon as I change it, things start working.
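To rule out the queue name itself, something like this should list the queues exactly as the ResourceManager sees them, with their state and scheduling info:

# Queues as seen by the ResourceManager
mapred queue -list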
Created 08-03-2016 05:50 AM
I took screenshots of the resource pool table and the RM resource availability. Please ignore the struck-through queue names; I am not using them, and a YARN service restart should remove them. Task memory is set to 6 GB. The RM shows 302 GB + 8 GB = 310 GB used and 16 GB free. That is a reasonable scenario, but sometimes the job gets stuck in the ACCEPTED state if we trigger the first job in the qatest queue.
Created on 08-23-2016 04:42 AM - edited 08-23-2016 07:53 AM
Hi,
It seems I am facing the same issue again. Can someone please look into this? I am attaching screenshots again.
Why are the containers in the pending state even though memory is available? This is what keeps the job in an undefined state: it gets submitted to the AM but never initializes.
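For anyone reproducing this, the state and diagnostics of a stuck application can be checked like this (the application id is a placeholder):

# Shows the application's state, queue, diagnostics and tracking URL
yarn application -status <application_id>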
Created 08-31-2016 01:26 PM
For your latest post (08-23), that does seem like a real issue. You posted the fair-scheduler.xml previously; have you changed it since 08-03? The Spark job submitted to root.qatest is actually running (its state is RUNNING according to your screenshot).
Can you explain what you mean by "it gets submitted to the AM but never initializes"? Do you mean the AM is still initializing?
It may also be helpful to look at the Spark job log to see if there is any useful information there.
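Assuming log aggregation is enabled, the aggregated container logs can be pulled once the application has finished (for a still-running app, the container logs are reachable through the RM web UI instead); the application id below is a placeholder:

# Aggregated YARN container logs (requires log aggregation; typically
# available after the application finishes)
yarn logs -applicationId <application_id>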
Created 08-31-2016 11:17 PM