My client is seeing an interesting problem in their Prod cluster, which has node labels enabled. Users sometimes complain that their jobs sit in the “ACCEPTED” state for a long time (up to 25 minutes in some cases), and then start and finish within a minute. We noticed that this only happens to jobs submitted to the queue that has a node label enabled on it. Jobs submitted to other queues are not affected.
The node label covers 15 nodes and is non-exclusive. All queues use the “fair” ordering policy.
The queue is not heavily utilized; sometimes it takes as few as 2 jobs for them to get stuck in the ACCEPTED state.