Then all newly submitted applications got stuck for nearly 5 hours, even though the cluster resource usage was only about (600 GB, 120 vCores), which means the cluster still had sufficient free resources.
Since my cluster is not large, I have ruled out the possibility described in [YARN-4618].
Besides that, none of the running applications ever seemed to finish, and the YARN RM seemed static: the RM log had no more state-change entries for the running applications, only entries for more and more applications being submitted and becoming ACCEPTED, but never going from ACCEPTED to RUNNING.
The resource usage of the whole YARN cluster AND of each single queue stayed unchanged for 5 hours, which is really strange.
The cluster seemed like a zombie.
I checked the ApplicationMaster log of one of the running-but-stuck applications:
2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP job_1507795051888_183385. Report-size will be 4
2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE job_1507795051888_183385. Report-size will be 0
2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1507795051888_183385: ask=6 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
You can see that at 2017-11-11 09:04:56,061 the AM sent a resource request (ask=6) to the ResourceManager, but the RM allocated zero containers. After that there were no more allocator logs for 5 hours, and at 13:58 I had to kill the job manually.
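For anyone who wants to reproduce this check, the stuck state is easy to confirm from the RM side: the application report keeps saying RUNNING while the used-container count stays at zero. Below is a minimal sketch using the standard YarnClient API (the application ID is the one from the log above; ApplicationId.fromString needs Hadoop 2.8+, on older versions ConverterUtils.toApplicationId does the same thing):

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckStuckApp {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        try {
            // The application from the AM log above.
            ApplicationId appId = ApplicationId.fromString("application_1507795051888_183385");
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            ApplicationResourceUsageReport usage = report.getApplicationResourceUsageReport();
            // For a stuck app, the state stays RUNNING while usedContainers stays at 0.
            System.out.println("state=" + report.getYarnApplicationState()
                    + " usedContainers=" + usage.getNumUsedContainers()
                    + " usedResources=" + usage.getUsedResources());
        } finally {
            yarnClient.stop();
        }
    }
}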
After 5 hours, I killed some pending applications and then everything recovered: the remaining cluster resources could be allocated again, and the ResourceManager seemed to be alive again.
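The manual recovery I did was essentially "yarn application -kill" on the pending apps; the following is a rough programmatic equivalent using the standard YarnClient API (killing everything in ACCEPTED is deliberately blunt, it just mirrors what I did by hand):

import java.util.EnumSet;
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class KillPendingApps {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();
        try {
            // All applications still sitting in ACCEPTED, i.e. pending, no AM container yet.
            List<ApplicationReport> pending =
                    client.getApplications(EnumSet.of(YarnApplicationState.ACCEPTED));
            for (ApplicationReport app : pending) {
                System.out.println("killing " + app.getApplicationId());
                client.killApplication(app.getApplicationId());
            }
        } finally {
            client.stop();
        }
    }
}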
I have also ruled out the maxRunningApps and maxAMShare limits, because those settings only affect a single queue, while my problem is that applications got stuck across the whole YARN cluster.
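For context, both of those are per-queue properties in the Fair Scheduler allocation file, something like the excerpt below (the queue name and values are made up for illustration, not my real config):

<!-- illustrative fair-scheduler.xml excerpt; queue name and values are hypothetical -->
<allocations>
  <queue name="some_queue">
    <!-- caps how many apps may run concurrently in this queue -->
    <maxRunningApps>50</maxRunningApps>
    <!-- caps the fraction of the queue's fair share usable by ApplicationMasters -->
    <maxAMShare>0.5</maxAMShare>
  </queue>
</allocations>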
Also, I have ruled out a ResourceManager full-GC problem, because I checked with gcutil: no full GC happened, and the ResourceManager's memory looked fine.
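For completeness, the same information gcutil shows can also be read via the standard java.lang.management API; the sketch below only inspects the JVM it runs in, so checking the RM this way would additionally need a remote JMX connection, which I omit:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // Prints cumulative collection counts/times per collector; a growing
        // count on the old-generation collector would indicate full GCs.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}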