
CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Two stack traces were collected. One was taken under normal conditions, i.e., with no job running (rmstack.output.normal). The other was taken while the ResourceManager was hanging (rmstack.output). Comparing the two, the following thread/handler differs between them. I am not sure whether it is useful.

 

(Attachment: diffResult.png)

From rmstack.output, i.e., the hanging ResourceManager:

"IPC Server handler 46 on 8031" daemon prio=10 tid=0x00007f75a4a5b800 nid=0x6d6c runnable [0x00007f75974e0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000007a47b2208> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
 at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
 at org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:109)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1985)

 

From rmstack.output.normal:

"IPC Server handler 46 on 8031" daemon prio=10 tid=0x00007f75a4a5b800 nid=0x6d6c waiting on condition [0x00007f75974e0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00000007a47b2208> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
 at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
 at org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:109)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1985)

 


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Super Collaborator

The fact that the two traces are different means that there is still something going on inside the RM. Not everything is locked up.

I assume that you have looked for threads that have a Thread.State of BLOCKED or WAITING. TIMED_WAITING threads are OK; nothing is wrong with those. BLOCKED or WAITING means that something more is going on. BLOCKED is the really bad one. WAITING could be OK for worker threads that are waiting for work to be released from a queue or something like it; they normally use a monitor for that.
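
To make those stand out in a large dump, you can tally the states. A minimal sketch (assuming the dumps are plain jstack output, as your snippets suggest) that counts the java.lang.Thread.State lines in a dump file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Tallies the "java.lang.Thread.State:" lines in a jstack dump so that
// any BLOCKED or WAITING threads stand out immediately.
// Usage: java StackStateCount rmstack.output
public class StackStateCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.map(String::trim)
                 .filter(l -> l.startsWith("java.lang.Thread.State:"))
                 .forEach(l -> counts.merge(
                         l.substring("java.lang.Thread.State:".length()).trim(),
                         1L, Long::sum));
        }
        counts.forEach((state, n) -> System.out.println(state + ": " + n));
    }
}

Running it over both rmstack.output and rmstack.output.normal and comparing the totals is a quick first check.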

 

The larger stack trace snippet that you uploaded does not show anything wrong. The threads that are in WAITING are fine (at least the ones I can see there). Based on this information I cannot tell why things are hanging. I have not seen the issue in my local setup and cannot reproduce it either.

 

Wilfred


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Thanks for looking at the stack trace and the help.

 

There are no BLOCKED threads in the stack trace, although there are many WAITING threads (which, as you said, are okay). The hang happens every time in this particular sequence/environment, i.e., one main job launches many sub-jobs. The hang does not happen if all the sub-jobs are launched in different

 

I have another question about what to do when onError() receives the over-max-memory error. Right now stop() is the only call in onError(). Do you have any suggestions for what to add? I have read the API documents (from different distributions), and only stop() is mentioned. Thanks for any suggestions.
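
To make the question concrete, here is a minimal sketch of an onError() that does a bit more than stop(): it records the cause and raises a done flag so the AM main loop can unregister cleanly instead of timing out. The setClient() wiring and field names are illustrative, not from the YARN API; only the AMRMClientAsync.CallbackHandler methods are real.

import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

public class AmCallbackHandler implements AMRMClientAsync.CallbackHandler {
    private final AtomicBoolean done = new AtomicBoolean(false);
    private volatile Throwable error;
    private volatile AMRMClientAsync<?> client; // set after the client is created

    public void setClient(AMRMClientAsync<?> client) { this.client = client; }

    @Override
    public void onError(Throwable e) {
        error = e;        // keep the cause for the final status message
        done.set(true);   // tell the AM main loop to shut down
        client.stop();    // stop the heartbeat thread, as before
    }

    @Override public void onShutdownRequest() { done.set(true); }
    @Override public void onContainersCompleted(List<ContainerStatus> statuses) { }
    @Override public void onContainersAllocated(List<Container> containers) { }
    @Override public void onNodesUpdated(List<NodeReport> updatedNodes) { }
    @Override public float getProgress() { return 0.0f; }

    public boolean isDone() { return done.get(); }
    public Throwable getError() { return error; }
}

The main loop can then watch isDone() and call unregisterApplicationMaster(FinalApplicationStatus.FAILED, ...) with a message built from getError(), so the RM does not have to wait out the 600-second liveliness timeout.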

 

 

Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Cloudera Employee
When you see this issue, can you track the reserved resources for the duration of running these jobs?
Karthik Kambatla
Software Engineer, Cloudera Inc.

Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Thanks for looking at this issue and for the help. Would you please list the steps? I am not sure exactly what you want.


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Super Collaborator

The reservations for memory and vcores are logged in the normal RM log.

The RM web UI also shows them on the main page, at the top under Cluster Metrics. There are two values: Memory Reserved and VCores Reserved.
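
If you want to capture those values over time, the same numbers are exposed by the RM REST API at /ws/v1/cluster/metrics (fields reservedMB and reservedVirtualCores). A minimal sketch that fetches them, assuming the RM web UI runs on machine2.domain on the default port 8088:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches the RM cluster metrics as JSON; the payload includes
// "reservedMB" and "reservedVirtualCores", the same values shown
// under Cluster Metrics in the RM web UI.
public class ReservedResources {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://machine2.domain:8088/ws/v1/cluster/metrics");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        // Printing the raw JSON is enough for eyeballing; a JSON
        // library would be cleaner for real monitoring.
        System.out.println(body);
    }
}

Run it in a loop while the jobs execute and watch whether the reserved values climb and stay up.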

 

Wilfred


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Here is the content of the RM log; machine2.domain has the RM on it. Is this what you need?

 

2015-04-29 13:38:32,108 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000002 Container Transitioned from ALLOCATED to ACQUIRED

2015-04-29 13:38:33,201 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000003 Container Transitioned from NEW to ALLOCATED

2015-04-29 13:38:33,201 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1428332257288_7669 CONTAINERID=container_1428332257288_7669_01_000003

2015-04-29 13:38:33,201 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1428332257288_7669_01_000003 of capacity <memory:260, vCores:1> on host machine2.domain:8041, which has 2 containers, <memory:390, vCores:2> used and <memory:7802, vCores:6> available after allocation

2015-04-29 13:38:33,202 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1428332257288_7669_000001 container=Container: [ContainerId: container_1428332257288_7669_01_000003, NodeId: machine2.domain:8041, NodeHttpAddress: machine2.domain:8042, Resource: <memory:260, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:650, vCores:4>, usedCapacity=0.03967285, absoluteUsedCapacity=0.03967285, numApps=3, numContainers=4 clusterResource=<memory:16384, vCores:16>

2015-04-29 13:38:33,202 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:910, vCores:5>, usedCapacity=0.055541992, absoluteUsedCapacity=0.055541992, numApps=3, numContainers=5

2015-04-29 13:38:33,202 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.055541992 absoluteUsedCapacity=0.055541992 used=<memory:910, vCores:5> cluster=<memory:16384, vCores:16>

2015-04-29 13:38:33,232 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : machine2.domain:8041 for container : container_1428332257288_7669_01_000003

2015-04-29 13:38:33,232 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000003 Container Transitioned from ALLOCATED to ACQUIRED

2015-04-29 13:38:33,335 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: checking for deactivate...

2015-04-29 13:38:34,056 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000002 Container Transitioned from ACQUIRED to RUNNING

2015-04-29 13:38:34,202 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000003 Container Transitioned from ACQUIRED to RUNNING

2015-04-29 13:38:35,074 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000002 Container Transitioned from RUNNING to COMPLETED

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1428332257288_7669_01_000002 in state: COMPLETED event:FINISHED

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1428332257288_7669 CONTAINERID=container_1428332257288_7669_01_000002

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1428332257288_7669_01_000002 of capacity <memory:260, vCores:1> on host machine1.domain:8041, which currently has 2 containers, <memory:260, vCores:2> used and <memory:7932, vCores:6> available, release resources=true

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:650, vCores:4> numContainers=4 user=yarn user-resources=<memory:650, vCores:4>

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1428332257288_7669_01_000002, NodeId: machine1.domain:8041, NodeHttpAddress: machine1.domain:8042, Resource: <memory:260, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 9.70.151.41:8041 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:650, vCores:4>, usedCapacity=0.03967285, absoluteUsedCapacity=0.03967285, numApps=3, numContainers=4 cluster=<memory:16384, vCores:16>

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.03967285 absoluteUsedCapacity=0.03967285 used=<memory:650, vCores:4> cluster=<memory:16384, vCores:16>

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:650, vCores:4>, usedCapacity=0.03967285, absoluteUsedCapacity=0.03967285, numApps=3, numContainers=4

2015-04-29 13:38:35,075 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1428332257288_7669_000001 released container container_1428332257288_7669_01_000002 on node: host: machine1.domain:8041 #containers=2 available=7932 used=260 with event: FINISHED

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000003 Container Transitioned from RUNNING to COMPLETED

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1428332257288_7669_01_000003 in state: COMPLETED event:FINISHED

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1428332257288_7669 CONTAINERID=container_1428332257288_7669_01_000003
2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1428332257288_7669_01_000003 of capacity <memory:260, vCores:1> on host machine2.domain:8041, which currently has 1 containers, <memory:130, vCores:1> used and <memory:8062, vCores:7> available, release resources=true

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:390, vCores:3> numContainers=3 user=yarn user-resources=<memory:390, vCores:3>

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1428332257288_7669_01_000003, NodeId: machine2.domain:8041, NodeHttpAddress: machine2.domain:8042, Resource: <memory:260, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 9.70.151.46:8041 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:390, vCores:3>, usedCapacity=0.023803711, absoluteUsedCapacity=0.023803711, numApps=3, numContainers=3 cluster=<memory:16384, vCores:16>

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.023803711 absoluteUsedCapacity=0.023803711 used=<memory:390, vCores:3> cluster=<memory:16384, vCores:16>
2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:390, vCores:3>, usedCapacity=0.023803711, absoluteUsedCapacity=0.023803711, numApps=3, numContainers=3

2015-04-29 13:38:35,206 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1428332257288_7669_000001 released container container_1428332257288_7669_01_000003 on node: host: machine2.domain:8041 #containers=1 available=8062 used=130 with event: FINISHED

2015-04-29 13:38:35,489 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1428332257288_7669_000001
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=8224, maxMemory=8192

 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:196)

 at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:94)

 at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:487)
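
For what it's worth, the ask that triggers the exception (8224 MB) is just above the configured maximum (8192 MB, i.e. yarn.scheduler.maximum-allocation-mb). One way an AM can avoid the exception is to cap every request at the maximum the RM reports back at registration; a minimal sketch (the helper class is hypothetical, only the YARN types are real):

import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Resource;

// Caps a resource ask at the cluster maximum returned when the AM
// registered, so the RM never sees requestedMemory > maxMemory.
public final class ResourceCap {
    private ResourceCap() { }

    public static Resource cap(Resource requested,
                               RegisterApplicationMasterResponse registration) {
        Resource max = registration.getMaximumResourceCapability();
        return Resource.newInstance(
                Math.min(requested.getMemory(), max.getMemory()),
                Math.min(requested.getVirtualCores(), max.getVirtualCores()));
    }
}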


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Super Collaborator

I completely overlooked the fact that this is not the FairScheduler but the CapacityScheduler. The change that was made went into the overarching code for both, and we have seen the fix work for the FairScheduler: it recovers from the issue with the standard config. We recommend using the FairScheduler for a CDH release, since we do far more testing (also at scale) and development work on it.

That said: can you show the lines (20-25 at least) just after the error was thrown? That should shed some more light on what the scheduler is doing.

 

Wilfred


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Thanks for the help. Here are the lines after the error.

 

2015-04-29 13:39:00,927 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1428332257288_7670_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=8224, maxMemory=8192
2015-04-29 13:39:00,927 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 8030, call org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 9.70.151.46:35075 Call#296 Retry#0
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=8224, maxMemory=8192
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:196)
	at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:94)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:487)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:appattempt_1428332257288_7669_000001 Timed out after 600 secs
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:appattempt_1428332257288_7670_000001 Timed out after 600 secs
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1428332257288_7669_000001 with final state: FAILED, and exit status: -1000
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1428332257288_7669_000001 State change from RUNNING to FINAL_SAVING
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1428332257288_7670_000001 with final state: FAILED, and exit status: -1000
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1428332257288_7670_000001 State change from RUNNING to FINAL_SAVING
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1428332257288_7669_000001
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1428332257288_7669_000001
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1428332257288_7669_000001 State change from FINAL_SAVING to FAILED
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1428332257288_7670_000001
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1428332257288_7670_000001
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1428332257288_7670_000001 State change from FINAL_SAVING to FAILED
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1428332257288_7669 with final state: FAILED
2015-04-29 13:51:00,128 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1428332257288_7669 State change from RUNNING to FINAL_SAVING
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1428332257288_7670 with final state: FAILED
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1428332257288_7670 State change from RUNNING to FINAL_SAVING
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1428332257288_7669
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1428332257288_7669_000001 is done. finalState=FAILED
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1428332257288_7670_000001
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1428332257288_7670
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7669_01_000001 Container Transitioned from RUNNING to KILLED
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1428332257288_7669_01_000001 in state: KILLED event:KILL
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1428332257288_7669	CONTAINERID=container_1428332257288_7669_01_000001
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1428332257288_7669_01_000001 of capacity <memory:130, vCores:1> on host isblade9.swg.usma.ibm.com:8041, which currently has 1 containers, <memory:130, vCores:1> used and <memory:8062, vCores:7> available, release resources=true
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:260, vCores:2> numContainers=2 user=yarn user-resources=<memory:260, vCores:2>
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1428332257288_7669_01_000001, NodeId: isblade9.swg.usma.ibm.com:8041, NodeHttpAddress: isblade9.swg.usma.ibm.com:8042, Resource: <memory:130, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 9.70.151.41:8041 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:260, vCores:2>, usedCapacity=0.01586914, absoluteUsedCapacity=0.01586914, numApps=3, numContainers=2 cluster=<memory:16384, vCores:16>
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.01586914 absoluteUsedCapacity=0.01586914 used=<memory:260, vCores:2> cluster=<memory:16384, vCores:16>
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:260, vCores:2>, usedCapacity=0.01586914, absoluteUsedCapacity=0.01586914, numApps=3, numContainers=2
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1428332257288_7669_000001 released container container_1428332257288_7669_01_000001 on node: host: isblade9.swg.usma.ibm.com:8041 #containers=1 available=8062 used=130 with event: KILL
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1428332257288_7669 requests cleared
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1428332257288_7669 user: yarn queue: default #user-pending-applications: 0 #user-active-applications: 2 #queue-pending-applications: 0 #queue-active-applications: 2
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1428332257288_7670_000001 is done. finalState=FAILED
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1428332257288_7670_01_000001 Container Transitioned from RUNNING to KILLED
2015-04-29 13:51:00,129 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1428332257288_7669_000001
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Completed container: container_1428332257288_7670_01_000001 in state: KILLED event:KILL
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1428332257288_7670	CONTAINERID=container_1428332257288_7670_01_000001
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1428332257288_7670_01_000001 of capacity <memory:130, vCores:1> on host isblade10.swg.usma.ibm.com:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:8> available, release resources=true
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: default used=<memory:130, vCores:1> numContainers=1 user=yarn user-resources=<memory:130, vCores:1>
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: completedContainer container=Container: [ContainerId: container_1428332257288_7670_01_000001, NodeId: isblade10.swg.usma.ibm.com:8041, NodeHttpAddress: isblade10.swg.usma.ibm.com:8042, Resource: <memory:130, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 9.70.151.46:8041 }, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:130, vCores:1>, usedCapacity=0.00793457, absoluteUsedCapacity=0.00793457, numApps=2, numContainers=1 cluster=<memory:16384, vCores:16>
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=0.00793457 absoluteUsedCapacity=0.00793457 used=<memory:130, vCores:1> cluster=<memory:16384, vCores:16>
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting completed queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:130, vCores:1>, usedCapacity=0.00793457, absoluteUsedCapacity=0.00793457, numApps=2, numContainers=1
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application attempt appattempt_1428332257288_7670_000001 released container container_1428332257288_7670_01_000001 on node: host: isblade10.swg.usma.ibm.com:8041 #containers=0 available=8192 used=0 with event: KILL
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1428332257288_7670 requests cleared
2015-04-29 13:51:00,131 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1428332257288_7670 user: yarn queue: default #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1
2015-04-29 13:51:00,160 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1428332257288_7669 failed 1 times due to ApplicationMaster for attempt appattempt_1428332257288_7669_000001 timed out. Failing the application.
2015-04-29 13:51:00,160 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1428332257288_7669 State change from FINAL_SAVING to FAILED
2015-04-29 13:51:00,177 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1428332257288_7670 failed 1 times due to ApplicationMaster for attempt appattempt_1428332257288_7670_000001 timed out. Failing the application.
2015-04-29 13:51:00,178 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1428332257288_7670 State change from FINAL_SAVING to FAILED
2015-04-29 13:51:00,178 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Container container_1428332257288_7670_01_000001 already scheduled for cleanup, no further processing
2015-04-29 13:51:00,178 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Container container_1428332257288_7669_01_000001 already scheduled for cleanup, no further processing
2015-04-29 13:51:00,178 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn	OPERATION=Application Finished - Failed	TARGET=RMAppManager	RESULT=FAILURE	DESCRIPTION=App failed with state: FAILED	PERMISSIONS=Application application_1428332257288_7669 failed 1 times due to ApplicationMaster for attempt appattempt_1428332257288_7669_000001 timed out. Failing the application.	APPID=application_1428332257288_7669
2015-04-29 13:51:00,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1428332257288_7669 user: yarn leaf-queue of parent: root #applications: 2

 

 


Re: CDH 5.3, after the error InvalidResourceRequestException happened, no other task can be run

Explorer

Thanks for the help. From my preliminary test, the YARN ResourceManager hang was not seen after switching to the FairScheduler. A full test run will be done tonight to see whether there is any other issue.
