Support Questions
Find answers, ask questions, and share your expertise

Race Condition where Task is Preempted, but Job fails from InvalidStateTransitonException.

Cloudera Employee

Diagnosing a Race condition where Job Finishes with a FINISHED State, but FAILS because of a org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at COMMITTING. In reviewing the Job logs of the Application All of the other tasks are still RUNNING when the Preemption takes place, but one of the tasks has already completed and is causing this to go into a Invalid state. I'm trying to see if there are any recommended ways to work around this.

Below is a clip of the Job Log:

2016-05-13 12:10:44,686 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e594_1462333361639_127115_01_000076 taskAttempt attempt_1462333361639_127115_m_000000_3

2016-05-13 12:10:44,693 INFO [ContainerLauncher #7] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1462333361639_127115_m_000000_3

2016-05-13 12:10:44,694 INFO [ContainerLauncher #7] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : <hostName>:45454

2016-05-13 12:10:44,708 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1462333361639_127115_m_000000_3 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED

2016-05-13 12:10:44,709 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1462333361639_127115_m_000000_3

2016-05-13 12:10:44,710 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1462333361639_127115_m_000000 Task Transitioned from RUNNING to SUCCEEDED

2016-05-13 12:10:44,710 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1

2016-05-13 12:10:44,711 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1462333361639_127115Job Transitioned from RUNNING to COMMITTING

2016-05-13 12:10:44,712 INFO [CommitterEvent Processor #4] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_COMMIT

2016-05-13 12:10:44,787 INFO [CommitterEvent Processor #4] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1

2016-05-13 12:10:45,098 INFO [CommitterEvent Processor #4] hive.metastore: Trying to connect to metastore with URI thrift://<hostName>:9083

2016-05-13 12:10:45,157 INFO [CommitterEvent Processor #4] hive.metastore: Connected to metastore.

2016-05-13 12:10:45,293 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:0 RackLocal:0

2016-05-13 12:10:45,294 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_e594_1462333361639_127115_01_000076

2016-05-13 12:10:45,295 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:0 RackLocal:0

2016-05-13 12:10:45,295 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1462333361639_127115_m_000000_3 TaskAttempt Transitioned from SUCCEEDED to KILLED

2016-05-13 12:10:45,295 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1462333361639_127115_m_000000_3: Container preempted by scheduler

2016-05-13 12:10:45,297 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved <hostName> to /rack13

2016-05-13 12:10:45,297 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved <hostname> to /rack6

2016-05-13 12:10:45,298 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.util.RackResolver: Resolved <hostname> to /rack6

2016-05-13 12:10:45,298 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1462333361639_127115_m_000000 Task Transitioned from SUCCEEDED to SCHEDULED

2016-05-13 12:10:45,298 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state

org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_TASK_ATTEMPT_COMPLETED at COMMITTING

	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)

	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)

	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)

	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996)

	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138)

	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289)

	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285)

	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)

	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)

	at java.lang.Thread.run(Thread.java:744)

2016-05-13 12:10:45,299 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Can't handle this event at current state

org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_MAP_TASK_RESCHEDULED at COMMITTING

	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)

	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)

	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)

	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:996)

	at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:138)

	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1289)

	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1285)

	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)

	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)

	at java.lang.Thread.run(Thread.java:744)

2016-05-13 12:10:45,299 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1462333361639_127115_m_000000_4 TaskAttempt Transitioned from NEW to UNASSIGNED

2016-05-13 12:10:45,302 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1462333361639_127115Job Transitioned from COMMITTING to ERROR

2016-05-13 12:10:45,303 INFO [Thread-87] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: We are finishing cleanly so this is the last retry

2016-05-13 12:10:45,303 INFO [Thread-87] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator isAMLastRetry: true

2016-05-13 12:10:45,303 INFO [Thread-87] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator notified that shouldUnregistered is: true

2016-05-13 12:10:45,304 INFO [Thread-87] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: true

2016-05-13 12:10:45,304 INFO [Thread-87] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: JobHistoryEventHandler notified that forceJobCompletion is true

2016-05-13 12:10:45,304 INFO [Thread-87] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the services

2016-05-13 12:10:45,305 INFO [Thread-87] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping JobHistoryEventHandler. Size of the outstanding queue size is 0