Map tasks succeeding but not releasing resources

Expert Contributor

Hello community,

 

I'm having issues with some MapReduce jobs I'm running (a simple WordCount as well as TeraGen/TeraSort).

The jobs run fine and succeed, but they are quite slow. I noticed that the map tasks finish after a few seconds but do not release their containers; 60 seconds later, the ApplicationMaster finally kills the containers.
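
If it helps, my current working theory (an assumption on my part, not something I have confirmed for CDH 5.5) is that the 60 seconds are the ApplicationMaster's task-exit timeout, i.e. how long a succeeded attempt may stay in the finishing state before its container is force-killed. Below is a minimal sketch of how I would check the defaults from a client-side Configuration; the property names are my guess at the relevant keys:

import org.apache.hadoop.conf.Configuration;

// Minimal probe (a sketch only): print the timeouts that I suspect control how long
// the AM lets a succeeded map attempt keep its container before killing it. The
// 60000 ms default would match the 60 seconds I see in the log.
public class FinishingTimeoutProbe {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Property names are my assumption for the relevant keys.
        long exitTimeoutMs = conf.getLong("mapreduce.task.exit.timeout", 60000L);
        long checkIntervalMs = conf.getLong("mapreduce.task.exit.timeout.check-interval-ms", 20000L);
        System.out.println("mapreduce.task.exit.timeout                   = " + exitTimeoutMs + " ms");
        System.out.println("mapreduce.task.exit.timeout.check-interval-ms = " + checkIntervalMs + " ms");
    }
}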

 

What are potential reasons for this behaviour, and how can I resolve it?

 

My setup is a single-node Cloudera cluster (CDH 5.5). The ResourceManager has 4 vCPUs and 8 GB of RAM to allocate, and each map task uses 1 vCPU and 1 GB of RAM.
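
For reference, this is how I understand the resource settings behind those numbers (only a sketch: the two yarn.nodemanager.* keys are daemon-side settings that I list here just to record the values, and the heap size in mapreduce.map.java.opts is an assumed example rather than necessarily my actual value):

import org.apache.hadoop.conf.Configuration;

// Sketch of the resource-related properties I believe are in play on my single node.
public class ClusterResourceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Total resources the single NodeManager can offer (handed out by the ResourceManager);
        // set here only to document the values, since these are daemon-side settings.
        conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 4);
        // What each map task requests.
        conf.setInt("mapreduce.map.memory.mb", 1024);
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.set("mapreduce.map.java.opts", "-Xmx820m"); // heap kept below the 1 GB container (assumed value)
        System.out.println("map container: " + conf.get("mapreduce.map.memory.mb") + " MB, "
                + conf.get("mapreduce.map.cpu.vcores") + " vcore(s)");
    }
}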

 

Neither the NodeManager's log nor the map task's log shows anything conspicuous: no JVM errors, and the allocated container memory is not exceeded. Here is an extract from the ApplicationMaster's log:

 

 

2015-12-22 10:28:51,569 INFO [IPC Server handler 5 on 37397] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1450544559411_0015_m_000017_0 is : 1.0
2015-12-22 10:28:51,572 INFO [IPC Server handler 4 on 37397] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from attempt_1450544559411_0015_m_000017_0
2015-12-22 10:28:51,573 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1450544559411_0015_m_000017_0 TaskAttempt Transitioned from RUNNING to SUCCESS_FINISHING_CONTAINER
2015-12-22 10:28:51,573 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with attempt attempt_1450544559411_0015_m_000017_0
2015-12-22 10:28:51,573 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1450544559411_0015_m_000017 Task Transitioned from RUNNING to SUCCEEDED
2015-12-22 10:28:51,573 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 17
...
2015-12-22 10:30:01,405 INFO [Ping Checker] org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:attempt_1450544559411_0015_m_000017_0 Timed out after 60 secs
2015-12-22 10:30:01,405 WARN [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Task attempt attempt_1450544559411_0015_m_000017_0 is done from TaskUmbilicalProtocol's point of view. However, it stays in finishing state for too long
2015-12-22 10:30:01,405 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1450544559411_0015_m_000017_0 TaskAttempt Transitioned from SUCCESS_FINISHING_CONTAINER to SUCCESS_CONTAINER_CLEANUP
2015-12-22 10:30:01,406 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_1450544559411_0015_01_000019 taskAttempt attempt_1450544559411_0015_m_000017_0
2015-12-22 10:30:01,407 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1450544559411_0015_m_000017_0
2015-12-22 10:30:01,409 INFO [ContainerLauncher #8] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : npshadoop02.cc.de:8041
2015-12-22 10:30:01,425 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1450544559411_0015_m_000017_0 TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-12-22 10:30:02,488 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1450544559411_0015_01_000017

2015-12-22 10:30:02,489 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1450544559411_0015_m_000017_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

 

I think this line in particular is striking:

 

Task attempt [...] is done from TaskUmbilicalProtocol's point of view. However, it stays in finishing state for too long
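
As a next step I am considering an experiment (just an idea, and the property name is again my assumption): run the job through ToolRunner so that the suspected timeout can be overridden per run with a -D option, to see whether the map JVMs ever exit on their own before the deadline or are always killed at exactly that deadline.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical experiment driver (a sketch): ToolRunner parses generic options, so a run like
// "hadoop jar my.jar ExitTimeoutExperiment -Dmapreduce.task.exit.timeout=10000 <args>"
// would land the override in getConf(). A real driver would configure and submit the job here.
public class ExitTimeoutExperiment extends Configured implements Tool {

    @Override
    public int run(String[] args) {
        Configuration conf = getConf();
        System.out.println("effective mapreduce.task.exit.timeout = "
                + conf.getLong("mapreduce.task.exit.timeout", 60000L) + " ms");
        // ... set up and submit the actual WordCount/TeraSort job with this conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ExitTimeoutExperiment(), args));
    }
}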

 

Best regards,

Benjamin
