By looking at the log, there were no reducers running, so the job is stuck running mappers. But there is still headroom, as indicated by "headroom=<memory:420864, vCores:235>". How much resource are you requesting for each mapper? Can you post your job config here? MapReduce container allocation has some known issues, and the RM could also have bugs leading to this problem. If possible, can you turn the AM log level to DEBUG and upload the full log later on? Based on all the information I have here, I can rule out the mapper/reducer deadlock issue, but I still cannot pinpoint why the job is stuck in the map phase.
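For clarity, the AM log level I am referring to can normally be raised through the job configuration, along these lines (a minimal sketch; only the property name matters here):

<property>
  <name>yarn.app.mapreduce.am.log.level</name>
  <value>DEBUG</value> <!-- raises MRAppMaster logging from the default INFO -->
</property>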
Is there a way to attach the file?
Each data node has 24 cores, 64 GB, 6 drives x 2 TB.
NodeManager is allocated 45GB; Mappers are allocated 4GB (3.2GB Heap); Reducers are allocated 8GB (6.4GB Heap); AM is allocated 8GB (6.4 GB Heap).
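For reference, in a standard YARN/MRv2 setup those allocations correspond roughly to the following properties (a sketch of the sizes quoted above, not a dump of our actual config files; the MB and heap values are approximations):

<property><name>yarn.nodemanager.resource.memory-mb</name><value>46080</value></property>   <!-- ~45 GB per NodeManager -->
<property><name>mapreduce.map.memory.mb</name><value>4096</value></property>                <!-- 4 GB mapper container -->
<property><name>mapreduce.map.java.opts</name><value>-Xmx3276m</value></property>           <!-- ~3.2 GB mapper heap -->
<property><name>mapreduce.reduce.memory.mb</name><value>8192</value></property>             <!-- 8 GB reducer container -->
<property><name>mapreduce.reduce.java.opts</name><value>-Xmx6553m</value></property>        <!-- ~6.4 GB reducer heap -->
<property><name>yarn.app.mapreduce.am.resource.mb</name><value>8192</value></property>      <!-- 8 GB AM container -->
<property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx6553m</value></property> <!-- ~6.4 GB AM heap -->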
I didn't see anywhere we can upload a file, unfortunately. Can you provide an external link to the log files here?
Thanks for providing the logs. I looked at the AM logs and noticed two things:
1. 64 TaskAttempts time out after they report a progress of 1.0, indicating there could be a NodeManager failure or a network issue.
2. The job only got two resource allocations, 64 containers in total.
I suspect there is a NodeManager failure or a ResourceManager issue, so the job can never get resources and is therefore stuck in the map phase. Can you also link some NodeManager logs as well as the ResourceManager logs? Since you are now running on a test cluster, if you'd really like to get to the bottom of the issue, is it possible for you to turn all the log levels to DEBUG?
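On the job side, the map and reduce task log levels can be raised with properties roughly like the ones below, in addition to the AM level mentioned earlier (a sketch; the NodeManager and ResourceManager daemon log levels are controlled separately through their log4j configuration):

<property>
  <name>mapreduce.map.log.level</name>
  <value>DEBUG</value>
</property>
<property>
  <name>mapreduce.reduce.log.level</name>
  <value>DEBUG</value>
</property>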
I ran the job overnight, and it never completed. It did, however, take down the YARN ResourceManager and multiple NodeManagers after 5 or 6 hours. Out of 450 mappers, only 64 completed; 386 are pending and 0 are running. The pending mappers are in the Scheduled state.
Here are the logs I could get. I set the log level to DEBUG and included the log of a completed NodeManager.
We know now that the custom MapReduce jobs are not causing the issue. We decided to upgrade our 2nd production cluster to CDH 5.5.2, and the same thing happened again. This cluster is used to process users' Hive queries and jobs, mainly for data analysis. What we saw is that all the mappers would finish, but only half the reducers would finish; the other half would be pending, with none running. This job would block all the following jobs, leaving them in a pending state. After killing the blocking job, the pending jobs would go through. This happens randomly, over and over. There was one time when 2 jobs were stalled and blocking; killing those 2 let the other jobs go through. I have a feeling that some change in YARN between CDH 5.4.8 and CDH 5.5.2 is the cause. I have come to believe that it might be the scheduler not letting pending tasks start. What are your thoughts? I have gathered the logs of one of these blocking jobs.
I hope you can give me some advice or hints to where to go for a solution.
"This cluster is used to process users' hive queries and jobs mainly for data analysis. What we saw is all the mappers would finish, but only half the reducers would finish, the other half would be pending, and none running. This job would block all the following jobs leaving them in a pending state. After killing the blocking job, the pending jobs would go through" Each though it sounds like a deadlock problem, I guess it is caused by something else because it is not rare according to your description.
From the logs you uploaded, I noticed one common symptom on both occasions: a lot of container expiration messages in both the AM and the ResourceManager, roughly within the same time window. Strangely, the containers seemed to be stuck after they reported to the AM that they were done.
Looking at the NodeManager log you uploaded (the one with DEBUG level), I noticed a lot of errors while deleting application logs. My guess is that all the containers were stuck because of this problem, so the cluster can never reclaim them, and the RM can never allocate resources to the jobs because no containers are reported as available. I still need to verify whether that is indeed the root cause.
The error sample from Node Manager:
2016-04-22 18:34:37,080 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /var/log/hadoop-yarn/container/application_1461345890168_0001 returned with exit code: 255
2016-04-22 18:34:37,082 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Output from LinuxContainerExecutor's deleteAsUser follows:
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 3
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : run as user is nobody
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested yarn user is gxetl
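For context, on a non-Kerberized cluster the deleteAsUser path in this error is typically governed by the NodeManager settings sketched below; the user and path shown are simply taken from the log lines above, not from your actual configuration:

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value> <!-- why the delete runs as "nobody" while the requested yarn user is gxetl -->
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/container</value> <!-- the directory the failing delete targets -->
</property>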
If possible, you could run a single NodeManager in your test cluster and keep its log level at DEBUG, so that by viewing one NodeManager log I can tell what is happening on the NodeManager side across the whole cluster.
I brought the QA cluster down to 2 NodeManagers. Here are the log files for those 2 after the job stalled.