Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Looking at the log, there were no reducers running, so the job is stuck running mappers. But there is still headroom, as indicated by "headroom=<memory:420864, vCores:235>". How much resource are you requesting for each mapper? Can you post your job config here? MapReduce container allocation has had some issues, and the RM could also have bugs leading to this problem. If possible, can you turn the AM log level to DEBUG and upload the full log later on? Based on all the information I have here, I am able to rule out the mapper/reducer deadlock issue, but I still cannot pinpoint why the job is stuck in the map phase.
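If it helps, the AM log level can be raised per job from the driver, without touching the cluster config. Here is a minimal sketch (property names are the stock MRv2 ones; the values are examples, not your actual settings):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DebugAmJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-mapper container request; please post whatever your real value is.
        conf.set("mapreduce.map.memory.mb", "4096");
        // Raise the MRAppMaster log level so the allocation flow shows up in the AM syslog.
        conf.set("yarn.app.mapreduce.am.log.level", "DEBUG");

        Job job = Job.getInstance(conf, "allocation-debug");
        // ...set mapper, reducer, input and output paths as usual, then submit:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}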

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

[ Edited ]

Is there a way to attach the file?

 

FYI.

 

Each data node has 24 cores, 64 GB of RAM, and 6 x 2 TB drives.

NodeManager is allocated 45 GB; mappers are allocated 4 GB (3.2 GB heap); reducers are allocated 8 GB (6.4 GB heap); the AM is allocated 8 GB (6.4 GB heap).
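For reference, those allocations correspond roughly to the standard properties below (a sketch of the equivalent settings; the NodeManager value lives in yarn-site.xml / Cloudera Manager rather than in the job config):

import org.apache.hadoop.conf.Configuration;

public final class ClusterAllocations {
    // Sketch of the allocations described above; the NodeManager value is a
    // yarn-site.xml / Cloudera Manager setting, shown here only for reference.
    public static void apply(Configuration conf) {
        conf.set("yarn.nodemanager.resource.memory-mb", "46080");    // 45 GB per NodeManager
        conf.set("mapreduce.map.memory.mb", "4096");                 // 4 GB map container
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");            // ~3.2 GB map heap
        conf.set("mapreduce.reduce.memory.mb", "8192");              // 8 GB reduce container
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");         // ~6.4 GB reduce heap
        conf.set("yarn.app.mapreduce.am.resource.mb", "8192");       // 8 GB AM container
        conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx6553m"); // ~6.4 GB AM heap
    }

    private ClusterAllocations() {}
}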

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Unfortunately I didn't see anywhere to upload a file. Can you provide an external link to the log files here?

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

[ Edited ]
Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Thanks for providing the logs. I looked at the AM logs and noticed two things:

1. 64 TaskAttempts time out after they report a progress of 1.0, indicating there could be a NodeManager failure or a network issue.

2. The job only got two resource allocations, 64 containers in total.

 

I suspect there is a NodeManager failure or a ResourceManager issue, so the job can never get resources and is therefore stuck in the map phase. Can you also link some NodeManager logs as well as the ResourceManager logs? Since you are now running on a test cluster, if you'd really like to get to the bottom of the issue, is it possible for you to turn all the log levels to DEBUG?

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I ran the job overnight, and it never completed. But it did take down the YARN ResourceManager and multiple NodeManagers after 5 or 6 hours. Out of 450 mappers, only 64 completed; 386 were pending and 0 were running. The pending mappers are in a Scheduled state.

 

Here are the logs I could get. I set the logging level to DEBUG. I included the log of the completed NodeManager.

https://dl.dropboxusercontent.com/u/4522904/hadoop-cmf-yarn2-NODEMANAGER-prod-dc1-datanode158.pdc1i....

https://dl.dropboxusercontent.com/u/4522904/hadoop-cmf-yarn2-RESOURCEMANAGER-prod-dc1-datanode151.pd...

https://dl.dropboxusercontent.com/u/4522904/stderr

https://dl.dropboxusercontent.com/u/4522904/syslog

https://dl.dropboxusercontent.com/u/4522904/nm_syslog

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Hi again,

 

We now know that the custom MapReduce jobs are not causing the issue. We decided to upgrade our 2nd production cluster to CDH 5.5.2, and the same thing happened again. This cluster is used to process users' Hive queries and jobs, mainly for data analysis. What we saw is that all the mappers would finish, but only half the reducers would finish; the other half would be pending, and none running. This job would block all the following jobs, leaving them in a pending state. After killing the blocking job, the pending jobs would go through. This would happen randomly, over and over. There was one time that 2 jobs were stuck and blocking; killing those 2 let the other jobs go through. I have a feeling that some change to YARN between CDH 5.4.8 and CDH 5.5.2 is the cause. I have come to believe that it might be the scheduler not letting pending tasks start. What are your thoughts? I have gathered the logs of one of these blocking jobs.

 

https://dl.dropboxusercontent.com/u/4522904/hadoop-cmf-yarn-RESOURCEMANAGER-mint-ha1.lax.adconion.co...

https://dl.dropboxusercontent.com/u/4522904/hadoop-cmf-yarn-RESOURCEMANAGER-mint-rm.lax.adconion.com...

https://dl.dropboxusercontent.com/u/4522904/hadoop-cmf-yarn-NODEMANAGER-mint-dn31.lax.adconion.com.l...

https://dl.dropboxusercontent.com/u/4522904/am_syslog_2

https://dl.dropboxusercontent.com/u/4522904/am_stderr_2

 

I hope you can give me some advice or hints on where to go for a solution.

 

Thanks,

Ben

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

"This cluster is used to process users' hive queries and jobs mainly for data analysis. What we saw is all the mappers would finish, but only half the reducers would finish, the other half would be pending, and none running. This job would block all the following jobs leaving them in a pending state. After killing the blocking job, the pending jobs would go through"  Each though it sounds like a deadlock problem, I guess it is caused by something else because it is not rare according to your description. 

From the logs you uploaded, I noticed one common symptom on both occasions: a lot of container expiration messages in both the AM and the ResourceManager, roughly within the same time window. Strangely, the containers seemed to be stuck after they reported to the AM that they were done.

Looking at the NodeManager log you uploaded (the one with DEBUG level), I noticed a lot of errors while deleting application logs. My guess is that all the containers were stuck because of this problem, so the cluster could never reclaim them, and the RM could never allocate resources to the jobs because no containers were reported as available. Though I still need to verify that this is indeed the root cause.

Here is the error sample from the NodeManager:

2016-04-22 18:34:37,080 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: DeleteAsUser for /var/log/hadoop-yarn/container/application_1461345890168_0001 returned with exit code: 255
ExitCodeException exitCode=255:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
    at org.apache.hadoop.util.Shell.run(Shell.java:478)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:520)
    at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:295)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2016-04-22 18:34:37,082 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Output from LinuxContainerExecutor's deleteAsUser follows:
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : command provided 3
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : run as user is nobody
2016-04-22 18:34:37,082 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested yarn user is gxetl
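For context on the last two lines: "run as user is nobody" against "requested yarn user is gxetl" is what the LinuxContainerExecutor does in nonsecure mode, where container launches and deletions run as a fixed local user instead of the submitting user. Below is a minimal sketch (for reference only, using the stock property names; not a recommended change) that reads the two NodeManager settings involved:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrintExecutorSettings {
    public static void main(String[] args) {
        // Loads yarn-site.xml from the classpath, as the NodeManager would.
        YarnConfiguration conf = new YarnConfiguration();

        // Which executor the NodeManager runs (the log above shows LinuxContainerExecutor).
        System.out.println("executor = "
                + conf.get("yarn.nodemanager.container-executor.class"));

        // In nonsecure mode the LinuxContainerExecutor launches and deletes as this
        // fixed local user ("nobody" by default), not as the submitting user,
        // which matches the "run as user is nobody" line above.
        System.out.println("nonsecure local user = "
                + conf.get("yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user",
                           "nobody"));
    }
}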

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

If possible, you could run a single NodeManager in your test cluster and keep its log level at DEBUG, so that by viewing one NodeManager log I can tell what is happening on the NodeManager side across the whole cluster.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

If this is of any help: on both clusters, I had to turn off log aggregation because none of the jobs would start. They would be stuck in a pending state and never start. Could this be a symptom?
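For reference, the switch I toggled is the cluster-wide log aggregation flag. A minimal sketch (assuming the stock YARN property name) of checking its effective value from a client:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class CheckLogAggregation {
    public static void main(String[] args) {
        // Reads yarn-site.xml from the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        // yarn.log-aggregation-enable is the cluster-wide switch
        // ("Enable Log Aggregation" in Cloudera Manager).
        boolean enabled = conf.getBoolean(
                YarnConfiguration.LOG_AGGREGATION_ENABLED,
                YarnConfiguration.DEFAULT_LOG_AGGREGATION_ENABLED);
        System.out.println("yarn.log-aggregation-enable = " + enabled);
    }
}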

 

Thanks,

Ben
