Expert Contributor
Posts: 61
Registered: ‎02-03-2016

MapReduce jobs stop executing after upgrading to CDH 5.5.2

After upgrading from CDH 5.4.8 to CDH 5.5.2, our legacy MapReduce jobs stop executing at the beginning of the reduce phase. We tried rebuilding them against the CDH 5.5.2 repo in the pom, but they still get stuck. This is what we see: the map phase gets to 98% done. A few dozen mappers remain pending while all the rest are done, but no mappers are running. All reducers stay pending. Can anyone explain what could be the cause? I hope it's a configuration change.

 

Thanks,

Ben

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

How long have the jobs been stuck? Are they slowly making progress, or have they just been 98% done for a long time? I am not familiar with what changed between CDH 5.4.8 and CDH 5.5.2, but it is likely a deadlock in MapReduce. Could you upload the MR Application Master log so I can take a look at it?

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Sorry, the AM container log is no longer available. Because it was a production system, we had to roll back, so I cannot generate a new one. I talked to one of the developers, and he suspects that the map-side output format could be deprecated. Since the job stops dead at 98% completion of the map phase, it could be that the mappers cannot write their data to intermediate files to start the reduce phase. Does this logic sound correct to you?

 

Thanks.

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Even if the output format is deprecated, the map tasks shouldn't get stuck because of that. From your description, my suspicion is that there was some kind of deadlock between mappers and reducers, though such a deadlock is extremely rare. I can't be sure without the log files. You can always come back later with the log files if this happens again.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

The only thing I can do is to submit the code here so you can check it for anomalies. Are you or anyone else willing to take a look?

 

Thanks,

Ben

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I can try to see what I can find. Did you see the problem only once, or is it still happening? If it is the former, it is most likely a deadlock problem in MapReduce.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

It has happened multiple times. I asked one of the developers to point me to the code so I could show you. When he did, he saw that the code imports custom libraries. These libraries were created a long time ago by a team that is no longer with the company. He then tried to compile them against the CDH 5.5.2 repos, and they all failed. They are still built against the CDH 4.4.0 repos using MRv1. Do you think this could be it?

 

Thanks,

Ben

Cloudera Employee
Posts: 55
Registered: ‎03-07-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I don't know how MRv1 runs things, and so I don't know what the incompatibilities between v1 and v2 are. According to my colleague, it is unlikely to be caused by the library issue. Depending on what the library is trying to use, the jobs themselves may or may not fail. Based on your problem description, the jobs were stuck rather than failed. There could also be some settings in YARN that caused the issue. With the current information, it is hard to tell what the real issue was.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

Well, the developer is going to try to recompile them using the latest repos. Some were breaking, and he's going to fix them as best he can. We will give the upgrade another try in 2 weeks. I will be certain to capture the logs then.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2

I am able to run the job on a test cluster we just set up for CDH 5.5.2. We then ran the recompiled job, and it still hangs. Here is what is in the AM log.

 

stderr:

Apr 22, 2016 11:32:51 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Apr 22, 2016 11:32:51 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Apr 22, 2016 11:32:51 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Apr 22, 2016 11:32:51 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Apr 22, 2016 11:32:51 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Apr 22, 2016 11:32:51 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Apr 22, 2016 11:32:52 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Apr 22, 2016 11:32:52 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"

 

syslog:

2016-04-22 23:43:06,437 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:06,437 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:07,440 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:07,440 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:08,443 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:08,443 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:09,446 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:09,446 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:10,449 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:10,449 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:11,452 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:11,452 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:12,454 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:12,454 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:13,457 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:13,458 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:14,460 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:14,460 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:15,463 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:15,464 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360
2016-04-22 23:43:16,466 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=<memory:420864, vCores:235>
2016-04-22 23:43:16,467 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold not met. completedMapsForReduceSlowstart 360

 

I hope this helps.

 

Thanks,

Ben
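
A note on the repeated "Reduce slow start threshold not met" lines in the syslog above: as far as I understand the MapReduce AM, the number it prints (completedMapsForReduceSlowstart) is the count of completed maps required before reducers are scheduled, derived from the mapreduce.job.reduce.slowstart.completedmaps fraction and the total number of maps. The sketch below only illustrates that arithmetic. The class and method names are made up for illustration, and the fraction (0.8) and map count (450) are assumptions chosen to reproduce the 360 in the log; the thread does not state the actual values for this job.

```java
public class SlowStartThreshold {

    // Number of completed maps required before reducers may start.
    // Assumption: this mirrors the AM's ceil(fraction * totalMaps) calculation.
    static int completedMapsForReduceSlowstart(double slowStartFraction, int totalMaps) {
        return (int) Math.ceil(slowStartFraction * totalMaps);
    }

    // True once enough maps have completed for reducers to be scheduled.
    static boolean thresholdMet(double slowStartFraction, int totalMaps, int completedMaps) {
        return completedMaps >= completedMapsForReduceSlowstart(slowStartFraction, totalMaps);
    }

    public static void main(String[] args) {
        double fraction = 0.8; // assumed slowstart setting, not taken from the thread
        int totalMaps = 450;   // hypothetical total map count, not taken from the thread

        int required = completedMapsForReduceSlowstart(fraction, totalMaps);
        System.out.println("Completed maps required before reducers start: " + required);

        // With one map short of the requirement, reducers stay pending.
        System.out.println("Met with " + (required - 1) + " completed maps? "
                + thresholdMet(fraction, totalMaps, required - 1));
        System.out.println("Met with " + required + " completed maps? "
                + thresholdMet(fraction, totalMaps, required));
    }
}
```

If this reading is right, the log shows the AM waiting for 360 maps to finish while reporting plenty of headroom, so the useful thing to compare in the full syslog would be the completed map count against that number and whether any map containers are still being requested.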
