
MapReduce jobs stop executing after upgrading to CDH 5.5.2

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


There are 64 completed mappers and 386 pending ones. The note next to one of the completed task attempts says:

 

Task attempt attempt_1461702427412_0001_m_000014_0 is done from TaskUmbilicalProtocol's point of view. However, it stays in finishing state for too long

 

Below are the contents of the syslog of one of the completed mappers. Hope this helps.

 

2016-04-26 20:44:46,024 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/var/run/cloudera-scm-agent/process/106245-yarn-NODEMANAGER/hadoop-metrics2.properties
2016-04-26 20:44:46,107 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2016-04-26 20:44:46,107 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1461702427412_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@43be5d17)
2016-04-26 20:44:46,207 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2016-04-26 20:44:46,614 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /mnt/data1/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001,/mnt/data2/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001,/mnt/data3/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001,/mnt/data4/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001,/mnt/data5/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001,/mnt/data6/yarn/nm/usercache/gxetl/appcache/application_1461702427412_0001
2016-04-26 20:44:47,585 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2016-04-26 20:44:48,144 INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
2016-04-26 20:44:48,155 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2016-04-26 20:44:48,441 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://nameservice1/user/gxetl/tmp/stage01/2016/04/06/01/45/201604060145-r-00010:268435456+145155877
2016-04-26 20:44:49,280 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 429391868(1717567472)
2016-04-26 20:44:49,280 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1638
2016-04-26 20:44:49,280 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 1374054016
2016-04-26 20:44:49,280 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1717567488
2016-04-26 20:44:49,280 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 429391868; length = 107347968
2016-04-26 20:44:49,290 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2016-04-26 20:44:49,339 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2016-04-26 20:44:49,344 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2016-04-26 20:44:49,345 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2016-04-26 20:44:49,345 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
2016-04-26 20:44:49,774 INFO [main] com.gradientx.gxetl.join.JoinMapper: START SETUP
2016-04-26 20:44:49,779 INFO [main] com.gradientx.gxetl.join.JoinMapper: Working on input split hdfs://nameservice1/user/gxetl/tmp/stage01/2016/04/06/01/45/201604060145-r-00010
2016-04-26 20:45:14,679 INFO [main] com.gradientx.gxetl.join.JoinMapper: END SETUP
2016-04-26 20:46:12,925 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2016-04-26 20:46:12,925 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-04-26 20:46:12,925 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 640429925; bufvoid = 1717567488
2016-04-26 20:46:12,925 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 429391868(1717567472); kvend = 428564456(1714257824); length = 827413/107347968
2016-04-26 20:46:13,381 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
2016-04-26 20:46:16,045 INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 0
2016-04-26 20:46:16,049 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1461702427412_0001_m_000007_0 is done. And is in the process of committing
2016-04-26 20:46:16,138 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1461702427412_0001_m_000007_0' done.
2016-04-26 20:46:16,239 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


The completed-mapper logs do not provide any new information beyond the fact that the mappers are done. Getting the logs of a stuck container and turning the NodeManager log level up to DEBUG (it is INFO in all of the logs you have posted here) would be very helpful for identifying why completed containers are stuck.
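In case it helps with collecting those logs: assuming log aggregation is enabled, the stock YARN CLI can dump every container's log for the application once it has finished or been killed. A rough sketch, reusing the application ID and the gxetl user visible in your output:

# dump aggregated logs for all containers of the stuck application
yarn logs -applicationId application_1461702427412_0001 -appOwner gxetl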

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


You can also try turning off NodeManager recovery. This will prevent the NodeManager from trying to recover previously uncleaned containers.
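For reference, the switch behind this is the yarn-site.xml property below. This is only a sketch in stock Hadoop syntax; on a CM-managed cluster the file is generated for you, so the change would have to go in through a configuration override rather than a hand edit:

<!-- stop the NodeManager from trying to resume containers left over from a previous run -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>false</value>
</property>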

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


I already have all logging thresholds set to DEBUG. Should I go even lower, to TRACE?

 

Also, how do I turn off recovery in Cloudera Manager? I only see the setting for NodeManager Recovery Directory.

 

Thanks.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


In CM, go to YARN > Configuration and search for "log". You will find a configuration option called "NodeManager Logging Threshold" with six radio buttons; set it and then restart YARN.
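If you want to double-check that the new threshold really took effect on a running NodeManager, the daemonlog tool can query the live logger. A rough example, where nm-host is a placeholder for one of your NodeManager hosts and 8042 is only the default NodeManager web port:

# ask the running NodeManager what level its loggers are at
hadoop daemonlog -getlevel nm-host:8042 org.apache.hadoop.yarn.server.nodemanager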

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


It looks like Cloudera Manager hides the NodeManager recovery option. Let's not change it, then.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


The logging threshold is at DEBUG, and YARN was restarted. I'll leave the recovery setting alone since I cannot change it. The only problem left is that there are no hanging tasks running: every task is either completed or pending. Do you have any ideas on how to address this?

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


This could be a direct consequence of the container issue: at some point the stuck containers take up all of the cluster's resources and cannot be released.
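One way to check whether that is what is happening is to compare used against total resources while the job is wedged. A quick sketch with the standard YARN CLI, reusing the application ID from your logs:

# list nodes with their state and number of running containers
yarn node -list -all
# progress and state of the stuck job
yarn application -status application_1461702427412_0001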

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


Do you have any more advice? I am starting to feel that the upgrade will be impossible. We might have to go back to installing packages instead of parcels and only upgrade the components other than HDFS and YARN. But, still, there will be risks with that approach too.

Re: MapReduce jobs stop executing after upgrading to CDH 5.5.2


The only potential issue that I noticed is the cgroup setting. Containers were running as nobody, whereas the required user was something like "gxetl". I confirmed with a colleague who understands YARN in much greater depth that this could cause containers to hang. Other than that, I'm afraid I do not have any more suggestions for how to address your problem.
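For context, the "running as nobody" behaviour is normally what the LinuxContainerExecutor does in non-secure (non-Kerberos) mode: every container is launched as a single configured local user instead of the submitting user. A sketch of the relevant yarn-site.xml properties with the stock defaults (whether CDH/CM overrides these on your cluster is something to verify in your own configuration):

<!-- the Linux container executor is what enforces cgroups -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- in non-secure mode all containers run as this one local user; "nobody" is the default -->
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value>
</property>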