
Map and Reduce Error: Java heap space


Re: Map and Reduce Error: Java heap space

Champion

@desind

 

To add to your point: that cluster-level setting applies to every MapReduce job, so raising it cluster-wide may impact other jobs as well.

 

In fact, I am not against setting a higher value at the cluster level, but base that decision on how many jobs actually need the higher values, their performance requirements, and so on.
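For example, memory settings can be overridden per job at submission time instead of changing the cluster defaults. A minimal sketch, assuming the driver implements `Tool` so that `-D` generic options are honored (the jar name, driver class, and paths below are placeholders, not from this thread):

```shell
# Hypothetical per-job override: these values apply only to this submission,
# leaving the cluster-wide defaults untouched.
# Convention: java.opts heap (-Xmx) is set to roughly 80% of the container size.
hadoop jar my-job.jar com.example.MyDriver \
  -Dmapreduce.reduce.memory.mb=4096 \
  -Dmapreduce.reduce.java.opts=-Xmx3276m \
  /input/path /output/path
```

This keeps the wasteful higher defaults off the jobs that do not need them.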

 

 

Re: Map and Reduce Error: Java heap space

Expert Contributor

@saranvisa After increasing the reducer heap and java.opts, the job worked for a few days, but now we are seeing this issue again: not a single reducer completes, and the job fails after 4 hours with ALL reducers failed.

 

Failed reducer log:

 

dfs.DFSClient: Slow waitForAckedSeqno took 38249ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/custom_attribute_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103397861_29724995, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.21:50010,DS-d5d914b6-6886-443b-9e39-8347c24cc9b7,DISK], DatanodeInfoWithStorage[10.239.121.56:50010,DS-63498815-70ea-48e2-b701-f0c439e38711,DISK]]
2018-03-19 23:54:17,315 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 35411ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/documents_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400051_29727493, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.115:50010,DS-86b207ef-b8ce-4a9f-9f6f-ddc182695296,DISK]]
2018-03-19 23:54:51,983 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 34579ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/form_path_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400111_29727564, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.21:50010,DS-d5d914b6-6886-443b-9e39-8347c24cc9b7,DISK]]
2018-03-19 23:55:47,506 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 55388ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/media_hr_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400160_29727615, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.56:50010,DS-63498815-70ea-48e2-b701-f0c439e38711,DISK]]
2018-03-19 23:55:47,661 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at java.lang.String.replaceAll(String.java:2223)
at com.xxx.ci.acs.extract.CXAService$myReduce.parseEvent(CXAService.java:1589)
at com.xxx.ci.acs.extract.CXAService$myReduce.reduce(CXAService.java:915)
at com.xxx.ci.acs.extract.CXAService$myReduce.reduce(CXAService.java:233)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.

 

What other tuning parameters can we try?

Re: Map and Reduce Error: Java heap space

Contributor

Hi, reviving this old topic to gather more answers on the subject.

 

I have errors with mappers and reducers running out of memory. Of course, increasing the memory fixes the issue, but as already mentioned, I am then wasting memory on jobs that don't need it.

Plus, I thought this framework was built to scale, so it would handle a particularly large job simply by splitting it up.

In other words, I don't want to change memory values every time a new application fails due to memory limits.

 

What is the best practice in this case?


Thanks

O.

Re: Map and Reduce Error: Java heap space

Expert Contributor

In our case the reducers were failing with an OOM error, so we first increased the reducer memory (mapreduce.reduce.memory.mb) and mapreduce.reduce.java.opts. After a few days the job failed again.

So we opted to keep the existing memory and instead increase the number of reducers from 40 to 60. This resolved our issue and we haven't seen a failure since. We cannot keep increasing reducer memory, which could cause other issues.
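The reducer-count change described above can also be applied per job at submission time. A sketch, assuming a `Tool`-based driver so `-D` options take effect (jar, class, and paths are placeholders; 60 is the count mentioned above):

```shell
# Hypothetical submission: keep the existing heap settings, but spread the
# reduce-side work across more (smaller) reducers for this job only.
hadoop jar my-job.jar com.example.MyDriver \
  -Dmapreduce.job.reduces=60 \
  /input/path /output/path
```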

 

A lower number of reducers creates fewer, but larger, output files. A good rule of thumb is to tune the number of reducers so that each output file is at least half a block in size.

If the reducers complete quickly and generate small files, then there are too many reducers, which was not our case.
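As a back-of-envelope sketch of the rule of thumb above (the 60 GB output size is an assumed figure for illustration, not from this thread):

```shell
# Upper bound on reducer count so each output file is at least half an HDFS block.
TOTAL_OUTPUT_BYTES=$((60 * 1024 * 1024 * 1024))   # assumed ~60 GB of total reduce output
BLOCK_SIZE=$((128 * 1024 * 1024))                 # default 128 MB HDFS block
MAX_REDUCERS=$(( TOTAL_OUTPUT_BYTES / (BLOCK_SIZE / 2) ))
echo "$MAX_REDUCERS"   # 960: any more reducers and files drop below ~64 MB
```

Anywhere below that bound keeps files above half a block; going well under it makes each reducer's share (and its heap pressure) larger.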

Re: Map and Reduce Error: Java heap space

Contributor
OK, I understand your point, but what if the mappers are failing? YARN already launches as many mappers as there are input files; should I increase this further?
Since only a minority of my jobs are failing, how can I tune YARN to use more mappers for those particular jobs?