Map and Reduce Error: Java heap space

Explorer

I'm using the QuickStart VM with CDH 5.3, trying to run a modified sample from MR-parquet read. It worked fine on a 10M-row Parquet table, but I get a "Java heap space" error on a table with 40M rows:

 

[cloudera@quickstart sep]$ yarn jar testmr-1.0-SNAPSHOT.jar TestReadParquet /user/hive/warehouse/parquet_table out_file18 -Dmapreduce.reduce.memory.mb=5120 -Dmapreduce.reduce.java.opts=-Xmx4608m -Dmapreduce.map.memory.mb=5120 -Dmapreduce.map.java.opts=-Xmx4608m
16/10/03 12:19:30 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/10/03 12:19:31 INFO input.FileInputFormat: Total input paths to process : 1
Oct 03, 2016 12:19:31 PM parquet.Log info
INFO: Total input paths to process : 1
Oct 03, 2016 12:19:31 PM parquet.Log info
INFO: Initiating action with parallelism: 5
Oct 03, 2016 12:19:31 PM parquet.Log info
INFO: reading another 1 footers
Oct 03, 2016 12:19:31 PM parquet.Log info
INFO: Initiating action with parallelism: 5
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
16/10/03 12:19:31 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/10/03 12:19:31 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
Oct 03, 2016 12:19:31 PM parquet.Log info
INFO: There were no row groups that could be dropped due to filter predicates
16/10/03 12:19:32 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 12:19:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475517800829_0009
16/10/03 12:19:33 INFO impl.YarnClientImpl: Submitted application application_1475517800829_0009
16/10/03 12:19:33 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1475517800829_0009/
16/10/03 12:19:33 INFO mapreduce.Job: Running job: job_1475517800829_0009
16/10/03 12:19:47 INFO mapreduce.Job: Job job_1475517800829_0009 running in uber mode : false
16/10/03 12:19:47 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 12:20:57 INFO mapreduce.Job: map 100% reduce 0%
16/10/03 12:20:57 INFO mapreduce.Job: Task Id : attempt_1475517800829_0009_m_000000_0, Status : FAILED
Error: Java heap space
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143


I've also tried editing /etc/hadoop/conf/mapred-site.xml, and tried changing it via the Cloudera Manager GUI (Clusters -> HDFS -> ... Java Heap Size of DataNode in Bytes).

 

[cloudera@quickstart sep]$ free -m
             total       used       free     shared    buffers     cached
Mem:         13598      13150        447          0         23        206
-/+ buffers/cache:      12920        677
Swap:         6015       2187       3828

 

Mapper class:

 

// Imports used by the enclosing class: java.io.IOException,
// org.apache.hadoop.io.{LongWritable, NullWritable, Text},
// org.apache.hadoop.mapreduce.Mapper, parquet.example.data.Group

public static class MyMap extends
        Mapper<LongWritable, Group, NullWritable, Text> {

    @Override
    public void map(LongWritable key, Group value, Context context)
            throws IOException, InterruptedException {
        NullWritable outKey = NullWritable.get();
        String outputRecord = "";
        // Get the schema and field values of the record
        // String inputRecord = value.toString();
        // Process the value, create an output record
        // ...
        int field1 = value.getInteger("x", 0);

        if (field1 < 3) {
            context.write(outKey, new Text(outputRecord));
        }
    }
}
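
For reference, this is roughly how such a job is wired up; a minimal ToolRunner-style driver sketch (this is not the actual TestReadParquet source; the class wiring and the map-only setup are only illustrative). One thing worth noting: when a driver relies on ToolRunner/GenericOptionsParser, the -D generic options are typically only picked up if they appear before the positional input/output arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import parquet.hadoop.example.ExampleInputFormat;

public class TestReadParquet extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any -D overrides parsed by GenericOptionsParser
        Job job = Job.getInstance(getConf(), "parquet-read");
        job.setJarByClass(TestReadParquet.class);

        job.setInputFormatClass(ExampleInputFormat.class); // reads parquet Group records
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MyMap.class);
        job.setNumReduceTasks(0); // map-only in this sketch
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TestReadParquet(), args));
    }
}

With a driver like that, the memory -D flags would go right after the class name, before the input and output paths.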

 

1 ACCEPTED SOLUTION

Champion

Please add more memory by editing mapred-site.xml:

 

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx4096m</value>
</property>

Adjust the value in the tag above to what your jobs need (the example sets -Xmx4096m, i.e. 4 GB).

Let me know if that helps.

 

Alternatively, you can also edit the hadoop-env.sh file and add:

export HADOOP_OPTS="-Xmx5096m"
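
If you'd rather size the MapReduce 2 containers per job instead of the global child opts, the usual properties look roughly like this (the values are only an illustration; keep the -Xmx in *.java.opts somewhat below the matching *.memory.mb container size):

<property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1638m</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3276m</value>
</property>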

 


24 REPLIES

Champion

@desind

 

To add to your point, a cluster-wide setting applies to every MapReduce job, so it may also impact other jobs that don't need the extra memory (and non-MapReduce workloads sharing the cluster).

 

In fact, I'm not against setting a higher value on the cluster itself, but base that decision on how many jobs actually require the higher values, their performance needs, and so on; for the occasional heavy job, a per-job override (see the example below) is usually enough.
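
For illustration, a per-job override can be passed on the command line when the driver goes through ToolRunner/GenericOptionsParser (jar name, driver class, and paths below are placeholders; the -D generic options go before the input/output arguments):

yarn jar myjob.jar MyDriver \
    -Dmapreduce.reduce.memory.mb=4096 \
    -Dmapreduce.reduce.java.opts=-Xmx3276m \
    /path/to/input /path/to/output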

 

 

Expert Contributor

@saranvisa After increasing the reducer heap and opts, the job worked for a few days, but now we are seeing this issue again: the job fails after about 4 hours and ALL reducers fail; not a single one completes.

 

Failed reducer log:

 

dfs.DFSClient: Slow waitForAckedSeqno took 38249ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/custom_attribute_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103397861_29724995, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.21:50010,DS-d5d914b6-6886-443b-9e39-8347c24cc9b7,DISK], DatanodeInfoWithStorage[10.239.121.56:50010,DS-63498815-70ea-48e2-b701-f0c439e38711,DISK]]
2018-03-19 23:54:17,315 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 35411ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/documents_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400051_29727493, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.115:50010,DS-86b207ef-b8ce-4a9f-9f6f-ddc182695296,DISK]]
2018-03-19 23:54:51,983 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 34579ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/form_path_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400111_29727564, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.21:50010,DS-d5d914b6-6886-443b-9e39-8347c24cc9b7,DISK]]
2018-03-19 23:55:47,506 WARN [main] org.apache.hadoop.hdfs.DFSClient: Slow waitForAckedSeqno took 55388ms (threshold=30000ms). File being written: /user/hadoop/normalization/6befd9a02400013179aba889/16cb62ff-463a-448b-b1d3-1cf5bb254466/_temporary/1/_temporary/attempt_1517244318452_37939_r_000028_0/media_hr_dir/part-00028.gz, block: BP-71764089-10.239.121.82-1481226593627:blk_1103400160_29727615, Write pipeline datanodes: [DatanodeInfoWithStorage[10.239.121.39:50010,DS-15b1c936-e838-41a2-ab40-7889aab95982,DISK], DatanodeInfoWithStorage[10.239.121.176:50010,DS-ae2d35e1-7a7e-44dc-9016-1d11881d49cc,DISK], DatanodeInfoWithStorage[10.239.121.56:50010,DS-63498815-70ea-48e2-b701-f0c439e38711,DISK]]
2018-03-19 23:55:47,661 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at java.lang.String.replaceAll(String.java:2223)
at com.xxx.ci.acs.extract.CXAService$myReduce.parseEvent(CXAService.java:1589)
at com.xxx.ci.acs.extract.CXAService$myReduce.reduce(CXAService.java:915)
at com.xxx.ci.acs.extract.CXAService$myReduce.reduce(CXAService.java:233)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
2018-03-19 23:55:47,763 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.

 

What are the other tuning parameters we can try ?
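
As a side note, the stack trace shows the OOM being hit inside String.replaceAll(), which compiles its regex on every call; if parseEvent() does that for every record, precompiling the Pattern once avoids a lot of per-record garbage. A minimal sketch, with placeholder names and a placeholder regex rather than the real CXAService code:

import java.util.regex.Pattern;

public class EventParser {
    // Compiled once; String.replaceAll() would build an equivalent Pattern on every call.
    private static final Pattern NON_PRINTABLE = Pattern.compile("[^\\x20-\\x7e]");

    public static String parseEvent(String rawEvent) {
        // Same result as rawEvent.replaceAll("[^\\x20-\\x7e]", " "), but reuses the Pattern.
        return NON_PRINTABLE.matcher(rawEvent).replaceAll(" ");
    }
}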

Explorer

Hi, getting back to this old topic in the hope of more answers on this subject.

 

I have errors with mappers and reducers running short on memory. Of course increasing the memory fixes the issue, but as already mentioned, I am then wasting memory on jobs that don't need it.

Plus, I thought this stack was built to scale, so that it would handle a particularly large job simply by splitting it.

In other words, I don't want to change memory values every time a new application fails due to memory limits.

 

What is the best practice in this case?


Thanks

O.

Expert Contributor

In our case the reducers were failing with an OOM issue, so we first increased the reducer memory (mapreduce.reduce.memory.mb) and mapreduce.reduce.java.opts. After a few days the job failed again.

So we kept the existing memory and increased the number of reducers from 40 to 60. This resolved our issue and we haven't seen a failure since. We can't keep increasing reducer memory, which could cause other issues.

 

A lower number of reducers will create fewer, but larger, output files. A good rule of thumb is to tune the number of reducers so that the output files are at least half a block in size.

If the reducers complete quickly and generate lots of small files, there are too many reducers, which was not the case for us.
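
For reference, the reducer count can be bumped either in the driver or per run; the 60 below is just the value from our case, and with a 128 MB block size (an assumption, check dfs.blocksize) the half-block rule of thumb means aiming for output files of roughly 64 MB or more:

// In the driver, after creating the Job:
job.setNumReduceTasks(60);

// Or per run, without code changes (ToolRunner/GenericOptionsParser drivers):
//   yarn jar myjob.jar MyDriver -Dmapreduce.job.reduces=60 <in> <out>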

Explorer
OK, I understand your point, but what if the mappers are failing? YARN already creates as many mappers as there are input files/splits; should I increase this further?
Since only a minority of my jobs are failing, how can I tune YARN to use more mappers for these particular jobs?