Support Questions

saranpons3 · 07-31-2017

Hello all, I would like to know where a map task's intermediate data is written. Does context.write() write the data to the hard disk, or send it over the network immediately after it is generated? Which Hadoop parameters should be tuned when the intermediate data generated by the map() tasks exceeds 45 GB and a large amount of data (for example, more than 50 GB) has to be shuffled over the network in a multi-node cluster? Will I get any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data? Thanks in advance.

asinghal · 08-07-2017

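Map intermediate data is not sent over the network as soon as it is generated. context.write() places each record in an in-memory buffer (sized by io.sort.mb). The buffer contents are sorted and spilled to the local disk of the map node, and the reducer machines then fetch the map output from there during the shuffle.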
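To make the buffer, spill, and merge path concrete, here is a toy model in plain Java. It is only a sketch and not Hadoop code: the class SpillBufferSketch and its methods are hypothetical, the threshold is counted in records rather than bytes, and the partitioning by reducer and the background spill thread of a real map task are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of the map-side collect/spill/merge path (hypothetical, not Hadoop code).
class SpillBufferSketch {
    private final int spillThreshold;                            // records held in memory before a spill
    private final List<String> buffer = new ArrayList<>();       // stands in for the in-memory sort buffer
    private final List<List<String>> spills = new ArrayList<>(); // stands in for spill files on local disk

    SpillBufferSketch(int spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    // Analogue of context.write(): the record goes into memory first, not to the network.
    void write(String key) {
        buffer.add(key);
        if (buffer.size() >= spillThreshold) {
            spill();
        }
    }

    // Sort the buffered records and move them to a new "spill file".
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    // End of the map task: flush what is left, then merge all sorted runs into one sorted output.
    List<String> close() {
        spill();
        // Each heap entry is {spill index, position inside that spill}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> spills.get(a[0]).get(a[1]).compareTo(spills.get(b[0]).get(b[1])));
        for (int i = 0; i < spills.size(); i++) {
            heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            List<String> run = spills.get(top[0]);
            merged.add(run.get(top[1]));
            if (top[1] + 1 < run.size()) {
                heap.add(new int[] {top[0], top[1] + 1});
            }
        }
        return merged;
    }

    int spillCount() {
        return spills.size();
    }

    public static void main(String[] args) {
        SpillBufferSketch task = new SpillBufferSketch(3);
        for (String k : new String[] {"d", "a", "c", "b", "f", "e", "g"}) {
            task.write(k);
        }
        List<String> out = task.close();
        System.out.println(task.spillCount() + " spills, merged output: " + out);
    }
}
```

In this model the merged list is what the reducers would later fetch. Nothing leaves the map node until the map task has produced it.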

You can reduce the map output and the cost of the shuffle in several ways:

Use a combiner between the map and reduce phases.
Compress the map output with Gzip to save network I/O, at the cost of extra CPU (mapred.compress.map.output=true, mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec).
Decrease the split size (this distributes the map tasks across more servers) and increase the number of reducers, so that each reducer has less data to sort and process.
Turn off speculative execution (mapred.map.tasks.speculative.execution=false).
If you can optimize the sorting, change the sort algorithm through map.sort.class.
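As an illustration of the first suggestion, the sketch below shows in plain Java why a combiner shrinks the data that has to be shuffled. It is a toy word-count model, not Hadoop code: the class CombinerSketch, its methods, and the sample keys are all hypothetical, and it assumes the reduce operation is a sum, which is safe to pre-aggregate on the map side.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of map-side pre-aggregation (hypothetical, not Hadoop code).
class CombinerSketch {
    // Without a combiner, every (word, 1) pair emitted by the map is shuffled as-is.
    static int recordsWithoutCombiner(List<String> mapOutputKeys) {
        return mapOutputKeys.size();
    }

    // With a combiner, one partial sum per distinct key is computed on the map side,
    // so only one record per key leaves the map node.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partialSums = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            partialSums.merge(key, 1, Integer::sum);
        }
        return partialSums;
    }

    public static void main(String[] args) {
        List<String> mapOutputKeys = Arrays.asList("a", "b", "a", "a", "c", "b");
        System.out.println("records shuffled without combiner: " + recordsWithoutCombiner(mapOutputKeys));
        System.out.println("records shuffled with combiner:    " + combine(mapOutputKeys).size());
    }
}
```

In a real job the combiner is registered on the Job object with setCombinerClass. The saving depends on how often keys repeat within one map task, and a combiner is only correct when the reduce logic can be applied to partial results.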

Regarding "Will I get any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data?":

Yes, but the impact may not be huge. You can tune it together with io.sort.factor.
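A rough way to see what the two parameters do is to estimate the number of spill files and merge rounds for one map task. The sketch below is a simplified model in plain Java, not Hadoop's actual accounting. The class SortTuningEstimate is hypothetical, the 1024 MB of output per map task is an assumed figure, the 0.80 spill threshold assumes the default of io.sort.spill.percent, and the buffer space used for record metadata is ignored.

```java
// Back-of-the-envelope estimate of spills and merge rounds (hypothetical helper, simplified model).
class SortTuningEstimate {
    // Approximate number of spill files written by one map task.
    // The buffer spills once it is spillPercent full, so each spill holds about ioSortMb * spillPercent.
    static long spills(long mapOutputMb, long ioSortMb, double spillPercent) {
        double usableMb = ioSortMb * spillPercent;
        return (long) Math.ceil(mapOutputMb / usableMb);
    }

    // Merge rounds needed to reduce `files` sorted files to one when at most `factor`
    // files (the io.sort.factor value) are merged at a time.
    static int mergeRounds(long files, int factor) {
        int rounds = 0;
        while (files > 1) {
            files = (files + factor - 1) / factor; // ceiling division
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        long mapOutputMb = 1024; // assumed output of a single map task
        int factor = 10;         // default io.sort.factor
        for (long ioSortMb : new long[] {100, 512}) {
            long s = spills(mapOutputMb, ioSortMb, 0.80);
            System.out.println("io.sort.mb=" + ioSortMb + ": about " + s + " spills, "
                + mergeRounds(s, factor) + " merge round(s)");
        }
    }
}
```

Under these assumptions, raising io.sort.mb from 100 to 512 takes the estimate from about 13 spills and 2 merge rounds to about 3 spills and 1 merge round, which means less data is re-read and re-written on local disk. Both properties can be set cluster-wide in mapred-site.xml, or per job with options such as -D io.sort.mb=512 when the driver is run through ToolRunner. The buffer has to fit inside the map task's JVM heap. On newer Hadoop releases the same settings are named mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor.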

saranpons3 · 08-08-2017

Hi Ankit, I am already using Gzip to compress the output of my reduce tasks. But if I use Gzip compression for the map output, I will not be able to split the map output among the reducers. Correct me if I am wrong! That is why I did not use compression for the map output.

How do I update the sort algorithm? Do you have a tutorial for doing this?

Also, can you explain how to set the io.sort.mb and io.sort.factor parameters?

Cloudera Community

Where is the output (intermediate data) of the map function's context.write(key, value) method written? Is it written to the hard disk or sent over the network immediately after it is generated?