Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Where context.write(key,value) method's output(intermediate data) of map function is written? Is it written into hard disk or send to network immediately after generation?

Where context.write(key,value) method's output(intermediate data) of map function is written? Is it written into hard disk or send to network immediately after generation?

New Contributor

Hello All, I would like to know that where map's intermediate data is written that is context.write() writes data to hard disk or network immediately after its generation? Which Hadoop parameter to be tuned when the amount of intermediate data generated by map() task is over 45 GB and a huge amount of data(for example above 50 GB) to be shuffled over the network in a multinode cluster set up? Will i get any performance improvement if i increase io.sort.mb paramter when Map() task generates huge amount of data? Thanks in advance.

2 REPLIES 2
Highlighted

Re: Where context.write(key,value) method's output(intermediate data) of map function is written? Is it written into hard disk or send to network immediately after generation?

Map intermediate data will be written and sorted on local disk before sending to the reducer machines.

You can reduce Map output

  • Use Combiner in between
  • Compressing it with Gzip to save network IO but there will be a tradeoff for CPU (mapred.compress.map.output=true, mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec)
  • Decrease split size(this will distribute map across server) and increase the number of reducers so that they have fewer amount of data to sort and process
  • stop speculative execution(

    mapred.map.tasks.speculative.execution=false)

  • If you can optimize on sorting, update algorithm for map.sort.class

bq. Will i get any performance improvement if i increase io.sort.mb paramter when Map() task generates huge amount of data?

Yes (but impact may not be huge), you can use with io.sort.factor

Re: Where context.write(key,value) method's output(intermediate data) of map function is written? Is it written into hard disk or send to network immediately after generation?

New Contributor

Hi Ankit, I'm already using Gzip for compressing my reduce tasks output. But If i use gzip compression for map output i will not be able to split map output among reducers. correct me if i am wrong!!!? so i didnt use compression for map output.

how to update sort algorithm? you have any tutorial for doing this?

Also, can you explain me how to set io.sort.mb and io.sort.factor parameters?

Don't have an account?
Coming from Hortonworks? Activate your account here