Created 07-31-2017 01:44 PM
Hello All, I would like to know where a map task's intermediate data is written. That is, does context.write() write data to hard disk or to the network immediately after it is generated? Which Hadoop parameters should be tuned when the amount of intermediate data generated by the map() task is over 45 GB and a huge amount of data (for example, above 50 GB) has to be shuffled over the network in a multi-node cluster setup? Will I get any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data? Thanks in advance.
Created 08-07-2017 10:06 AM
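Map intermediate data is written to the local disk of the node running the map task and sorted there before it is sent to the reducer machines. It is not pushed over the network as soon as it is generated.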
You can reduce Map output
mapred.map.tasks.speculative.execution=false)
bq. Will I get any performance improvement if I increase the io.sort.mb parameter when the map() task generates a huge amount of data?
Yes, though the impact may not be huge. You can tune it together with io.sort.factor.
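To give a feel for why io.sort.mb and io.sort.factor matter together, here is a rough back-of-the-envelope sketch in Java. It is not Hadoop code and does not use any Hadoop API. It assumes a simplified model: the map task spills to local disk each time the sort buffer (io.sort.mb) fills up to a spill threshold (assumed here to be 0.80, which I believe is the default of io.sort.spill.percent), and the spill files are then merged in rounds of up to io.sort.factor files each. The class name, method names, and the 1000 MB figure are made up for illustration, and the real merge logic in Hadoop is more involved.

```java
// Rough model only: estimates spill files and merge rounds for ONE map task.
// Not Hadoop code; names and numbers here are hypothetical.
final class SpillEstimate {

    // Estimated number of spill files written to local disk.
    // mapOutputMb: map output of a single task in MB (assumed known)
    // sortMb: value of io.sort.mb
    // spillPercent: assumed spill threshold, e.g. 0.80
    static long estimateSpills(double mapOutputMb, int sortMb, double spillPercent) {
        double usableBufferMb = sortMb * spillPercent;
        return Math.max(1L, (long) Math.ceil(mapOutputMb / usableBufferMb));
    }

    // Estimated number of merge rounds, assuming each round merges
    // groups of up to sortFactor files (value of io.sort.factor).
    static int estimateMergeRounds(long spillFiles, int sortFactor) {
        if (sortFactor < 2) {
            throw new IllegalArgumentException("sortFactor must be at least 2");
        }
        int rounds = 0;
        long files = spillFiles;
        while (files > 1) {
            files = (files + sortFactor - 1) / sortFactor; // ceiling division
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        double mapOutputMb = 1000.0; // hypothetical output of one map task
        int[][] settings = { {100, 10}, {512, 100} }; // {io.sort.mb, io.sort.factor}
        for (int[] s : settings) {
            long spills = estimateSpills(mapOutputMb, s[0], 0.80);
            int rounds = estimateMergeRounds(spills, s[1]);
            System.out.println("io.sort.mb=" + s[0] + " io.sort.factor=" + s[1]
                    + " spills=" + spills + " mergeRounds=" + rounds);
        }
    }
}
```

Under this model, a map task that emits 1000 MB produces 13 spill files with io.sort.mb=100 but only 3 with io.sort.mb=512, and a larger io.sort.factor lets those files be merged in fewer rounds. Fewer spills and merge rounds mean less local disk I/O, which is where the (possibly modest) gain comes from. Keep in mind that io.sort.mb has to fit inside the map task's JVM heap.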
Created 08-08-2017 03:20 PM
Hi Ankit, I'm already using Gzip to compress my reduce task output. But if I use gzip compression for the map output, I will not be able to split the map output among the reducers. Correct me if I am wrong! So I didn't use compression for the map output.
How do I update the sort algorithm? Do you have any tutorial for doing this?
Also, can you explain how to set the io.sort.mb and io.sort.factor parameters?