Dear Members,

I have written a MapReduce job in which the map tasks take an input data set of around 10 GB and, after processing it, generate 33 GB of intermediate data. The reduce tasks then process this 33 GB and produce a final output of 25 GB. I use NLineInputFormat as the input format for this job. 64 map tasks are created, and I have asked Hadoop to create 52 reduce tasks. My cluster configuration is as follows: 8 nodes, each with an Intel i7 processor (8 cores), 8 GB RAM, and a 350 GB HDD.
Note that the lines in the input data set vary in length. Since I specify "n", the number of lines each map task has to process, the amount of data each map task handles also varies: one map task may process 150 MB of data while another processes only 40 MB.
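For reference, here is a rough sketch of how the driver is configured (class names, paths, and the lines-per-map value are placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {

    // Placeholder stubs; my real mapper and reducer do the actual processing.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> { }
    public static class MyReducer extends Reducer<Text, Text, Text, Text> { }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my-job");           // placeholder job name
        job.setJarByClass(MyJobDriver.class);

        // NLineInputFormat gives each map task a fixed number of input lines.
        // Because line lengths vary, the bytes per map task also vary
        // (roughly 40 MB for some tasks, 150 MB for others).
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100000);   // placeholder value of "n"

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(52);                            // 52 reduce tasks, as mentioned above

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}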
When I submit the job from the terminal, the only property I specify is -D mapreduce.reduce.shuffle.input.buffer.percent=0.4, because I was getting the error "error in shuffle in fetcher".
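To illustrate, assuming the driver is submitted through ToolRunner (which is what makes the -D generic option take effect), setting the same property programmatically would look roughly like this (a sketch only, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // already includes any -D overrides
        // Equivalent of: -D mapreduce.reduce.shuffle.input.buffer.percent=0.4
        // This limits the fraction of the reducer heap used to buffer map output
        // during the shuffle; I lowered it to 0.4 after hitting the
        // "error in shuffle in fetcher" failures.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.4f);
        // ... the rest of the job setup is as in the sketch above ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobTool(), args));
    }
}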
With this setup, my job takes around 25 minutes to complete. Am I missing any performance tuning parameters here? If there are tuning parameters I should try, which ones would you recommend? I need your advice. Thanks