
How to tune the performance of a Hadoop MapReduce job running on a cluster?


Dear Members, I have written a MapReduce job in which the map phase takes around 10 GB of input data and, after processing it, generates 33 GB of intermediate data. The reduce phase then processes this 33 GB and produces a final output of 25 GB. I use NLineInputFormat as the input format for this job. 64 map tasks are created, and I have asked Hadoop to create 52 reduce tasks. My cluster configuration is as follows: 8 nodes, each with an Intel i7 processor (8 cores), 8 GB RAM, and a 350 GB HDD.
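For reference, my job submission looks roughly like this (the jar name, driver class, and HDFS paths below are placeholders, not my real ones, and I am assuming the driver uses ToolRunner so that `-D` options are picked up):

```shell
# Sketch of the submission command; my-job.jar, com.example.MyJob, and
# the HDFS paths are placeholders. The reduce-task count (52) is passed
# as a generic option via -D.
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.job.reduces=52 \
  /user/me/input /user/me/output
```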

Please note that every line in the input data set varies in length. Since I specify the number of lines ("n") each map task has to process, the amount of data each map task handles also varies: one map task may process 150 MB of data while another processes only 40 MB.
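With NLineInputFormat, the number of map tasks is driven by the lines-per-map setting rather than by the byte size of the input splits, which is why the per-task data volume is uneven. A sketch of how that knob is set on the command line (the property name is the Hadoop 2.x one; the value shown is only illustrative, not my actual setting):

```shell
# mapreduce.input.lineinputformat.linespermap controls how many input
# lines each map task receives; raising it yields fewer, larger map
# tasks. 100000 is an illustrative value only.
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.input.lineinputformat.linespermap=100000 \
  /user/me/input /user/me/output
```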

When submitting the job from the terminal, I set only the following property:

-D mapreduce.reduce.shuffle.input.buffer.percent=0.4, because I was getting the error message "error in shuffle in fetcher".
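So the full command looks like this (jar, class, and paths anonymized as before):

```shell
# Lowering mapreduce.reduce.shuffle.input.buffer.percent from its
# default of 0.70 to 0.40 shrinks the fraction of reducer heap used to
# buffer fetched map output during the shuffle; this made the
# "error in shuffle in fetcher" failure go away for me.
hadoop jar my-job.jar com.example.MyJob \
  -D mapreduce.reduce.shuffle.input.buffer.percent=0.4 \
  /user/me/input /user/me/output
```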

With this setup, my job takes around 25 minutes to complete. Am I missing any performance tuning parameters here? If there are parameters I should try, what are they? I need your advice. Thanks in advance.