We are working on a problem that takes a
10.2 GB data set as its input. We have written a MapReduce program
that analyzes this data set: the map tasks produce 33 GB of
intermediate data and the reduce tasks generate 25 GB of output data.
We have used NLineInputFormat as the input format. We are running this MapReduce job on a 24-node Hadoop 2 cluster.
The system configuration of each node is as follows:
i7 processor, 8 cores, 8 GB RAM, 360 GB hard disk, 1 Gbps network
interface card, 1 Gbps switch. But we are not using a dedicated switch for our
cluster.
As we have
24*8=192 cores available, we are using all 192 cores for our map
tasks. That is, we have divided the data set into 192 splits, so 192 map
tasks are created and all 192 cores are used. We have set the number of
reduce tasks to 170.
Apart from setting the number of map and reduce tasks, we have not touched any of the Hadoop parameters.
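To give a sense of the load on each task, the figures above can be turned into rough per-task sizes. This is only a back-of-envelope sketch in Java: the class and method names are made up for illustration, it assumes 1 GB = 1024 MB, and it assumes the data is spread evenly across tasks, which may not be true for our job.

```java
// Back-of-envelope sizing from the figures in this post.
// TaskSizing and perTaskMb are hypothetical names, for illustration only.
class TaskSizing {
    // Assumption: "GB" in the post means 1024 MB.
    static final double MB_PER_GB = 1024.0;

    // Average MB handled by each task if the data were spread evenly.
    static double perTaskMb(double totalGb, int tasks) {
        return totalGb * MB_PER_GB / tasks;
    }

    public static void main(String[] args) {
        int nodes = 24;
        int coresPerNode = 8;
        int mapTasks = nodes * coresPerNode; // one map task per core
        int reduceTasks = 170;

        System.out.printf("map tasks: %d%n", mapTasks);
        System.out.printf("input per map task: %.1f MB%n",
                perTaskMb(10.2, mapTasks));
        System.out.printf("intermediate data per map task: %.1f MB%n",
                perTaskMb(33.0, mapTasks));
        System.out.printf("shuffle input per reduce task: %.1f MB%n",
                perTaskMb(33.0, reduceTasks));
        System.out.printf("output per reduce task: %.1f MB%n",
                perTaskMb(25.0, reduceTasks));
    }
}
```

Under these assumptions each map task reads about 54 MB and writes about 176 MB of intermediate data, and each reduce task receives about 199 MB and writes about 151 MB.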
Currently our job takes 9 minutes 30 seconds to run.
We would like to know whether we are missing any Hadoop
parameter settings that would allow our job's running time to be
improved.
It would be really helpful if you could help us improve the performance of this job.
Thanks in advance.