We are working on a problem that takes a 10.2 GB data set as input. We have written a MapReduce program that analyzes this data set; the map tasks produce 33 GB of intermediate data and the reduce tasks generate 25 GB of output. We use NLineInputFormat as the input format, and we run this MapReduce job on a 24-node Hadoop 2 cluster.
Each node has the following configuration:
i7 processor with 8 cores, 8 GB RAM, 360 GB hard disk, and a 1 Gbps network interface card connected to a 1 Gbps switch. However, we are not using a dedicated switch for our cluster.
Since we have 24 * 8 = 192 cores available, we use all 192 cores for our map tasks. That is, we divided the data set into 192 splits, so 192 map tasks are created and all 192 cores are used. We have set the number of reduce tasks to 170.
Apart from setting the number of map and reduce tasks, we have not touched any of the Hadoop parameters.
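For context, a few shuffle-related parameters are commonly tuned for jobs with large intermediate output like this one. These property names are standard in Hadoop 2 (mapred-site.xml or per-job settings), but the values below are illustrative assumptions, not measured recommendations:

```xml
<!-- Illustrative starting points only; tune for your own hardware and workload. -->
<configuration>
  <!-- Larger in-memory sort buffer (default 100 MB) reduces map-side spills
       of the 33 GB of intermediate data to disk. -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Start the reducers' shuffle phase once half the maps are done,
       so copying overlaps with remaining map work. -->
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.5</value>
  </property>
  <!-- More parallel fetch threads per reducer (default 5) can help
       when many map outputs must be pulled over a 1 Gbps network. -->
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
  </property>
</configuration>
```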
Currently our job takes 9 minutes 30 seconds to run.
We would like to know whether we are missing any Hadoop parameter settings that could improve the job's running time. It would be really helpful if you could suggest ways to improve the performance of this job.
One approach I can think of is increasing the number of splits of the file. Also check whether the split size handled by each mapper matches the HDFS block size, so that more map tasks can run in parallel on different blocks. Check whether the data is distributed evenly across the nodes. If you can compress the source files, try LZO compression on them; this reduces the amount of I/O, which in turn determines how fast the MapReduce job runs. These are high-level checks you can perform.
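Since most of this job's data volume is the 33 GB of intermediate map output crossing a non-dedicated 1 Gbps network, compressing the map output is often the first thing to try. The property names below are standard Hadoop 2 settings; note that LzoCodec comes from the separate hadoop-lzo library (an assumption that it is installed on your cluster), whereas SnappyCodec ships with Hadoop itself:

```xml
<!-- Sketch: compress intermediate map output before the shuffle. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <!-- Requires the hadoop-lzo library; use
       org.apache.hadoop.io.compress.SnappyCodec if LZO is not installed. -->
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```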
The number of splits is already 192. Should I still increase it above 192? My splits are not of equal size, because the length of each line in my data set is not fixed. I used the lines-per-map property so that every map task gets the same number of lines to process, but since line lengths vary, the split size is not the same across mappers.
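For reference, in Hadoop 2 the NLineInputFormat lines-per-split count is controlled by the property mapreduce.input.lineinputformat.linespermap, which can be set in the driver, on the command line, or in the job configuration. The value below is purely illustrative (it would be roughly total lines divided by 192 in your case):

```xml
<property>
  <name>mapreduce.input.lineinputformat.linespermap</name>
  <!-- Hypothetical value: choose total input lines / desired map count. -->
  <value>50000</value>
</property>
```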
Is it good or bad to utilize all the cores available in the cluster for map tasks in this situation? Also, how about using Tachyon for this problem? Will I see any performance improvement if I use Tachyon? Thanks in advance for your replies.