Created on 08-25-201602:08 PM - edited 08-17-201910:37 AM
In the recent weeks I have tested Hadoop on various IaaS providers in hope to find additional performance insights. BigStep blew away my expectation in terms of Hadoop performance on IaaS. I wanted to take the testing a step further. Lets quantify performance measures by adding nodes to a cluster. Even for a small 1TB data set, would 5 nodes perform far greater then 3? I have heard a few times when it comes to small datasets adding more nodes may not have a impact. So this led me to test a 3 node cluster vs a 5 node cluster using 1TB dataset. Does the extra 2 nodes increase performance with processing and IO? Lets find out.
Started the testing with dfsio which is a distributed IO benchmark tool. Here are results:
From 3 to 5 data nodes IO read performance increased approx. 36%
From 3 to 5 data nodes IO write performance increased approx. 49%
With 2 additional data nodes a performance IO throughput of 49%! Wish I had more boxes to play with. Can't image where this would take the measures!
Now lets compare running TeraGen on 3 and 5 data nodes. TeraGen is a map/reduce program to generate the data.
From 3 to 5 data nodes TeraGen performance increased approx. 65%
Now lets compare running TeraSort on 3 and 5 data nodes. TeraSort samples the input data and uses map/reduce to sort the data into a total order.
From 3 to 5 data nodes TeraSort performance increased approx. 54%.
Now lets compare running TeraValidate on 3 and 5 data nodes. TeraValidate is a map/reduce program that validates the output is sorted.
From 3 to 5 data nodes TeraValidate performance increased approx. 64%.
DFSIO read/write, TeraGen,TeraSort, andTeraValidate test all experienced minimum 50% performance increase. So the theory of throwing more nodes at hadoop increases performance seems to be justified. And yes that is with a small dataset. You do have to consider your use case before using a blanket statement like that. However the physics and software engineering principles of Hadoop support the idea of horizontal scalability and therefore the test make complete sense to me. Hope this provided some insights in terms of # of nodes to possible performance increase expectations.