How to calculate the Hadoop cluster size?

How do we calculate the Hadoop cluster size for our project?



The following formula is used to estimate the storage H required for a Hadoop cluster:

H = c*R*S/(1-i)

Where:
c = average compression ratio. This depends on the type of compression used and the size of the data. When no compression is used, c = 1.
R = replication factor. It is set to 3 by default and is typically left at 3 in production clusters.
S = size of the data to be moved to Hadoop. This can be a combination of historical data and incremental data. For example, the incremental data may arrive daily and be projected over a period of time (3 years, for example).
i = intermediate factor, the fraction of storage set aside as Hadoop's working space for the intermediate results of the Map phase. It is usually 1/3 or 1/4.
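
Since S combines historical data with incremental data projected over time, it can help to compute it explicitly. This is only a minimal sketch: the function name, the terabyte units, the 365-day year, and the input figures are all assumptions made for illustration, not values from the thread.

```python
def projected_data_size_tb(historical_tb, daily_increment_tb, years, days_per_year=365):
    """Estimate S: historical data plus daily increments over the planning horizon.

    Assumes a constant daily increment (no growth rate), which is a simplification.
    """
    return historical_tb + daily_increment_tb * days_per_year * years


# Hypothetical inputs: 100 TB of history, 0.5 TB arriving per day, 3-year horizon.
s_tb = projected_data_size_tb(historical_tb=100, daily_increment_tb=0.5, years=3)
print(f"S = {s_tb} TB")  # S = 647.5 TB
```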
Example: With no compression (c = 1), a replication factor of 3, and an intermediate factor of 0.25 = 1/4:

H = 1*3*S/(1-1/4) = 3S/(3/4) = 4S

With the assumptions above, the Hadoop storage is estimated to be 4 times the initial data size.
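
The same estimate can be written as a small function, which makes it easy to try other values of c, R and i. This is a rough sketch of H = c*R*S/(1-i) as implied by the example above; the function name, the defaults, and the input checks are assumptions added for illustration.

```python
def hadoop_storage(s, c=1.0, r=3, i=0.25):
    """Estimate raw Hadoop storage H = c * R * S / (1 - i).

    s: size of the data to be moved to Hadoop (any unit; the result uses the same unit)
    c: average compression ratio (1.0 means no compression)
    r: replication factor
    i: intermediate factor, the fraction reserved for intermediate Map results
    """
    if not 0 <= i < 1:
        raise ValueError("intermediate factor i must be in [0, 1)")
    if s < 0 or c <= 0 or r < 1:
        raise ValueError("s must be >= 0, c must be > 0, r must be >= 1")
    return c * r * s / (1 - i)


# Reproduces the example: c = 1, R = 3, i = 1/4 gives H = 4S.
print(hadoop_storage(1.0))  # 4.0
```

Passing the projected data size as s (in TB, for instance) returns the estimated cluster storage in the same unit.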