Created 04-15-2020 08:34 AM
Hi Team,
We are looking to set up a cluster to manage 300 TB of HDFS data, with the data growing by 5% per week.
How can we calculate the capacity needed to meet this requirement?
How much HDFS and non-HDFS space should each DataNode have?
How much space should be allocated to the edge node?
Also, what resources should be allocated to each DataNode?
It would be great if you could post the steps, and the basis on which you calculate the disk size and resource allocation.
Thanks
Created on 04-16-2020 10:02 AM - edited 04-16-2020 10:06 AM
Hi @Ashik ,
A good rule of thumb is that the HDFS storage required is 4x the raw data volume. HDFS replicates each block three times by default, and the system needs some headroom on top of that, which is why the factor is 4x rather than 3x. This formula is only a rough guide and can change, for example if you compress the data on HDFS. You also need to factor in any other data processing you might do. For example, if you build data marts on top of the raw data, that is additional data volume, and you will also have organic data growth over time.
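As a rough illustration of that rule of thumb, here is a small Python sketch. It assumes the 300 TB in the question is the raw (unreplicated) volume and that the 5% weekly growth compounds; both are my reading of the question, not something stated. The function names and the 48 TB of usable HDFS disk per DataNode are hypothetical values chosen only for the example, so substitute your own hardware figures.

```python
import math

def raw_volume_after(initial_tb, weekly_growth, weeks):
    """Raw data volume after `weeks` of compounding weekly growth (assumption)."""
    return initial_tb * (1 + weekly_growth) ** weeks

def hdfs_capacity_needed(raw_tb, multiplier=4.0):
    """Rule of thumb from above: 3x replication plus headroom, so about 4x raw."""
    return raw_tb * multiplier

def datanodes_needed(hdfs_tb, usable_tb_per_node):
    """Number of DataNodes needed, rounded up to whole nodes."""
    return math.ceil(hdfs_tb / usable_tb_per_node)

# Hypothetical per-node figure: 48 TB of disk dedicated to HDFS on each DataNode.
USABLE_TB_PER_NODE = 48

for weeks in (0, 4, 12, 26):
    raw = raw_volume_after(300, 0.05, weeks)
    hdfs = hdfs_capacity_needed(raw)
    nodes = datanodes_needed(hdfs, USABLE_TB_PER_NODE)
    print(f"week {weeks:2d}: raw {raw:8.0f} TB, HDFS {hdfs:8.0f} TB, DataNodes {nodes}")
```

On day one this gives 4 x 300 TB = 1200 TB of HDFS capacity. Note how quickly a compounding 5% per week grows, so it is worth confirming whether the growth really compounds or is 5% of the initial volume each week before sizing for a planning horizon.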
Regarding cluster topology, there are some guidelines here:
https://docs.cloudera.com/documentation/enterprise/5/latest/topics/cm_ig_host_allocations.html
Regarding best practice for cluster sizing, those are here:
Regarding hardware recommendations, those are given here:
I would recommend that you have:
Regards,
Steve
Created 04-16-2020 10:05 AM