
How to calculate cluster size as well as no of node requirement



New Contributor

Hi Team,

We are planning a cluster to manage 300 TB of HDFS data, with the data growing by 5% per week.

How can we calculate the cluster size needed to meet this requirement?

How much HDFS and non-HDFS space should each data node have?

How much space should be allocated to the edge node?

What resources should be allocated to each data node?

It would be great if you could post the steps, including the basis on which you calculate the disk size and resource allocation.

Thanks

 


Re: How to calculate cluster size as well as no of node requirement

Expert Contributor

Hi @Ashik ,

 

A good rule of thumb for the amount of HDFS storage required is 4x the raw data volume. HDFS replicates data three times by default, and the system needs some headroom, which is why the factor is 4x rather than 3x. This formula is only a rough guide and can change, for example, if you compress the data on HDFS. You also need to factor in any other data processing you might do. For example, if you build data marts on top of the raw data, that is additional data volume, and you will also have organic data growth over time.
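As a rough illustration of the 4x rule applied to the numbers in the question (300 TB, 5% weekly growth), here is a minimal Python sketch. The function names are hypothetical, and treating the 5% weekly increase as compounding is an assumption; the question does not say whether growth is compound or linear.

```python
def projected_data_tb(initial_tb: float, weekly_growth: float, weeks: int) -> float:
    """Raw data volume after `weeks` weeks.

    Assumption: growth compounds weekly (each week adds `weekly_growth`
    of the volume at the start of that week).
    """
    return initial_tb * (1 + weekly_growth) ** weeks


def required_hdfs_capacity_tb(raw_data_tb: float, multiplier: float = 4.0) -> float:
    """HDFS capacity from the rule of thumb: 3x replication plus headroom = 4x."""
    return raw_data_tb * multiplier


if __name__ == "__main__":
    raw_today = 300.0  # TB, from the question
    for weeks in (0, 13, 26, 52):
        raw = projected_data_tb(raw_today, 0.05, weeks)
        capacity = required_hdfs_capacity_tb(raw)
        print(f"week {weeks:>2}: raw ~{raw:,.0f} TB, HDFS capacity ~{capacity:,.0f} TB")
```

Compression, data marts, and other derived data would change the multiplier, so treat the result as a starting point rather than a final figure.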

 

Regarding cluster topology, there are some guidelines here:

 

https://docs.cloudera.com/documentation/enterprise/5/latest/topics/cm_ig_host_allocations.html

 

Regarding best practices for cluster sizing, see:

 

https://docs.cloudera.com/documentation/other/reference-architecture/topics/ra_bare_metal_deployment...

 

Regarding hardware recommendations, those are given here:

 

https://docs.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide....

 

I would recommend that you have:

 

  • 3 x Master Nodes (for high availability)
  • N x Data Nodes (where N depends on the storage capacity of each data node). You need a minimum of N = 3 for triple replication of data, and I would recommend N >= 5 for a production system. The more data nodes you have, and the more disks in each data node, the higher the performance of your system will be, because of the distributed throughput and higher disk I/O.
  • 1 x Utility Node / Management Node
  • 1 x Gateway Node
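
To show how N might be derived from storage capacity, here is a small Python sketch. The function name and the 48 TB of usable HDFS space per data node are assumptions made for illustration only; use the real usable capacity of your hardware. The floors of 3 and 5 nodes come from the recommendation above.

```python
import math

MIN_NODES_FOR_REPLICATION = 3   # needed for triple replication
RECOMMENDED_MIN_PRODUCTION = 5  # suggested floor for a production system


def data_nodes_needed(required_capacity_tb: float,
                      usable_tb_per_node: float,
                      production: bool = True) -> int:
    """Number of data nodes needed to hold `required_capacity_tb`.

    `usable_tb_per_node` is the HDFS space available on one data node
    (a hypothetical input; it depends on disk count and size).
    """
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    by_capacity = math.ceil(required_capacity_tb / usable_tb_per_node)
    floor = RECOMMENDED_MIN_PRODUCTION if production else MIN_NODES_FOR_REPLICATION
    return max(by_capacity, floor)


if __name__ == "__main__":
    # 1200 TB is 4 x 300 TB from the rule of thumb; 48 TB per node is assumed.
    print(data_nodes_needed(1200, 48))
```

This only sizes for storage. CPU, memory, and disk I/O requirements may call for more nodes than the capacity calculation alone suggests.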

Regards,

Steve

 

 



Re: How to calculate cluster size as well as no of node requirement

New Contributor
Thank you, Steve, for your response.

The information you have provided is very helpful.

Thanks
Ashik

