How to calculate cluster size and the number of nodes required
Labels: HDFS
Created ‎04-15-2020 08:34 AM
Hi Team,
We are looking to set up a cluster to manage 300 TB of HDFS data, with the data growing by 5% per week.
How can we calculate the cluster size needed to meet this requirement?
How much HDFS and non-HDFS space should each data node have?
How much space should be allocated to the edge node?
What resources should be allocated to each data node?
It would be great if you could post the steps and the basis on which you calculate the disk size and resource allocation.
Thanks
Created on ‎04-16-2020 10:02 AM - edited ‎04-16-2020 10:06 AM
Hi @Ashik ,
A good rule of thumb for the amount of HDFS storage required is 4x the raw data volume. HDFS replicates data three times by default, and the system needs some headroom on top of that, which is why the factor is 4x rather than 3x. This formula is only a rough guide and can change, for example if you compress the data on HDFS. You also need to factor in any other data processing you might do. For example, if you build data marts on top of the raw data, that is additional data volume, and you will also have organic data growth over time.
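As a rough illustration of the 4x rule combined with the growth figure from the question, here is a minimal Python sketch. The function name and parameters are hypothetical, and it assumes the 5% weekly growth compounds and that you are planning one year (52 weeks) ahead; adjust both to match how your data actually grows.

```python
def required_hdfs_tb(raw_tb, weekly_growth=0.05, weeks=52, multiplier=4.0):
    """Estimate the HDFS capacity (in TB) needed after `weeks` of growth.

    Assumptions (not from the original answer):
      - growth compounds weekly at `weekly_growth`
      - `multiplier` is the 4x rule of thumb (3x replication + headroom)
    """
    # Project the raw data volume forward by compounding the weekly growth.
    projected_raw_tb = raw_tb * (1 + weekly_growth) ** weeks
    # Apply the rule-of-thumb multiplier to get the HDFS capacity to provision.
    return projected_raw_tb * multiplier


if __name__ == "__main__":
    # Capacity needed today for 300 TB of raw data (no growth applied).
    print(required_hdfs_tb(300, weeks=0))
    # Capacity needed after one year of compounding 5% weekly growth.
    print(round(required_hdfs_tb(300)))
```

Compounding 5% per week grows the data very quickly, so it is worth confirming whether the 5% is relative to the current volume each week or to the initial 300 TB before sizing on this basis.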
Regarding cluster topology, there are some guidelines here:
https://docs.cloudera.com/documentation/enterprise/5/latest/topics/cm_ig_host_allocations.html
Regarding best practice for cluster sizing, those are here:
Regarding hardware recommendations, those are given here:
I would recommend that you have:
- 3 x Master Nodes (for high availability)
- N x Data Nodes (where N depends on the storage capacity of each data node). You need a minimum of N = 3 for triple replication of data, and I would recommend N >= 5 for a production system. The more data nodes you have, and the more disks there are in each of them, the higher the performance of your system will be, because throughput is distributed across nodes and aggregate disk I/O is higher.
- 1 x Utility Node / Management Node
- 1 x Gateway Node
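To make "N is based on the storage capacity of the data nodes" concrete, here is a small Python sketch. The function name, the disk layout in the example, and the 25% non-HDFS reservation are all assumptions for illustration only, not figures from this thread; the floor of 5 nodes follows the production recommendation above.

```python
import math


def data_nodes_needed(required_hdfs_tb, disks_per_node, disk_tb,
                      non_hdfs_fraction=0.25, min_nodes=5):
    """Estimate how many data nodes are needed for a given HDFS capacity.

    Assumptions (hypothetical, tune for your hardware):
      - every node has `disks_per_node` data disks of `disk_tb` TB each
      - `non_hdfs_fraction` of each node's disk is kept for non-HDFS use
        (OS, logs, shuffle/temporary data)
      - `min_nodes` is the recommended floor for a production cluster
    """
    # Disk space on one node that is actually available to HDFS.
    usable_per_node_tb = disks_per_node * disk_tb * (1 - non_hdfs_fraction)
    # Round up, and never go below the production minimum.
    return max(min_nodes, math.ceil(required_hdfs_tb / usable_per_node_tb))


if __name__ == "__main__":
    # 1200 TB of HDFS capacity (4 x 300 TB) on nodes with 12 x 4 TB disks.
    print(data_nodes_needed(1200, disks_per_node=12, disk_tb=4))
```

Denser nodes reduce N, but fewer, larger nodes also mean less parallel disk I/O and more data to re-replicate when a node fails, so the node count is a trade-off rather than a pure capacity calculation.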
Regards,
Steve
Created ‎04-16-2020 10:05 AM
The information you have provided is very helpful.
Thanks
Ashik
