Support Questions

How to plan a Hortonworks Hadoop cluster when the application runs no MapReduce jobs and will load 250 GB of data into HBase

Expert Contributor

Hi Team,

Can someone kindly advise how to plan a Hortonworks Hadoop cluster when the application does not run any MapReduce jobs and we will be loading 250 GB of data into HBase?

I understand we need to take care of the points below:

  • How to plan the storage?
  • Should we use plain disks (JBOD) or RAID for the NameNode and DataNodes?
  • How to plan the CPU?
  • How to plan the memory?
  • How to plan the network bandwidth?
1 ACCEPTED SOLUTION

Super Guru

@ripunjay godhani

It depends on how much data you plan to write to and read from disk; compare that with what the disks you use can actually deliver, and double it just to be safe. That part is driven by your design. For example, a SATA 3 Gb/s interface has a theoretical throughput of about 300 MB/s, but a single spinning disk is unlikely to sustain anything close to that under mixed cluster I/O; counting on roughly 30 MB/s of sustained throughput per drive is more realistic. If you need about 100 MB/s at peak and double that for safety, you need about 200 MB/s, which works out to roughly 6-7 drives. This is very simplistic, because in a cluster there is a lot more going on due to block replication. If your network is 1 Gbps, the network will cap you at roughly 100-125 MB/s, but you still need each server in the cluster to provide additional IOPS for various local operations.
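The spindle-count arithmetic above can be sketched as a small helper. The 30 MB/s per-drive figure and the 2x safety factor are the assumptions from this reply, not measured values for any particular hardware:

```python
import math

def drives_needed(peak_mb_s: float,
                  safety_factor: float = 2.0,
                  per_drive_mb_s: float = 30.0) -> int:
    """Estimate the number of spindles needed to sustain a peak throughput.

    per_drive_mb_s is an assumed realistic sustained rate for a spinning
    SATA disk under mixed cluster I/O, not the interface maximum.
    """
    required = peak_mb_s * safety_factor          # double the peak for safety
    return math.ceil(required / per_drive_mb_s)   # round up to whole drives

# 100 MB/s peak, doubled to 200 MB/s, at ~30 MB/s per drive:
print(drives_needed(100))  # 7
```

Remember this only covers application throughput; replication traffic and local housekeeping I/O come on top of it.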

Vote for the answer and accept it as the best answer using the up arrow next to the response.

View solution in original post

4 REPLIES

Super Guru

@ripunjay godhani

Just knowing the amount of data (250 GB) is not enough for capacity planning. You also need to know the intake and output rates and the data-processing requirements. Workloads, concurrency, expected response time, resiliency, and availability are important factors as well; those determine the CPU, RAM, network, and disk I/O you need. Account for how much data will be read from disk versus from memory, based on your SLA and design. This is all the art of estimation. For the NameNode it is best to use reliable hardware, and RAID is a good option there.

Anyhow, it is usually good to begin small and gain experience by measuring actual workloads during a pilot project. We recommend starting with a relatively small pilot cluster provisioned for a "balanced" workload.

For pilot deployments, you can start with 1U/machine and use the following recommendations:

Two quad-core CPUs | 12 GB to 24 GB memory | Four to six disk drives of 2 terabyte (TB) capacity each. Even though you have only 250 GB, that is multiplied by a replication factor of 3, you need temporary space, and you need room to grow. Multiple spindles will also give you enough IOPS for disk operations. Don't think only in terms of "I need 250 GB of storage".
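The storage multipliers above can be put into one back-of-envelope calculation. The 25% temporary-space overhead and the 2x growth headroom are illustrative assumptions, not HDP recommendations:

```python
def raw_storage_needed_gb(data_gb: float,
                          replication: int = 3,
                          temp_overhead: float = 0.25,  # scratch/compaction space (assumed)
                          growth_factor: float = 2.0) -> float:
    """Estimate raw cluster storage for a given amount of user data."""
    replicated = data_gb * replication            # HDFS default replication of 3
    with_temp = replicated * (1 + temp_overhead)  # leave room for temporary files
    return with_temp * growth_factor              # headroom for data growth

# 250 GB -> 750 GB replicated -> ~937.5 GB with temp space -> 1875 GB total
print(raw_storage_needed_gb(250))  # 1875.0
```

So even a "250 GB" requirement already points at roughly 2 TB of raw disk across the cluster, which the four-to-six 2 TB drives per node easily cover while also providing spindles for IOPS.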

The minimum network requirement is 1 GigE all-to-all, which can easily be achieved by connecting all of your nodes to a Gigabit Ethernet switch. If you want to keep a spare socket free for adding more CPUs in the future, you can also consider using either a six-core or an eight-core CPU.

For more check the following references:

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_cluster-planning-guide/content/balanced-...

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_clust_capaci...

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_hbase_io.htm...

If you find this response helpful, accept it as a best answer.

Expert Contributor

The answer looks good, thanks! Can you please advise how to determine the disk I/O needs of a cluster?

Which factors should be considered in the disk I/O calculation?

Super Guru

@ripunjay godhani

It depends on how much data you plan to write to and read from disk; compare that with what the disks you use can actually deliver, and double it just to be safe. That part is driven by your design. For example, a SATA 3 Gb/s interface has a theoretical throughput of about 300 MB/s, but a single spinning disk is unlikely to sustain anything close to that under mixed cluster I/O; counting on roughly 30 MB/s of sustained throughput per drive is more realistic. If you need about 100 MB/s at peak and double that for safety, you need about 200 MB/s, which works out to roughly 6-7 drives. This is very simplistic, because in a cluster there is a lot more going on due to block replication. If your network is 1 Gbps, the network will cap you at roughly 100-125 MB/s, but you still need each server in the cluster to provide additional IOPS for various local operations.

Vote for the answer and accept it as the best answer using the up arrow next to the response.

Expert Contributor

Thanks a lot