Created on 10-21-2016 12:06 AM - edited 09-16-2022 01:36 AM
There's no simple rule of thumb for this. It's as much an art as a science, because it depends on your workloads and how chatty they are with your current ZooKeeper (ZK) servers.
One way to look at this is:
Warning: the more ZK nodes you have, the slower ZooKeeper becomes for writes, because every write has to be acknowledged by a majority of the servers before it is committed.
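As a rough illustration of that trade-off, here is a minimal Python sketch of the majority arithmetic. The function names are made up for this example and are not part of any ZooKeeper API. It shows how many acknowledgements a write needs and how many server failures the ensemble survives at each size.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble (illustrative helper)."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can be lost while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 servers raises the quorum from 2 to 3 without tolerating any extra failure, which is why odd ensemble sizes are the usual choice.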
ZooKeeper is a master service, so it can be collocated with other master services. Ideally, you would not want to collocate it with an HA service. It is quite light on memory and CPU, but since it is disk intensive, don't collocate it with disk-intensive services like Kafka or HDFS.
In general, ZooKeeper doesn't require huge drives because it only stores metadata for the services that use it. It is common to allocate 100 GB to 250 GB for the ZooKeeper data directory and logs, which is fine for many cluster deployments. Moreover, it is recommended to configure an automatic purge policy for the snapshot and log directories so that they don't end up filling all the local storage.
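A minimal sketch of such a purge policy in zoo.cfg is shown below. The two property names are standard ZooKeeper settings; the values are only illustrative and should be tuned to your own retention needs.

```
# zoo.cfg (fragment), illustrative values
# Keep the 3 most recent snapshots and their matching transaction logs.
autopurge.snapRetainCount=3
# Run the purge task every 24 hours; 0 (the default) leaves purging disabled.
autopurge.purgeInterval=24
```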
At Yahoo!, ZooKeeper is usually deployed on DEDICATED RHEL boxes, with dual-core processors, 2GB of RAM, and 80GB IDE hard drives.
For your Kafka/Storm cluster, you could consider deploying ZK on DEDICATED physical hardware (not virtual). The driving force for physical hardware, or at least for a dedicated disk, is the transaction log and the high-throughput nature of Kafka and Storm.
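If you go the dedicated-disk route, one way to do it is to point the transaction log at its own device with `dataLogDir`, as in the sketch below. Both paths are hypothetical placeholders; substitute the mount points from your own hosts.

```
# zoo.cfg (fragment); both paths are hypothetical
# Snapshots stay on the regular data disk.
dataDir=/var/lib/zookeeper/data
# Transaction log goes to a separate, dedicated device.
dataLogDir=/zk-txlog/zookeeper
```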
Since Kafka is usually used with Storm, have a separate ZooKeeper cluster for Kafka and Storm. If Kafka and Storm are sharing it, please make sure that you don't put the ZooKeeper cluster on the Kafka nodes. Put ZooKeeper on the Storm nodes instead.
https://community.hortonworks.com/questions/2498/best-practices-for-zookeeper-placement.html
https://community.hortonworks.com/questions/55868/zookeeper-on-even-master-nodes.html
Apache ZooKeeper Essentials by Saurav Haloi Published by Packt Publishing, 2015