There's no simple rule of thumb for this; it's as much an art as it is a science, because it depends on your workloads and how chatty they are with your current ZooKeeper ensemble.
One way to look at it: with 3 ZooKeeper nodes you can afford to lose one; with 5 you can afford to lose two.
If your IT team aggressively applies security patches and other upgrades (firmware, kernel, Java, other packages used by Hadoop tools) and takes nodes down to do the job, then during those upgrades a 3-node ensemble runs on only two nodes. If you are unlucky and one of those two goes down, the ensemble loses quorum and your whole cluster goes down with it. So, in this case, 5 nodes are better.
Warning: the more ZooKeeper nodes you have, the slower writes become, because every write must be acknowledged by a quorum of servers.
ZooKeeper is a master service, and as such it can be co-located with other master services. Ideally, you would not co-locate it with another HA-critical service. It is quite light on memory and CPU requirements, but since it is disk-intensive, don't co-locate it with other disk-intensive services like Kafka or HDFS.
In general, ZooKeeper doesn't require huge drives, because it only stores metadata for the services it coordinates. It is common to use 100 GB to 250 GB for the ZooKeeper data directory and logs, which is fine for many cluster deployments. Moreover, it is recommended to configure an automatic purging policy for the snapshot and log directories so that ZooKeeper doesn't end up filling all the local storage.
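Automatic purging is controlled by two settings in zoo.cfg; the values below are illustrative starting points, not recommendations:

```
# zoo.cfg -- automatic purging of old snapshots and transaction logs
autopurge.snapRetainCount=5   # keep the 5 most recent snapshots (minimum is 3)
autopurge.purgeInterval=24    # run the purge task every 24 hours (0 disables purging)
```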
Dedicated or Shared?
At Yahoo!, ZooKeeper is usually deployed on DEDICATED RHEL boxes with dual-core processors, 2GB of RAM, and 80GB IDE hard drives. If you are running a Kafka/Storm cluster, you could consider deploying ZK on DEDICATED physical hardware (not virtual). The driving force for physical hardware, or at least for a dedicated disk, is the transaction log and the high-throughput nature of Kafka and Storm.
Since Kafka is usually used with Storm, have a separate ZooKeeper cluster for Kafka and Storm. If Kafka and Storm are sharing one, then please make sure that you don't put the ZooKeeper cluster on the Kafka nodes; put the ZooKeeper on the Storm nodes instead.
Rather than growing a single ZooKeeper ensemble ever larger, it is better to split certain services out to their own ensembles when they put pressure on an otherwise fairly quiet ZK cluster. It is good practice to have a separate set of ZKs per workload: one quorum for Kafka, one quorum for Storm, one quorum for the rest (YARN, HBase, Hive, HDFS), and possibly a dedicated ZooKeeper ensemble for HBase. The challenge is that this requires more hardware and more administration, but it pays off.
Be careful where you put that transaction log. The most performance-critical part of ZooKeeper is the transaction log. ZooKeeper must sync transactions to media before it returns a response. A dedicated transaction log device is key to consistent good performance. Putting the log on a busy device will adversely impact performance. If you only have one storage device, put trace files on NFS and increase the snapshotCount; it doesn't eliminate the problem, but it can mitigate it. ZooKeeper's transaction log must be on a dedicated device. A dedicated partition is not enough. ZooKeeper writes the log sequentially, without seeking. Sharing your log device with other processes can cause seeks and contention, which in turn can cause multi-second delays.
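In zoo.cfg, the snapshot and transaction-log locations are configured separately, so the log can be pointed at its own device; the paths below are placeholders for whatever mount points you use:

```
# zoo.cfg -- keep snapshots and the transaction log on separate devices
dataDir=/var/lib/zookeeper        # snapshots (and the myid file)
dataLogDir=/zk-txnlog/zookeeper   # transaction log on a dedicated disk
```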
Do not put ZooKeeper in a situation that can cause a swap. In order for ZooKeeper to function with any sort of timeliness, it simply cannot be allowed to swap. Remember, in ZooKeeper, everything is ordered, so if one request hits the disk, all other queued requests hit the disk. Going to disk unnecessarily will almost certainly degrade your performance unacceptably. Therefore, make certain that the maximum heap size given to ZooKeeper is not bigger than the amount of real memory available to ZooKeeper. Set your Java max heap size correctly. To avoid swapping, try to set the heapsize to the amount of physical memory you have, minus the amount needed by the OS and cache. The best way to determine an optimal heap size for your configurations is to run load tests. If for some reason you can't, be conservative in your estimates and choose a number well below the limit that would cause your machine to swap. For example, on a 4G machine, a 3G heap is a conservative estimate to start with.
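One way to pin the heap is via conf/java.env, which zkServer.sh sources at startup. The 3G figure below is just the conservative example from above, assuming a 4G machine:

```
# conf/java.env -- cap the ZooKeeper JVM heap below physical RAM
# Example: a 4G machine, leaving ~1G for the OS and page cache
export JVMFLAGS="-Xms3g -Xmx3g"
```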
The ZooKeeper data directory contains the snapshot and transactional log files. It is a good practice to periodically clean up the directory if the auto-purge option is not enabled. Also, an administrator might want to keep a backup of these files, depending on the application needs. However, since ZooKeeper is a replicated service, we need to back up the data of only one of the servers in the ensemble.
ZooKeeper uses Apache log4j as its logging infrastructure. As the logfiles grow, it is recommended that you set up automatic rollover of the ZooKeeper logs using log4j's built-in rolling-file support.
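A rollover setup along the lines of the sample conf/log4j.properties shipped with ZooKeeper looks like this; the file path and size limits are illustrative:

```
# conf/log4j.properties -- size-based rollover for ZooKeeper logs
zookeeper.root.logger=INFO, ROLLINGFILE
log4j.rootLogger=${zookeeper.root.logger}
log4j.appender.ROLLINGFILE=org.apache.log4j.RollingFileAppender
log4j.appender.ROLLINGFILE.File=/var/log/zookeeper/zookeeper.log
log4j.appender.ROLLINGFILE.MaxFileSize=10MB
log4j.appender.ROLLINGFILE.MaxBackupIndex=10
log4j.appender.ROLLINGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.ROLLINGFILE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
```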
The list of ZooKeeper servers used by clients in their connection strings must match the list of ZooKeeper servers that each ZooKeeper server knows about; strange behaviors might occur if the lists don't match. Likewise, the server list in each ZooKeeper server's configuration file should be consistent across all members of the ensemble.
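Concretely, every ensemble member's zoo.cfg should carry the same server.N entries, and clients should use a matching connection string; the hostnames below are hypothetical:

```
# zoo.cfg -- identical server list on all three ensemble members
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# matching client connection string:
#   zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```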
The ZooKeeper transaction log must be kept on a dedicated device. This is very important for getting the best performance out of ZooKeeper.
The Java heap size should be chosen with care. Swapping should never be allowed to happen on a ZooKeeper server, and it helps if the ZooKeeper servers have a generous amount of RAM.
System monitoring tools such as vmstat can be used to monitor virtual memory statistics and decide on the optimal size of memory needed, depending on the need of the application. In any case, swapping should be avoided.