I want to enable HDFS NameNode HA in my cluster; however, it needs at least 3 ZooKeeper servers. How should I structure my cluster? Is it a good idea to have 4 master nodes and install a ZooKeeper server on each one (along with other master components such as the ResourceManager and the first and additional NameNodes)?
Does this apply to other services, such as Kafka servers, as well?
Can someone explain the pros and cons of the above structure?
In a classic production-level HA setup, you need both HDFS and the RM running redundantly; see the 2 attached screenshots. These 2 master components can be co-located: HDFS + RM (active) + ZooKeeper on one master host, HDFS + RM (standby) + ZooKeeper on the other master, and an extra ZooKeeper on a third node. ZooKeeper isn't resource intensive like Kafka, so it can run on a node with lower specs. Most importantly, people tend to forget the network, which should also be redundant: the racks should be wired to redundant routers and switches.
In brief, you MUST have at least 3 ZooKeeper servers at all costs.
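To make the "at least 3" rule concrete, here is a minimal sketch of the majority arithmetic behind a ZooKeeper ensemble. It assumes the default majority quorum; the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble (assumes default majority quorums)."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes, including the 4-node layout from the question.
for n in (1, 2, 3, 4, 5):
    print(f"{n} servers: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under that assumption, 4 ZooKeeper servers survive only one failure, the same as 3, so a fourth ZooKeeper instance adds load without adding fault tolerance. An odd count (3 or 5) is the usual choice.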
Please refer to this HCC Kafka document to understand the challenges involved in a production Kafka setup.
Hi, what if I want to add HBase (how many master nodes?), Kafka, and NiFi services as well? Should I keep them separate from the master hosts of HDFS, YARN, and ZooKeeper?
Hi, it depends on the available resources of the nodes where you are planning to install all the services. If your master nodes have enough resources, you can have HDFS, YARN, ZooKeeper, and HBase on the same nodes. For Kafka and NiFi, I recommend installing them on separate machines.
If you have 3 master nodes:
HBase is a very memory-hungry application. Each node in an HBase installation, called a RegionServer, keeps a number of regions, or chunks of your data, in memory (if caching is enabled). Ideally, the whole table would be kept in memory, but this is not possible with a TB-scale dataset.
High Availability for HBase features the following functionality:
All HBase API and region operations are supported, including scans, region split/merge, and META table support (the META table stores information about regions)
However, consider carefully the following costs associated with using High Availability features:
Having said that, it is not a good idea to co-locate the HBase Masters on the same hosts as the NN/RM HA services; place them on the data nodes instead. The reason you want co-location is data locality for the HDFS client reads that the RegionServer performs. Over time, HDFS will arrange for the data being read to be replicated to the machines it is read from, i.e. the RegionServers.