I currently have a three node cluster with the following configuration:
Master Node: NameNode, SecondaryNameNode, JobTracker, HMaster,
Slave Node: DataNode, TaskTracker, HRegionServer, Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop, Hive MetaStore, Database for Hive/Sqoop, HiveServer2, HCatalog, Oozie Server, Zookeeper,
Slave Node: DataNode, TaskTracker, HRegionServer, Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
I wish to make a more realistic cluster. I have access to enough VMs to use any number of nodes. But with efficiency in mind, I am interested in knowing what the minimum number of nodes is necessary to have a "healthy" cluster, as per Cloudera Manager guidelines.
I was thinking of something like:
Master 0: Name Node Master 1: Secondary NameNode Master 2: JobTracker Master 3: HMaster Slave 0: DataNode, TraskTracker, HRegionServer Slave 1: DataNode, TraskTracker, HRegionServer Slave 2: DataNode, TraskTracker, HRegionServer Hive/Catalog Node: Hive MetaStore, Sqoop MetaStore MySQL/PostgreSQL Database for Hive/Sqoop, HCatalog, HiveServer (Or is it better to break HiveServer into its own node?) Oozie-Server (Or is it better to break Oozie-server into its own node?) Zookeeper Ensemble: 3 Nodes with Zookeper installed
Client Node: Oozie-client, Hive-client, pig-client, M/R client tools, Sqoop
I've used the same MySQL instance for my Hive database and my Oozie database, and figured that would be ok to do again. I'm also figuring the HiveServer and the Oozie-server can run on the same host as the Hive/Oozie MetaStore, along with HCatalog.
Right now on my three node cluster, I have installed all the client software on every node so I can execute M/R, Hive, Oozie, HBase, Pig, etc. client calls from any node. Are these client tools supposed to be executed on a node separate from the Master and Slave nodes? Speaking of which, I have been putting all my java/python/pig code on the Master Node in my three node cluster. Is this data also better put on a separate client node?
I've been told that:
You can colocate the RegionServers with the Zookeeper nodes, just give the Zookeeper nodes their own disk.
Be careful co-locating the TaskTrackers and RegionServers, especially if most of your HBase usage is scan heavy. Both processes are quite CPU and memory intensive and can lead to resource contention issues. This page has more details on what to do in this situation
As far as code organization and client set up goes,that's really your call. I personally prefer setting up a few gateway nodes which have all of the configuration for talking to hive, hbase, etc. and running jobs from there, but again there is no perfect answer for that.
Am I on the right path here? What is the proper way to to make the smallest but "ideal" healthy cluster?
I imagine that Cloudera Manager will configure this all for me, except for maybe the client node. How many Nodes should I give to Cloudera Manager?
The minimum # of nodes you'll need really depends on what kind of load you'll be placing on this cluster, and how much data it'll be hosting. If it's just going to be a lightly used test or development cluster with scaled down workloads/data, you can get away with a lot smaller node count than if you really want to do serious performance testing/production readiness type of workloads.
That being said, I'll comment on your proposed cluster:
1) You can probably get away with just two master nodes. Namenode on Master0, the rest of the Master roles on Master1. These roles are relatively lightweight in the big scheme of things, they just need to be on reliable systems.
2) I like your suggestion of 3 slave nodes as this enables HDFS to build 3 replicas of blocks, each on it's own node. I also think you're fine with the 3 roles you've placed on those nodes (DN, TT, RS), but I think you can also do away with your separate ZK ensemble and just add a ZK server to each of these slave nodes too. It's good to have 3 separate ZK instances on separate machines, but ZK is pretty lightweight and can be co-located with your other roles in a starter cluster.
3) Depending on what kind of workloads you're planning, I'm not sure that you couldn't just place your Hive/Oozie servers/dbs on one of your Master nodes too. I don't see anything wrong with your suggestion of placing all of these on their own unique server, though. I wouldn't think you'd need more than one server for these roles right away, however.
4) a separate "Gateway" machine which handles all your client interactions with the cluster is a good idea. You can place your custom applications there. CM supports creating this machine for you, it's a role called "Gateway" that you can add at your leisure.
Finally, the advice you were given is accurate. But keep in mind that it's not just placing task trackers and regionservers together that requires caution. It all depends on how you are using them. I've seen lots of successful production clusters where these roles are colocated, you just have to CAREFULLY plan and test any use cases that will be heavily loading either MR, HBase, or both as these heavy workloads can lead to resource contention if you're not careful. Tuning can be done to make them play nicely, though.
...just a small addition to Clint's great explanation:
For the CM you just need one server, and regular backups of the CM-database (nevertheless which one you use) ;)
A crashed CM server doesn't interrupt your Hadoop cluster, the mentioned backup just ensures to recreate a CM instance in case of a failure.