
Help with Role Assignments / New Install

Contributor

Hi, I'm new to Cloudera Standard.  We [will] have a 7-node cluster (5 TaskTrackers + 2 NameNodes/JobTrackers).  In addition, I have one VM to host Cloudera Manager.  I've run through the installation process a few times to get familiar with it, and my question is how best to distribute the role assignments.

 

We need HDFS, HBase, MapReduce, Hive, Hue, Sqoop, and Impala.  Can someone tell me if I'm on the right track with the assignments here?

 

  1. NameNode:  JobTracker, Sqoop, Impala StateStore Daemon, Cloudera Mgmt Services (Monitors)
  2. Secondary NN:  JobTracker, (put other services here?)
  3. DataNode01:  TaskTracker, Impala Daemon
  4. DataNode02:  TaskTracker, Impala Daemon
  5. DataNode03:  TaskTracker, Impala Daemon
  6. DataNode04:  TaskTracker, Impala Daemon
  7. DataNode05:  TaskTracker, Impala Daemon
  8. Admin01 (VM):  Cloudera Manager

My biggest questions lie with the Hive and HBase assignments across the nodes:

  • Hive Gateway
  • Hive MetaStore
  • HiveServer2
  • HBase Master
  • HBase RegionServer
  • HBase Thrift Server (not sure if I need it)

Any recommendations would be greatly appreciated!  Thanks!

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Is this for proof of concept/discovery?  This would be a tightly packed cluster for most production environments, IMHO.  Our account team provides Solutions Engineering/Solutions Architecture guidance for things like this as you begin to scope out revenue-generating activity with a cluster and want enterprise support.

 

Review our blog's discussion of hardware sizing:

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

 

As a general rule of thumb, try to minimize the services co-located with the NameNodes (NN).  JournalNodes (JNs) and ZooKeeper (ZK) nodes are OK to co-locate with them; we recommend the JNs be co-located with the two NNs plus one additional node.
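To make that concrete, here is a minimal sketch of how the JN placement surfaces in hdfs-site.xml once HA is enabled.  The hostnames (nn1, nn2, master1) and the nameservice name (mycluster) are placeholders, not anything from your cluster:

  <!-- hdfs-site.xml: edits written to JNs on both NNs plus one more node -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://nn1:8485;nn2:8485;master1:8485/mycluster</value>
  </property>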

 

You have an overloaded mix of services here in the combination of HBase and MapReduce services on the same nodes (there will be battles for memory if you are under-sized).

 

If this is a dev or non-prod footprint, you can get by with what I'm proposing below... HBase can take a lot of memory, so you want to monitor it.  MR job demands vary with the complexity and size of what you are doing.
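As an illustration of the knobs involved (example values only, not a tuning recommendation), you can bound the MRv1 footprint per node in mapred-site.xml so TaskTracker slots can't starve a co-located HBase RegionServer:

  <!-- mapred-site.xml: cap task slots and per-task heap; numbers are examples -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1g</value>
  </property>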

 

The Secondary NameNode (SNN) architecture is less safe than NameNode High Availability (NN HA).  The SNN does not do what you think it does.  Read the Hadoop operations guide (3rd edition) to get a better sense of this.

 

Once you enable NN HA, you end up deploying 3 ZooKeeper instances and JournalNodes.  Based on what you are presenting, you are saddling up for a future outage / loss of data if this ends up in prod this way, unless you are really, really careful (and even then you could get hit).
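For reference, the quorum that automatic failover leans on ends up looking roughly like this (hostnames are placeholders again):

  <!-- core-site.xml: the 3-node ZK ensemble used for NN failover -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>nn1:2181,nn2:2181,master1:2181</value>
  </property>
  <!-- hdfs-site.xml: let the ZKFailoverController handle failover automatically -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>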


This footprint's viability really depends on your workload... so you might end up relocating things once you start observing activity.  What I'm proposing below is, at best, a playpen configuration so you can kick the tires and check things out.

 

You are using 3 separate DB implementations: Impala (fast SQL performance), Hive (slower SQL but broader standard SQL support), and HBase (a column-oriented DB)... Does your design really require all 3?  Research them a little more, and add them if it makes sense after initial deployment.  Hue is usually in the mix to facilitate end-user web-based access, too.


Realize you can move services after deployment.  Note that decommissioning a DataNode takes a while, since its data blocks must be replicated off to the rest of the cluster.


Read up on Hive's HiveServer vs. HiveServer2.  HiveServer2 is for more secure implementations (plus other improvements).

 

  1. NN (active):  JournalNode, ZooKeeper, JobTracker
  2. NN (standby):  JournalNode, ZK, JobTracker (if using JT HA?)
  3. HBase Master:  ZK, all Hive roles + Metastore, Sqoop, Impala StateStore Daemon
  4. DataNode01:  TaskTracker, Impala Daemon, HBase RegionServer
  5. DataNode02:  TaskTracker, Impala Daemon, HBase RegionServer
  6. DataNode03:  TaskTracker, Impala Daemon, HBase RegionServer
  7. DataNode04:  TaskTracker, Impala Daemon, HBase RegionServer
  8. Admin01 (VM):  Cloudera Manager, Cloudera Mgmt Services (Monitors), DB for monitoring.  Note that this will be a "heavy" VM with regard to network I/O and disk I/O as cluster activity scales.


11 REPLIES


Contributor

Thank you for the detailed response, Tgrayson.  We probably will not be running all of these services in production, but I wanted to get a sense of where we would be with our [new] cluster resources in production should that become the case.  A few follow-up questions, assuming we did NOT run HBase in production:

 

  1. Could the server you designated as HBase Master be transitioned back to DataNode05 and assigned the TaskTracker role, or do we still need a dedicated node for ZK, all Hive roles + Metastore, Sqoop, and the Impala StateStore Daemon?
  2. I only see two JNs... should there be a 3rd?
  3. I'm not sure why you say this:  "Once you enable NN HA you end up deploying 3 ZooKeeper instances and JournalNodes, so based on what you are presenting you are saddling up for future outage / loss of data."  Why is that the case if I enable HA with 3 JNs and 3 ZKs?

Thank you kindly. This is quite helpful.

Master Collaborator

 

#3 - You were presenting a config based on using the SNN instead of NN HA with ZK & JNs, hence the comment.  The SNN is not present in an NN HA config; the NameNodes become NN (active) and NN (standby).  The Secondary NN is the older integration pattern.  It's a horrible name given what it actually provided in terms of protecting the cluster from failure (which was nothing); it was just a place to offload work to.

#2 - Yes, the 3rd node in the list would host the missing JN.

#1 - You would need to evaluate the workload on the NN/TT nodes to decide whether you could get by with that.

 

Understand there are implementations with Hive + HBase out there; you just want to avoid co-locating the NN with HBase services.

 

Todd

Contributor

Okay, I see.  Well, I guess I could bring up a decent-sized VM to offload ZK, all Hive roles + Metastore, Sqoop, and the Impala StateStore Daemon instead of burdening the 3rd node on the list with them.

 

It's a hodgepodge of services under the "Hadoop" umbrella now, which makes it tough to size if you haven't used them all together before.

 

Todd, thank you for your assistance.

Contributor

By the way, during the install via Cloudera Manager, when choosing "Inspect Role Assignments" you are forced to choose a Secondary NameNode rather than setting up HA.  Am I missing something?

Guru

Yes, that's the way the process still happens.  Once you get the HDFS service installed and running, setting up HA is a separate workflow that lets you choose your fencing mechanism, manual or automatic failover, quorum JournalNodes, etc.
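As a rough sketch of what that workflow writes out for the fencing piece (sshfence is just one of the available methods, and the key path below is a placeholder):

  <!-- hdfs-site.xml: fence the old active NN over SSH before failing over -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hdfs/.ssh/id_rsa</value>
  </property>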

 

I believe this is the doc you will need.

Contributor

Thank you.  I will restart my install (testing on snapshotted VMs) because I'm stuck on a few errors.  One is "Federated SecondaryNameNode secondarynamenode is not configured with a Nameservice", which I can't figure out how to remove via CM.

 

I presume the "Shared Edits Directory / dfs.namenode.shared.edits.dir" value is not needed if using Quorum-based HA?  I'm used to using a common NFS directory, but it looks like that isn't needed any longer.

 

I don't want to turn this thread into an installation support ticket for your sake, so that is my final question.  I'll work through the rest.  Thank you.

Guru

You are correct.  It's either the NFS-based shared edits directory OR the QJM-based HA config.
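In other words, the same property takes one of two shapes depending on which approach you pick; the path and hostnames below are placeholders:

  <!-- hdfs-site.xml: NFS-based shared edits (older approach)... -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>/mnt/nfs/namenode-shared-edits</value>
  </property>

  <!-- ...OR QJM-based HA; configure one or the other, never both -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>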

Contributor

Just an update.  Although I don't want to, I may have to try a manual install instead of using CM, or maybe just install vanilla Hadoop.  The CM installation just seems really inconsistent and throws errors without any explanation.  For instance, I am now stuck on "Starting your Cluster Services": it always fails when trying to start the ZK service, and the NameNode formatting always fails as well (according to the CM wizard).  Then I am stuck on this screen and can't continue.  There are no errors up to that point, and this is my 7th try installing via CM, trying different scenarios hoping to get a clean finish.

 

The ZK error in the stderr log file is "Unable to access datadir, exiting abnormally", although when I view that dir, I can see that it has written the "myid" file.  I even changed the owner of that dir to the zookeeper user/group, but no dice.  Ugh.
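For reference, what ZK checks at startup is roughly this (the dataDir below is a common default; the real path is whatever CM configured):

  # zoo.cfg
  dataDir=/var/lib/zookeeper
  clientPort=2181
  # The zookeeper user needs execute (traverse) permission on every parent
  # directory of dataDir, plus read/write on the dir itself; a chown on the
  # leaf dir alone won't help if a parent dir or an SELinux context blocks it.
  # The myid file being present only proves an earlier write (possibly run
  # as root by the wizard) succeeded, not that the ZK process can write now.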