Created on 10-25-2013 09:24 AM - edited 09-16-2022 01:49 AM
Hi, I'm new to Cloudera Standard. We [will] have a 7 node cluster (5 TaskTrackers + 2 NameNodes/JobTrackers). In addition, I have one VM to host Cloudera Manager. I've run through the installation process a few times to get familiar with it, and my question is how best to distribute the role assignments.
We need HDFS, HBase, MapReduce, Hive, Hue, Sqoop, and Impala. Can someone tell me if I'm on the right track with the assignments here?
My biggest questions lie with the Hive and HBase assignments across the nodes.
Any recommendations would be greatly appreciated! Thanks!
Created 10-27-2013 03:58 PM
Is this for proof of concept/discovery? What you are proposing would be a tightly packed cluster for most production environments, IMHO. Our account team provides Solutions Engineering / Solutions Architecture guidance for things like this as you begin to scope out revenue-generating activity with a cluster and want enterprise support.
Review our blog's discussion of hardware sizing here:
http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/
As a general rule of thumb, try to minimize the services co-located with the NameNodes (NNs). JournalNodes (JNs) and ZooKeeper (ZK) nodes are OK to co-locate with them; we recommend the JNs be co-located with the two NNs plus one additional node.
You have an overloaded mix of services here in the combination of HBase and MapReduce services on the same nodes (there will be battles for memory if you are under-sized).
If this is a dev or non-prod footprint, you can get by with what I'm proposing below... HBase can take a lot of memory, so you will want to monitor it. MR jobs are variable based on the complexity and size of what you are doing.
A Secondary NameNode (SNN) architecture is less safe than using NameNode High Availability (NN HA). The SNN does not do what you think it does. Read the Hadoop operations guide book (3rd edition) to get a better sense of this.
Once you enable NN HA you end up deploying three ZooKeeper instances and JournalNodes, so based on what you are presenting, you are saddling up for a future outage / loss of data if this ends up in prod this way, unless you are really, really careful (and even then you could get hit).
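For reference, here is a minimal sketch of what the QJM-based NN HA pieces end up looking like in hdfs-site.xml; the nameservice name and hostnames below are placeholders, and Cloudera Manager generates the real values for you when you enable HA:

    <!-- Sketch only: "mycluster", nn1host/nn2host, and jn1-jn3 are placeholder names -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>nn1host:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>nn2host:8020</value>
    </property>
    <!-- Three JournalNodes form the edits quorum (the "NNs + 1" placement above) -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
    </property>
    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>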
This footprint's viability really depends on your workload... so you might end up relocating things once you start observing activity. What I'm proposing below is, at best, a playpen configuration so you can kick the tires and check things out.
You are using three separate DB implementations with Impala (fast SQL performance), Hive (slower SQL but more standard SQL support), and HBase (column-oriented DB)... Does your design really require all three? (Research them a little more; add them if it makes sense after the initial deployment.) Hue is usually in the mix to facilitate end-user web-based access, too.
Realize you can move services after deployment. Note that decommissioning a DataNode takes a while, since its data blocks must be replicated off to the rest of the cluster; see the sketch below.
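If you do end up decommissioning a DataNode by hand (CM exposes this as a Decommission action), the underlying mechanism is an excludes file plus a refresh; the file path here is a placeholder:

    <!-- Sketch only: the excludes file path is a placeholder; the file lists one hostname per line -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>
    <!-- Then run "hdfs dfsadmin -refreshNodes" and wait for the node to
         report "Decommissioned" before pulling it from the cluster -->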
Read up on Hive and HiveServer2 ("Hive2"). HiveServer2 is for more secure implementations (plus other things); a rough sketch of one of the relevant knobs follows.
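For example, HiveServer2's authentication is pluggable via hive-site.xml; the values here are placeholders for illustration, not a recommendation:

    <!-- Sketch only: hive.server2.authentication can be NONE, KERBEROS, LDAP, or CUSTOM;
         the port shown is HiveServer2's default Thrift port -->
    <property>
      <name>hive.server2.authentication</name>
      <value>KERBEROS</value>
    </property>
    <property>
      <name>hive.server2.thrift.port</name>
      <value>10000</value>
    </property>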
Created 10-28-2013 03:20 PM
Thank you for the detailed response, Tgrayson. We probably will not be running all of these services in production, but I wanted to get a sense of where we would stand with our [new] cluster resources in production should that become the case. A few follow-up questions, assuming we did NOT run HBase in production:
Thank you kindly. This is quite helpful.
Created 10-28-2013 05:57 PM
#3 - You were presenting a config based on using the SNN instead of NN HA with ZK and JNs, thus the comment. The SNN is not present in an NN HA config; the NameNodes become NN (active) and NN (standby). Secondary NameNode is the older integration pattern. It's a horrible name for what it actually provided as functionality for protecting the cluster from failure (which was nothing); it was a place to offload work to.
#2 - You would need to evaluate the workload on the NN/TT nodes to decide whether you would be able to get by with that.
#1 - Yes, the 3rd node in the list would host the missing JN.
Understand there are implementations with Hive + HBase out there; you just want to avoid co-locating the NN with HBase services.
Todd
Created 10-29-2013 10:30 AM
Okay, I see. Well, I guess I could bring up a decent-sized VM to offload ZK, all of Hive + the Metastore, Sqoop, and the Impala StateStore daemon, instead of burdening the 3rd node on the list with them.
It's like a hodgepodge of services under the "Hadoop" umbrella now, which makes it tough to size if you haven't used them all together before.
Todd, thank you for your assistance.
Created 11-04-2013 01:41 PM
By the way, during the install via Cloudera Manager, when choosing "Inspect Role Assignments" you are forced to choose a Secondary NameNode rather than setting up HA. Am I missing something?
Created 11-04-2013 01:53 PM
Yes, that's the way the process still happens. Once you get the HDFS service installed and running, setting up HA is a separate workflow that lets you choose your fencing mechanism, manual or automatic failover, quorum JournalNodes, etc.
I believe this is the doc you will need.
Created 11-04-2013 02:16 PM
Thank you. I will restart my install again (testing on snapshotted VMs) because I'm stuck on a few errors. One is "Federated SecondaryNameNode secondarynamenode is not configured with a Nameservice", which I can't figure out how to clear via CM.
I presume the "Shared Edits Directory / dfs.namenode.shared.edits.dir" value is not needed if using quorum-based HA? I'm used to using a common NFS directory, but it looks like that isn't needed any longer.
I don't want to turn this thread into an installation support ticket for your sake, so that is my final question. I'll work through the rest. Thank you.
Created 11-04-2013 02:22 PM
You are correct. It's either the NFS-based shared edits directory OR the QJM-based HA config.
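To make the either/or concrete, here is a hedged sketch of the two forms that property takes (hosts and paths are placeholders; the CM HA wizard fills in the real values):

    <!-- Sketch only: NFS-based shared edits (the older pattern) -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/filer/ha-name-dir-shared</value>
    </property>

    <!-- Sketch only: QJM-based HA replaces that with a JournalNode quorum URI -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
    </property>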
Created 11-05-2013 10:00 AM
Just an update. Although I don't want to, I may have to try a manual install instead of using CM, or maybe just install vanilla Hadoop. The CM installation just seems really inconsistent and throws errors without any explanation. For instance, I am now stuck on "Starting your Cluster Services": it always fails when trying to start the ZK service, and the NameNode formatting always fails as well (according to the CM wizard). Then I am stuck on this screen and can't continue. There are no errors up to this point, and this is my 7th try installing via CM, trying different scenarios and hoping to get a clean finish.
The ZK error in the stderr log file is "Unable to access datadir, exiting abnormally", although when I view that dir, I can see that it has written the "myid" file. I even changed the owner of that dir to the zookeeper user/group, but no dice. Ugh.