
Help with Role Assignments / New Install

Contributor

Hi, I'm new to Cloudera Standard.  We [will] have a 7-node cluster (5 TaskTrackers + 2 NameNodes/JobTrackers).  In addition, I have one VM to host Cloudera Manager.  I've run through the installation process a few times to get familiar with it, and my question is how best to distribute the role assignments.

 

We need HDFS, HBase, MapReduce, Hive, Hue, Sqoop, and Impala.  Can someone tell me if I'm on the right track with the assignments here?

 

  1. NameNode:  JobTracker, Sqoop, Impala StateStore Daemon, Cloudera Mgmt Services (Monitors)
  2. Secondary NN:  JobTracker, (put other services here?)
  3. DataNode01:  TaskTracker, Impala Daemon
  4. DataNode02:  TaskTracker, Impala Daemon
  5. DataNode03:  TaskTracker, Impala Daemon
  6. DataNode04:  TaskTracker, Impala Daemon
  7. DataNode05:  TaskTracker, Impala Daemon
  8. Admin01 (VM):  Cloudera Manager

My biggest questions lie with the Hive and HBase assignments across the nodes:

  • Hive Gateway
  • Hive MetaStore
  • HiveServer2
  • HBase Master
  • HBase RegionServer
  • HBase Thrift Server (not sure if I need it)

Any recommendations would be greatly appreciated!  Thanks!

 

1 ACCEPTED SOLUTION

avatar
Master Collaborator

Is this for proof of concept/discovery?  This would be a tightly packed cluster for most production environments, IMHO.  Our account team provides Solutions Engineering/Solutions Architecture guidance for things like this as you begin to scope out revenue-generating activity with a cluster and want enterprise support.

 

Review our blog's discussion of hardware sizing:

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/

 

As a general rule of thumb, try to minimize the services co-located with the NameNodes (NN).  JournalNodes (JNs) and ZooKeeper (ZK) nodes are OK to co-locate with them; we recommend the JNs be co-located with the two NNs plus one additional node.
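To make that concrete, here is a minimal sketch of how the JN placement surfaces in hdfs-site.xml once HA is enabled.  The hostnames (nn1, nn2, master1) and the nameservice name (mycluster) are placeholders, not anything from your cluster:

  <!-- hdfs-site.xml: edits written to JNs on both NNs plus one more node -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://nn1:8485;nn2:8485;master1:8485/mycluster</value>
  </property>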

 

You have an overloaded mix of services here in the combination of HBase and MapReduce services on the same nodes (there will be battles for memory if you are under-sized).

 

If this is a dev or non-prod footprint, you can get by with what I'm proposing below... HBase can take a lot of memory, so you want to monitor it.  MR job demands vary with the complexity and size of what you are doing.
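As an illustration of the knobs involved (example values only, not a tuning recommendation), you can bound the MRv1 footprint per node in mapred-site.xml so TaskTracker slots can't starve a co-located HBase RegionServer:

  <!-- mapred-site.xml: cap task slots and per-task heap; numbers are examples -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1g</value>
  </property>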

 

The Secondary NameNode (SNN) architecture is less safe than NameNode High Availability (NN HA).  The SNN does not do what you think it does.  Read the Hadoop operations guide (3rd edition) to get a better sense of this.

 

Once you enable NN HA, you end up deploying 3 ZooKeeper instances and JournalNodes.  Based on what you are presenting, you are saddling up for a future outage / loss of data if this ends up in prod this way, unless you are really, really careful (and even then you could get hit).
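For reference, the quorum that automatic failover leans on ends up looking roughly like this (hostnames are placeholders again):

  <!-- core-site.xml: the 3-node ZK ensemble used for NN failover -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>nn1:2181,nn2:2181,master1:2181</value>
  </property>
  <!-- hdfs-site.xml: let the ZKFailoverController handle failover automatically -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>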


This footprint's viability really depends on your workload... so you might end up relocating things once you start observing activity.  What I'm proposing below is, at best, a playpen configuration so you can kick the tires and check things out.

 

You are using 3 separate DB implementations: Impala (fast SQL performance), Hive (slower SQL but broader standard SQL support), and HBase (a column-oriented DB)... Does your design really require all 3?  Research them a little more, and add them if it makes sense after initial deployment.  Hue is usually in the mix to facilitate end-user web-based access, too.


Realize you can move services after deployment.  Note that decommissioning a DataNode takes a while, since its data blocks must be replicated off to the rest of the cluster.


Read up on Hive's HiveServer vs. HiveServer2.  HiveServer2 is for more secure implementations (plus other improvements).

 

  1. NN (active):  JournalNode, ZooKeeper, JobTracker
  2. NN (standby):  JournalNode, ZK, JobTracker (if using JT HA?)
  3. HBase Master:  ZK, all Hive roles + Metastore, Sqoop, Impala StateStore Daemon
  4. DataNode01:  TaskTracker, Impala Daemon, HBase RegionServer
  5. DataNode02:  TaskTracker, Impala Daemon, HBase RegionServer
  6. DataNode03:  TaskTracker, Impala Daemon, HBase RegionServer
  7. DataNode04:  TaskTracker, Impala Daemon, HBase RegionServer
  8. Admin01 (VM):  Cloudera Manager, Cloudera Mgmt Services (Monitors), DB for monitoring.  Note that this will be a "heavy" VM with regard to network I/O and disk I/O as cluster activity scales.


11 REPLIES


Contributor

Thank you for the detailed response, Tgrayson.  We probably will not be running all of these services in production, but I wanted to get a sense of where we would be with our [new] cluster resources in production should that become the case.  A few follow-up questions, assuming we did NOT run HBase in production:

 

  1. Could the server you designated as HBase Master be transitioned back to DataNode05 and assigned the TaskTracker role, or do we still need a dedicated node for ZK, all Hive roles + Metastore, Sqoop, and the Impala StateStore Daemon?
  2. I only see two JNs... should there be a 3rd?
  3. I'm not sure why you say this:  "Once you enable NN HA you end up deploying 3 ZooKeeper instances and JournalNodes, so based on what you are presenting you are saddling up for future outage / loss of data."  Why is that the case if I enable HA with 3 JNs and 3 ZKs?

Thank you kindly. This is quite helpful.

Master Collaborator

 

#3 - You were presenting a config based on using the SNN instead of NN HA with ZK & JNs, hence the comment.  The SNN is not present in an NN HA config; the NameNodes become NN (active) and NN (standby).  The Secondary NN is the older integration pattern.  It's a horrible name given what it actually provided in terms of protecting the cluster from failure (which was nothing); it was just a place to offload work to.

#2 - Yes, the 3rd node in the list would host the missing JN.

#1 - You would need to evaluate the workload on the NN/TT nodes to decide whether you could get by with that.

 

Understand there are implementations with Hive + HBase out there; you just want to avoid co-locating the NN with HBase services.

 

Todd

Contributor

Okay, I see.  Well, I guess I could bring up a decent-sized VM to offload ZK, all Hive roles + Metastore, Sqoop, and the Impala StateStore Daemon instead of burdening the 3rd node on the list with them.

 

It's a hodgepodge of services under the "Hadoop" umbrella now, which makes it tough to size if you haven't used them all together before.

 

Todd, thank you for your assistance.

Contributor

By the way, during the install via Cloudera Manager, when choosing "Inspect Role Assignments" you are forced to choose a Secondary NameNode rather than setting up HA.  Am I missing something?

Guru

Yes, that's the way the process still happens.  Once you get the HDFS service installed and running, setting up HA is a separate workflow that lets you choose your fencing mechanism, manual or automatic failover, quorum JournalNodes, etc.
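As a rough sketch of what that workflow writes out for the fencing piece (sshfence is just one of the available methods, and the key path below is a placeholder):

  <!-- hdfs-site.xml: fence the old active NN over SSH before failing over -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hdfs/.ssh/id_rsa</value>
  </property>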

 

I believe this is the doc you will need.

Contributor

Thank you.  I will restart my install (testing on snapshotted VMs) because I'm stuck on a few errors.  One is "Federated SecondaryNameNode secondarynamenode is not configured with a Nameservice", which I can't figure out how to remove via CM.

 

I presume the "Shared Edits Directory / dfs.namenode.shared.edits.dir" value is not needed if using Quorum-based HA?  I'm used to using a common NFS directory, but it looks like that isn't needed any longer.

 

I don't want to turn this thread into an installation support ticket for your sake, so that is my final question.  I'll work through the rest.  Thank you.

Guru

You are correct.  It's either the NFS-based shared edits directory OR the QJM-based HA config.
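In other words, the same property takes one of two shapes depending on which approach you pick; the path and hostnames below are placeholders:

  <!-- hdfs-site.xml: NFS-based shared edits (older approach)... -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>/mnt/nfs/namenode-shared-edits</value>
  </property>

  <!-- ...OR QJM-based HA; configure one or the other, never both -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>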

Contributor

Just an update.  Although I don't want to, I may have to try a manual install instead of using CM, or maybe just install vanilla Hadoop.  The CM installation just seems really inconsistent and throws errors without any explanation.  For instance, I am now stuck on "Starting your Cluster Services": it always fails when trying to start the ZK service, and the NameNode formatting always fails as well (according to the CM wizard).  Then I am stuck on this screen and can't continue.  There are no errors up to that point, and this is my 7th try installing via CM, trying different scenarios hoping to get a clean finish.

 

The ZK error in the stderr log file is "Unable to access datadir, exiting abnormally", although when I view that dir, I can see that it has written the "myid" file.  I even changed the owner of that dir to the zookeeper user/group, but no dice.  Ugh.
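For reference, what ZK checks at startup is roughly this (the dataDir below is a common default; the real path is whatever CM configured):

  # zoo.cfg
  dataDir=/var/lib/zookeeper
  clientPort=2181
  # The zookeeper user needs execute (traverse) permission on every parent
  # directory of dataDir, plus read/write on the dir itself; a chown on the
  # leaf dir alone won't help if a parent dir or an SELinux context blocks it.
  # The myid file being present only proves an earlier write (possibly run
  # as root by the wizard) succeeded, not that the ZK process can write now.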