Created 06-29-2016 11:22 AM
Dear folks,
I am currently trying to set up HDP 2.4 on a small cluster for PoC activities, but I am unsure what heuristics to use for assigning masters, slaves, and clients after launching the install wizard. I started with the documentation provided here: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/ch_Getting_Re...
Description of cluster:
Small cluster with 8 machines, each with 8 GB RAM, 6-8 cores, and 500 GB of disk. One machine is used for Ambari; the remaining 7 machines are for the NameNode, Secondary NameNode, and DataNodes. All nodes run CentOS 6. Availability and reliability are not a concern, as this is a PoC cluster where some algorithms will be tested for functionality.
Frameworks required on the cluster: Hadoop, Hive, Pig, Oozie, HBase, ZooKeeper, Spark, Storm, Sqoop, and Kafka
To get my feet wet, I chose Ambari and HDP 2.4.0, and the ease of deploying a cluster has been a positive experience so far, thanks to the nice documentation and my decent knowledge of Linux.
Going forward, I would like to know from experts what heuristics and logic they use for assigning masters and slaves. Most of the resources I have found in this community and elsewhere discuss heuristics based on system configuration (RAM, memory, and cores); they are reasoned out for heterogeneous clusters, and their takeaways are important heuristics that can make clusters efficient.
But given a homogeneous cluster, I am totally clueless about how to proceed.
Any concrete or abstract ideas are much appreciated.
Best Regards,
Rahul
Created 06-29-2016 04:33 PM
@Rahul Mishra You may want to start with this documentation http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_cluster-planning-guide/content/ch_hardwar.... For small clusters like yours where HA isn't a concern, you are basically dealing with only two types of nodes: masters and workers. I certainly wouldn't over-architect it. For an 8-node cluster you would have your Ambari Server, which can also hold your client services, 2 master nodes, and finally 5 worker nodes.
In a homogeneous cluster like yours, where each node has low resources, your primary concern is avoiding the co-location of services that compete for the same type of resource. For example, it would be fine to have an in-memory service like Spark coexist with a more IO-intensive service, but not two memory-intensive services on the same node.
In your case you'll just have to build it out, monitor it, and be aware that running certain operations together may cause performance issues. The good thing about HDP is its ability to scale, so you are never really "locked in" to a particular architecture.
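One rough sanity check before committing to a layout (just a sketch; the heap sizes below are illustrative assumptions, not HDP defaults): add up the heap you intend to give each co-located service and compare the total with the node's 8 GB.

```python
# Back-of-envelope co-location check for one 8 GB node. The heap figures
# are illustrative assumptions, NOT HDP defaults -- substitute the values
# from your own service configs.
NODE_RAM_GB = 8

# Hypothetical heaps (GB) for services planned on the same master node.
planned_heaps = {
    "NameNode": 1.0,
    "HBase Master": 1.0,
    "ZooKeeper": 0.5,
    "Ambari Server": 1.0,
    "OS + client tools": 2.0,
}

used = sum(planned_heaps.values())
print(f"Planned {used:.1f} GB of {NODE_RAM_GB} GB")
if used > 0.8 * NODE_RAM_GB:
    print("Less than 20% headroom left; consider moving a service elsewhere.")
```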
Created 06-30-2016 05:26 AM
@Scott Shaw Thanks for the prompt answer!
Among the frameworks required for this cluster (Hadoop, Hive, Pig, Oozie, HBase, ZooKeeper, Spark, Storm, Sqoop, and Kafka), is there any classification based on IO, computation, and memory intensiveness? I might be wrong, but are there frameworks in the list that would be both IO intensive and memory intensive?
Regards,
Rahul
Created 06-29-2016 06:43 PM
Following is an example based on the above comment:
Node1: Ambari Server, Primary NameNode, ZooKeeper, HBase Master, Clients
Node2: Secondary/Standby NameNode, Hive services, Pig, Oozie, ZooKeeper
Node3: YARN, Spark, Sqoop, Kafka, Ambari Metrics Collector/Grafana, ZooKeeper
Node4-8: DataNode, NodeManager, HBase RegionServers, Clients
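If you want to capture that layout in a repeatable form, here is a minimal sketch of it as an Ambari Blueprint submitted over the Blueprint REST API. The blueprint name, credentials, and exact component names are assumptions based on Ambari 2.2.x / HDP 2.4; verify them against your stack before using this.

```python
# A minimal Blueprint sketch of the layout above, assuming Ambari 2.2.x's
# Blueprint REST API (POST /api/v1/blueprints/<name>) and HDP 2.4 component
# names -- verify both against your Ambari/stack version.
# Note: the Ambari Server itself is not a blueprint component, and Grafana
# only ships with the Metrics service in later Ambari releases.
import json
import requests

blueprint = {
    "Blueprints": {
        "blueprint_name": "poc-cluster",
        "stack_name": "HDP",
        "stack_version": "2.4",
    },
    "host_groups": [
        {"name": "master1", "cardinality": "1", "components": [
            {"name": "NAMENODE"}, {"name": "ZOOKEEPER_SERVER"},
            {"name": "HBASE_MASTER"}, {"name": "HDFS_CLIENT"}]},
        {"name": "master2", "cardinality": "1", "components": [
            {"name": "SECONDARY_NAMENODE"}, {"name": "HIVE_METASTORE"},
            {"name": "HIVE_SERVER"}, {"name": "PIG"},
            {"name": "OOZIE_SERVER"}, {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "master3", "cardinality": "1", "components": [
            {"name": "RESOURCEMANAGER"}, {"name": "SPARK_JOBHISTORYSERVER"},
            {"name": "SQOOP"}, {"name": "KAFKA_BROKER"},
            {"name": "METRICS_COLLECTOR"}, {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "workers", "cardinality": "5", "components": [
            {"name": "DATANODE"}, {"name": "NODEMANAGER"},
            {"name": "HBASE_REGIONSERVER"}, {"name": "HDFS_CLIENT"}]},
    ],
}

# Register the blueprint; Ambari requires the X-Requested-By header.
resp = requests.post(
    "http://ambari-host:8080/api/v1/blueprints/poc-cluster",
    auth=("admin", "admin"),
    headers={"X-Requested-By": "ambari"},
    data=json.dumps(blueprint),
)
print(resp.status_code, resp.text)
```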
Created 06-30-2016 05:27 AM
@vpoornalingam Many thanks for the prompt answer!
Created 06-30-2016 07:54 PM
You are welcome! Please accept @Scott Shaw's answer!
Created 07-01-2016 10:32 AM
@vpoornalingam One more question: where should I put the History Server, App Timeline Server, and ResourceManager?
Created 12-08-2017 10:01 PM
May I ask why you want to install the clients on Node1? It is a master node, right? Installing them on Node4-8 is fine, since clients can access the services from there; my question was why put them specifically on a master node.
Could you provide the reasoning, if possible?
Thanks
Created 07-06-2016 05:04 AM
The History Server, App Timeline Server, and ResourceManager are YARN and MapReduce master components; in @Venkat's layout above, they would go on Node3. Also, I suggest placing Kafka on a different server, as it is an ingestion component.
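For completeness, here is a sketch of registering those three components on Node3 through Ambari's REST API rather than the wizard. The cluster name "poc", host name "node3", and default admin credentials are assumptions; adjust them to your environment.

```python
# Sketch: adding the YARN/MapReduce master components to node3 via
# Ambari's host_components REST endpoint (Ambari 2.2.x).
import requests

AMBARI = "http://ambari-host:8080/api/v1"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

for component in ("RESOURCEMANAGER", "HISTORYSERVER", "APP_TIMELINE_SERVER"):
    # Add the component to the host...
    requests.post(
        f"{AMBARI}/clusters/poc/hosts/node3/host_components/{component}",
        auth=AUTH, headers=HEADERS,
    )
    # ...then ask Ambari to install it (start it from the UI, or with
    # another PUT setting state to STARTED once installation finishes).
    requests.put(
        f"{AMBARI}/clusters/poc/hosts/node3/host_components/{component}",
        auth=AUTH, headers=HEADERS,
        json={"HostRoles": {"state": "INSTALLED"}},
    )
```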
Created 04-18-2017 11:30 AM
Hi, this discussion is valuable, as the HDP documentation has nearly no information about small-cluster planning and the placement of HDP components across servers...
How would you reconsider the configuration and component placement if we had 3 master nodes and (to start) 3 data nodes? The number of data nodes will grow as needed. For the master nodes we are planning 32 GB RAM and 250 GB HDD (more memory and disk if needed); for the data nodes, 24 GB RAM and 8 TB HDD. Redundancy/HA of all components is a must, as this configuration will be used in a production environment: the failure, restart, or unavailability of a single node (even a master) must not disrupt any functionality. As a redundant database for all components requiring DB access, there will be a 3-node MySQL active cluster (probably Percona XtraDB Cluster), located on the master nodes.