Need a recommendation on setting up an HDP 2.5 multi-node cluster using Ambari (Azure VMs).
Nodes -> 1 Edge Node, 2 NameNodes (Primary and Secondary), 3 DataNodes
The services I am planning to install are as follows:
Master Node: NameNode, ResourceManager, HBase Master, Oozie Server, ZooKeeper Server, PostgreSQL (for the Ambari and Ranger DBs), Ranger, Apache Atlas, Apache Spark, History Server, Spark History Server
Secondary Node: Secondary NameNode, HiveServer2, MySQL, WebHCat Server, Hive Metastore (PostgreSQL DB)?
Data Nodes: DataNode, NodeManager, RegionServer
Gateway Node: all the clients (HDFS, Hive, Spark, Pig, Sqoop, Tez, YARN, HBase, etc.)
Later on I would be adding SAP HANA Vora services on top of these, so I want to decide on RAM, CPU, and disk sizes such that I don't run into space or memory issues.
What would be a good configuration in this case? How should I distribute the above services across the cluster nodes? And should I install the clients on all nodes?
Hi @rahul gulati,
It is hard to tell what the optimal sizing is for you, and I am afraid I cannot give you such numbers, since it depends on how much data you would like to store, what jobs you are planning to run, whether they are memory-, CPU-, or disk-intensive, etc. If you don't have a massive amount of data, then (since you are in the cloud) a good approach may be to start with some instances, and if they turn out to be too small, just launch a new cluster with larger master nodes and copy all of your data over.
Hortonworks has an HDCloud product, which is based on Cloudbreak, and there are sizing guidelines for that product which could be a good starting point for you: http://hortonworks.github.io/hdp-aws/create/index.html
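For the copy itself, DistCp is the standard tool for cluster-to-cluster HDFS transfers. A rough sketch of what that would look like (the hostnames and paths are placeholders, not from your setup):

```shell
# Copy an HDFS directory from the old cluster to the new one.
# Run this from the new cluster; distcp launches a MapReduce job
# that copies files in parallel.
hadoop distcp \
  hdfs://old-namenode.example.com:8020/apps/data \
  hdfs://new-namenode.example.com:8020/apps/data

# Sanity check: compare directory/file counts and total bytes on the target
hdfs dfs -count hdfs://new-namenode.example.com:8020/apps/data
```

Note this only covers HDFS data; component metadata (Hive Metastore, Oozie, Ranger, Ambari) lives in the relational databases, which you would dump and restore separately.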
Thanks for replying. We would be provisioning the cluster using Ambari instead of Cloudbreak, and we would not be handling a very large data volume, around 1-2 TB. I noticed you mentioned that in the cloud we can start new instances with an increased configuration (RAM and storage). That sounds great. I just wanted to ask what approach would be preferable in that case to back up the data from the already-running cluster to the new cluster?
There would be OS disks, persistent data disks, and files to be backed up. Are there any guidelines/reference links on backing up the data from the six already-running VMs to the new VMs? And what if the IPs/hostnames of the new VMs change? I think in that case we would need to install the Ambari server and Ambari agents again. Please correct me if I am wrong?
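From what I have read, it might be enough to repoint the existing agents at the new Ambari server host rather than reinstalling; something like the following (the hostname below is just a placeholder for illustration), is that right?

```shell
# On each node, point the Ambari agent at the new Ambari server host
# by updating the [server] hostname in the agent config.
# "new-ambari-host.internal" is a placeholder, not a real hostname.
sed -i 's/^hostname=.*/hostname=new-ambari-host.internal/' \
  /etc/ambari-agent/conf/ambari-agent.ini

# Restart the agent so it re-registers with the server
ambari-agent restart
```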