Created 05-04-2016 06:51 PM
Need recommendation for a small 7 Node cluster. Below is what I am planning to do:
MasterNode: NameNode, ResourceManager, HBase Master, Oozie Server, ZooKeeper Server
DataNode: DataNode, NodeManager, RegionServer
Web interface: Ambari server / HUE interface / Zeppelin / Ranger
Gateway Node: All the clients (HDFS, Hive, Spark, Pig, Mahout, Tez etc)
SecondaryNode: Secondary NameNode, HiveServer2, MySQL, WebHCat server, HiveMetaStore
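As a sketch, the layout above can be written down as the kind of host-group map an Ambari blueprint uses. The group and component names here are illustrative (loosely following Ambari's naming), not a real blueprint:

```python
# Hypothetical host-group map for the proposed 7-node cluster.
# Component names loosely follow Ambari conventions; illustrative only.
layout = {
    "master": ["NAMENODE", "RESOURCEMANAGER", "HBASE_MASTER",
               "OOZIE_SERVER", "ZOOKEEPER_SERVER"],
    "secondary": ["SECONDARY_NAMENODE", "HIVE_SERVER", "MYSQL_SERVER",
                  "WEBHCAT_SERVER", "HIVE_METASTORE"],
    "web": ["AMBARI_SERVER", "HUE_SERVER", "ZEPPELIN_MASTER", "RANGER_ADMIN"],
    "gateway": ["HDFS_CLIENT", "HIVE_CLIENT", "SPARK_CLIENT",
                "PIG", "MAHOUT", "TEZ_CLIENT"],
    # identical worker nodes fill out the rest of the 7 machines
    "worker": ["DATANODE", "NODEMANAGER", "HBASE_REGIONSERVER"],
}

# Sanity check: no master service accidentally assigned to the worker group.
master_services = set(layout["master"]) | set(layout["secondary"])
assert not master_services & set(layout["worker"])
```

Writing the plan down this way makes it easy to spot colocation mistakes before installing anything.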
Any issues with this configuration? Also, do we need the clients on all the machines?
Should I go with HDP 2.3 or 2.4?
Thanks
Prakash
Created 05-04-2016 07:09 PM
"Should I go with HDP 2.3 or 2.4 ?"
I would tend toward 2.4 here, although it depends a bit on what you need. An older 2.3 release may provide some extra stability, but a lot of security features (Kerberos for Kafka -> Spark Streaming, etc.) and a newer Spark release are in 2.4 (among other goodies). Also, upgrading to a point release is normally easier than jumping across major releases. But again, it depends on your needs.
"Any issue with this configuration. Also do we need the client on all the machines "
In most cases, no (a Sqoop action with Hive in Oozie needs the Hive clients on all nodes, but that is an exception).
Regarding your node distribution:
How many DataNodes are you planning? I see 5 different node types, so I assume you want 3 master nodes, one edge node, and 3 DataNodes?
That may make sense if you plan to grow the cluster later, but if you want to get the maximum amount of work done, I would rather go with 2 master nodes, perhaps even reusing one of them as the edge node, and have a decent number of DataNodes. It obviously depends on your server size as well, but big modern servers with 12+ cores and 256 GB of RAM can host an awful lot of master components at the same time without creating a bottleneck. Others may disagree with me here, but I once ran a cluster with 7 DataNodes plus 1 combined master+edge node (I didn't design it), and it worked fine as long as you do not expect constant uptime: colocating that many services increases the chance of something going wrong and of a single server reboot bringing down the whole cluster, so it is nothing you would do for a mission-critical system that cannot go down. If you have much smaller servers, you might need more master nodes as well.
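A back-of-the-envelope memory budget shows why one big server can hold many master daemons. The heap sizes below are illustrative assumptions for a small cluster, not tuning advice:

```python
# Rough, assumed heap sizes (GB) for the master daemons discussed in this
# thread -- illustrative numbers for a small cluster, not recommendations.
heap_gb = {
    "NameNode": 8,
    "ResourceManager": 4,
    "HBase Master": 4,
    "Oozie Server": 2,
    "ZooKeeper": 1,
    "HiveServer2": 8,
    "Hive Metastore": 4,
    "Ambari Server": 4,
    "MySQL": 4,
}

node_ram_gb = 256           # the "big modern server" case from the post
total_gb = sum(heap_gb.values())
headroom_gb = node_ram_gb - total_gb

print(f"Master daemons need ~{total_gb} GB; {headroom_gb} GB left over")
```

Even with every master daemon on one 256 GB machine, the heaps add up to well under the available RAM; the real risk is the shared failure domain, not memory.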
Created 05-04-2016 07:55 PM
Thank you so much, Benjamin. We are starting with a small cluster: 2 master nodes (32 GB each), 4 DataNodes, and 1 edge node.