We currently have a cluster with 40 data nodes, 4 master nodes, and 1 edge node. Because of our large data size, we are thinking of adding one more edge node to get better performance. Will this be beneficial for us? And if so, by what percentage will efficiency increase? Can anyone please guide us? Thanks!
Edge/gateway nodes are usually the interface between the Hadoop cluster and the outside network, and they usually have client software installed.
Unless you have assigned it another function on your cluster, I can't see how to gauge the % gain in efficiency. You could, for example, restrict a group of developers to accessing the cluster through a certain edge node. That gives you some control over security, but I am not aware of any other efficiency gain.
Thanks @Geoffrey Shelton Okot for responding to my question. By efficiency I mean this: in the 1 edge node scenario, if I execute 2 Hive queries, they take (let's say) 50 min and 45 min. If I have 2 edge nodes and execute one query on each of them, will they take less than 50 min and 45 min?
Edge nodes are often used as staging areas for data being transferred into the Hadoop cluster. As such, Oozie, Pig, Sqoop, and management tools such as Hue and Ambari run well there. An edge node doesn't store any HDFS data; it is used for accessing the cluster and processing/accessing the data. That way, on the actual cluster nodes you can devote the full resources of each node to filesystem I/O, MapReduce processing, HBase, etc.
If you really want to boost your cluster performance, you should think of adding more data nodes (worker nodes), because that's where the processing happens. This document contributed by @smanjee gives performance stats and gains that you can use as an argument.
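To put rough numbers on that for the 50-minute query from the question, here is a toy sketch in Python. It is not a benchmark: the function name, the 2000 node-minutes of work, and the assumption of perfectly linear scaling across data nodes are all made up for illustration, and a real cluster will scale less cleanly.

```python
def estimated_runtime_minutes(work_node_minutes, data_nodes, edge_nodes=1):
    """Toy estimate of query runtime (hypothetical model, not a Hadoop API).

    Assumes the work is spread evenly over the data nodes, which is an
    idealisation. edge_nodes is accepted but deliberately ignored: the edge
    node only submits the job, it does not execute any of it.
    """
    if data_nodes <= 0:
        raise ValueError("data_nodes must be positive")
    if edge_nodes <= 0:
        raise ValueError("edge_nodes must be positive")
    return work_node_minutes / data_nodes


# Assumed figure: 2000 node-minutes, chosen so 40 data nodes give 50 minutes.
work = 2000

current = estimated_runtime_minutes(work, data_nodes=40, edge_nodes=1)
extra_edge = estimated_runtime_minutes(work, data_nodes=40, edge_nodes=2)
extra_workers = estimated_runtime_minutes(work, data_nodes=50, edge_nodes=1)

print(current)        # 50.0
print(extra_edge)     # 50.0, a second edge node changes nothing in this model
print(extra_workers)  # 40.0, ten more data nodes shorten the run
```

In this simplified picture the second edge node only gives you another place to submit queries from, while the extra data nodes are what shorten the run.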
Hope that helps