
I’d like to add more data nodes to my cluster but they are of a different specification. What is the best way to approach this?

Accepted Solutions

@Anil Bagga

As @jk answered, Configuration Groups are the way to go. I will not repeat the references or the paragraph he quoted from them, but I would like to elaborate on the practical aspects. Configuration Groups are useful not only for heterogeneous infrastructure, but also when an existing homogeneous infrastructure becomes heterogeneous because of hardware failures, e.g. failed drives that make some nodes behave differently from the rest. That is another scenario for using Configuration Groups.

From the configuration point of view, your best approach is to use Configuration Groups to manage the different data node specifications. However, there is more to it. The new servers may have more and faster storage. YARN container sizing (RAM and cores) is defined globally, so if some nodes can store more data, they also need more cores and RAM to process that data at a pace similar to the other nodes. Otherwise you end up with long-running tasks on some of the nodes, which in the case of MapReduce can lead to a loss of overall performance. Also, don't forget to rebalance the data across all nodes after you add the new data nodes. Last but not least, test your applications and monitor resource use across your infrastructure. There are always ways to improve performance by improving the design of the SQL or the application so that it uses the infrastructure evenly for best parallelism.
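To make the sizing point concrete, here is a minimal Python sketch of how one might derive per-group NodeManager capacity overrides for two hardware specifications. The property names `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores` are standard YARN settings. The `NodeSpec` class, the `nodemanager_overrides` helper, the example hardware figures, and the amounts reserved for the OS and other daemons are all assumptions made for illustration, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hardware of one class of worker node (illustrative values only)."""
    ram_gb: int
    cores: int


def nodemanager_overrides(spec, reserved_ram_gb=8, reserved_cores=2):
    """Return NodeManager capacity overrides for one config group.

    The reserved amounts are an assumed allowance for the OS and other
    daemons on the node; tune them for your own cluster.
    """
    usable_ram_gb = max(spec.ram_gb - reserved_ram_gb, 1)
    usable_cores = max(spec.cores - reserved_cores, 1)
    return {
        "yarn.nodemanager.resource.memory-mb": usable_ram_gb * 1024,
        "yarn.nodemanager.resource.cpu-vcores": usable_cores,
    }


# Hypothetical specs: the existing nodes and the larger new nodes.
old_nodes = NodeSpec(ram_gb=64, cores=16)
new_nodes = NodeSpec(ram_gb=128, cores=32)

old_overrides = nodemanager_overrides(old_nodes)
new_overrides = nodemanager_overrides(new_nodes)

for name, overrides in (("existing nodes", old_overrides), ("new nodes", new_overrides)):
    print(name, overrides)
```

Each resulting dictionary would go into the configuration group that contains the matching hosts. For the rebalancing step, the HDFS balancer can be started with `hdfs balancer -threshold 10`, where the threshold is the allowed deviation in percent of disk usage between a data node and the cluster average.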


2 REPLIES

@Anil Bagga

You can use Ambari's host config groups feature, described under "Using Host Config Groups" in the Ambari user guide.

Ambari initially assigns all hosts in your cluster to one default configuration group for each service you install. For example, after deploying a three-node cluster with default configuration settings, each host belongs to one configuration group that has default configuration settings for the HDFS service. In Configs, select Manage Config Groups to create new groups, reassign hosts, and override default settings for the host components you assign to each group.

https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-user-guide/content/using_host_con...
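The behaviour described above can be sketched as a small Python model: every host starts with the service defaults, and a host placed in a custom group gets that group's overrides on top. This is only an illustration of the idea, not Ambari's API or implementation. The `effective_config` function, the group name, the host names, and the directory paths are hypothetical; `dfs.datanode.data.dir` and `dfs.datanode.failed.volumes.tolerated` are standard HDFS properties.

```python
def effective_config(host, defaults, groups):
    """Return the settings a host ends up with for one service.

    Starts from the default group's settings and applies the overrides
    of the custom group that lists the host, if there is one.
    """
    merged = dict(defaults)
    for group in groups:
        if host in group["hosts"]:
            merged.update(group["overrides"])
            break  # in this model a host is in at most one custom group
    return merged


# Default HDFS settings shared by every host (illustrative values).
hdfs_defaults = {
    "dfs.datanode.data.dir": "/grid/0/hdfs/data,/grid/1/hdfs/data",
    "dfs.datanode.failed.volumes.tolerated": "0",
}

# One custom group for the new data nodes, which have more drives.
hdfs_groups = [
    {
        "name": "datanodes-large",
        "hosts": {"dn4.example.com", "dn5.example.com"},
        "overrides": {
            "dfs.datanode.data.dir": (
                "/grid/0/hdfs/data,/grid/1/hdfs/data,"
                "/grid/2/hdfs/data,/grid/3/hdfs/data"
            ),
        },
    },
]

old_host = effective_config("dn1.example.com", hdfs_defaults, hdfs_groups)
new_host = effective_config("dn4.example.com", hdfs_defaults, hdfs_groups)
print(old_host["dfs.datanode.data.dir"])
print(new_host["dfs.datanode.data.dir"])
```

Only the overridden property differs for the hosts in the custom group. Everything else still comes from the default group, so a change to a default reaches all hosts that do not override it.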
