What are the best practices and recommendations for adding more datanodes to the large clusters in production?

1 ACCEPTED SOLUTION

@pardeep.kumar@hortonworks.com

Listing some of the ones I am aware of.

1). You can add the nodes either with Ambari Blueprints (https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-AddingHoststoanExistingCluster) or through Ambari directly. Blueprints are the much easier route.

2). After adding the DataNodes, run the HDFS Balancer during a quiet period.

3). Adjust dfs.namenode.handler.count to ln(number of DNs) * 20.

4). Adjust dfs.namenode.service.handler.count to ln(number of DNs) * 20.

Here ln is the natural logarithm and DNs are DataNodes.
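As a rough worked example of the formula in items 3 and 4, the sketch below computes the suggested value for a few cluster sizes. The function name `suggested_handler_count` is made up for illustration, and truncating the result to an integer is an assumption; treat the number as a starting point to tune, not a guaranteed optimum.

```python
import math


def suggested_handler_count(num_datanodes: int, factor: int = 20) -> int:
    """Return ln(num_datanodes) * factor, truncated to an integer.

    Hypothetical helper: the name and the truncation are assumptions,
    only the ln(N) * 20 rule comes from the recommendation above.
    """
    if num_datanodes < 1:
        raise ValueError("num_datanodes must be at least 1")
    return int(math.log(num_datanodes) * factor)


# Print the suggested value for a few example cluster sizes.
for n in (50, 200, 1000):
    print(n, suggested_handler_count(n))
```

For example, a cluster of 200 DataNodes gives ln(200) * 20, which is about 105.97, so the sketch returns 105.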

Others can add to or correct these recommendations.

2 REPLIES

The HDFS Balancer can run in the background, and the bandwidth it consumes is configurable. On a large cluster it can generally run continuously, but running it after adding new nodes is a must for a healthy system. Note that on large clusters a single convergence run can take a full day or more. That shouldn't scare you away; just let it run.
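To make the bandwidth control concrete, here is a minimal sketch that builds (but does not execute) the two commands involved: `hdfs dfsadmin -setBalancerBandwidth`, which takes a value in bytes per second, and `hdfs balancer -threshold`. The helper name `balancer_commands` and the 50 MB/s figure are assumptions chosen purely for illustration; pick values that suit your own network.

```python
def balancer_commands(bandwidth_mb_per_sec: int, threshold_pct: int = 10):
    """Build the bandwidth and balancer commands as argv lists.

    Hypothetical helper: nothing is executed here, the lists are only
    assembled so they can be reviewed or passed to subprocess.run().
    """
    # -setBalancerBandwidth expects bytes per second.
    bandwidth_bytes = bandwidth_mb_per_sec * 1024 * 1024
    set_bandwidth = [
        "hdfs", "dfsadmin", "-setBalancerBandwidth", str(bandwidth_bytes),
    ]
    # -threshold is the allowed deviation (in percent) from average utilization.
    run_balancer = ["hdfs", "balancer", "-threshold", str(threshold_pct)]
    return set_bandwidth, run_balancer


# Example: 50 MB/s per DataNode (an arbitrary illustrative value).
for cmd in balancer_commands(bandwidth_mb_per_sec=50):
    print(" ".join(cmd))
```

Building the commands as lists keeps the sketch safe to run anywhere; on a real cluster they would be run as the HDFS superuser.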

Also, some customers have reported a more stable experience when adding nodes in small batches of a few at a time rather than, for example, a full rack at once.
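The batching idea can be sketched as a simple split of the new host list into small groups, each of which would be added and allowed to settle before the next. The helper name `in_batches`, the host names, and the batch size of 4 are all hypothetical; the reply above does not prescribe a specific size.

```python
from typing import Iterator, List


def in_batches(hosts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of hosts, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


# Hypothetical host names, used only to show the grouping.
new_hosts = [f"dn{i:03d}.example.com" for i in range(1, 11)]
for batch in in_batches(new_hosts, batch_size=4):
    print(batch)
```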