What are the best practices and recommendations for adding more DataNodes to large production clusters?

1 ACCEPTED SOLUTION

@pardeep.kumar@hortonworks.com

Here are the ones I am aware of.

1). You could add the nodes either with Ambari Blueprints (https://cwiki.apache.org/confluence/display/AMBARI/Blueprints#Blueprints-AddingHoststoanExistingCluster) or through Ambari directly. Using a Blueprint is much easier.

2). After adding the DataNodes, run the HDFS Balancer during a quiet period.

3). Set dfs.namenode.handler.count to ln(number of DataNodes) * 20.

4). Set dfs.namenode.service.handler.count to ln(number of DataNodes) * 20.

Here ln is the natural logarithm.
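
To make the formula in items 3 and 4 concrete, here is a minimal Python sketch. Rounding down and the lower bound of 10 are my assumptions (the recommendation does not say how to round or what to do on very small clusters), and the function name is hypothetical.

```python
import math


def recommended_handler_count(num_datanodes: int, minimum: int = 10) -> int:
    """Return ln(num_datanodes) * 20, rounded down, but never below `minimum`.

    Rounding down and the `minimum` lower bound are assumptions, not part of
    the original recommendation.
    """
    if num_datanodes < 1:
        raise ValueError("num_datanodes must be at least 1")
    return max(minimum, int(math.log(num_datanodes) * 20))


if __name__ == "__main__":
    # Print the suggested value for a few illustrative cluster sizes.
    for n in (50, 200, 1000):
        print(f"{n} DataNodes -> handler count {recommended_handler_count(n)}")
```

The same value would apply to both dfs.namenode.handler.count and dfs.namenode.service.handler.count under this recommendation.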

Others can add to or correct these recommendations.

2 REPLIES

The HDFS Balancer can run in the background, and the bandwidth it consumes is configurable. On a large cluster it can generally run continuously, but running it after adding new nodes is a must for keeping the system healthy. Note that on large clusters a single convergence run can take a full day or more. That shouldn't scare you away; let it run.
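
As a rough sketch of that workflow, the Python snippet below builds the two commands usually involved: `hdfs dfsadmin -setBalancerBandwidth`, which caps the balancing bandwidth per DataNode in bytes per second, and `hdfs balancer -threshold`, which starts a balancing run. The 50 MB/s cap and the 10 percent threshold are illustrative values only, and the helper name is hypothetical. The script just prints the commands, so nothing is executed against a cluster.

```python
import shlex
from typing import List


def balancer_commands(bandwidth_mb_per_sec: int, threshold_pct: int) -> List[List[str]]:
    """Build the argv lists for capping balancer bandwidth and starting a run.

    The bandwidth is converted from MB/s to bytes per second, which is the
    unit -setBalancerBandwidth expects.
    """
    if bandwidth_mb_per_sec < 1:
        raise ValueError("bandwidth_mb_per_sec must be at least 1")
    bandwidth_bytes = bandwidth_mb_per_sec * 1024 * 1024
    return [
        ["hdfs", "dfsadmin", "-setBalancerBandwidth", str(bandwidth_bytes)],
        ["hdfs", "balancer", "-threshold", str(threshold_pct)],
    ]


if __name__ == "__main__":
    # Illustrative values: 50 MB/s per DataNode, 10% utilization threshold.
    for argv in balancer_commands(50, 10):
        print(shlex.join(argv))
```

A lower bandwidth cap keeps the balancer out of the way of production jobs at the cost of a longer convergence run.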

Also, some customers have reported a more stable experience when adding nodes in small batches of a few at a time instead of, for example, adding a full rack at once.
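
One way to script the small-batch approach is sketched below in Python. It splits a host list into batches and builds one request body per batch. The payload shape (a list of objects with `blueprint`, `host_group` and `host_name`, posted to the cluster's `hosts` endpoint) is my reading of the Ambari Blueprints wiki page on adding hosts to an existing cluster and should be treated as an assumption to verify there. The host names, blueprint name, host group, batch size and function names are all hypothetical.

```python
import json
from typing import Dict, Iterator, List


def in_batches(hosts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `hosts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(hosts), batch_size):
        yield hosts[start:start + batch_size]


def add_hosts_payload(batch: List[str], blueprint: str, host_group: str) -> List[Dict[str, str]]:
    """Build the request body for one batch (assumed shape, see the note above)."""
    return [
        {"blueprint": blueprint, "host_group": host_group, "host_name": host}
        for host in batch
    ]


if __name__ == "__main__":
    # Hypothetical host names for one new rack of 10 DataNodes.
    new_hosts = [f"dn{i:02d}.example.com" for i in range(1, 11)]
    for batch in in_batches(new_hosts, batch_size=3):
        # Print the body only; send it and wait for the batch to settle
        # before moving on to the next one.
        print(json.dumps(add_hosts_payload(batch, "my-blueprint", "workers"), indent=2))
```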