- Since the decommission of DataNodes can take a long time if you have a lot of blocks; HDFS needs to replicate blocks belonging to decommissioning DataNodes to other live DataNodes to reach the replication factor that you have specified via dfs.replication in hdfs-site.xml. The default value of this replication factor is 3.
- You can get feedback from the decomission process e.g from NameNode UI: http://ip of_namenode:50070/dfshealth.html#tab-datanode or you can use command line tools like "hdfs fsck /"
- Cloudbreak periodically polls Ambari about the status of decomissioning and Ambari monitors the NameNode
- if the decomissioning is finished then Cloudbreak removes the node from Ambari and delete the decomissioned VMs from cloud providers
b) For Scale Up, would we need to manually kick off hdfs rebalance?
- Cloudbreak does not trigger HDFS rebalance.
c) How do you know if you have lost a block, example if you scale down 8 out of your 10 nodes, how would hdfs handle this case. Assuming you have enough storage in the 2 nodes.
- HDFS: If you do not have enough live DataNodes to reach the replication factor, decommission process would hang until more DataNodes become available (e.g., if you have 10 DataNodes in your cluster with dfs.replication is set to 3 then you are able to scale down you cluster to 3 nodes)
- Cloudbreak: if you have 10 DataNodes with replication factor of 3 then Cloudbreak don't even let you remove more than 7 instances and you get back a "Cluster downscale failed.: There is not enough node to downscale. Check the replication factor and the ApplicationMaster occupation." error message