We are using automated procedures to decommission hosts and delete them from the cluster when they have hardware issues.
In the following post, Cloudera recommends using the API's decommission command and its wait() call to make sure the DataNode and TaskTracker are fully decommissioned.
But in our environment, the decommission commands sometimes (especially for DataNodes) take too long to complete. Most of the time I had to abort the decommissioning command from the CM UI. On the other hand, the role's commission status had already become "Decommissioned".
My question is: once the DataNode or TaskTracker role's commission state becomes "Decommissioned", can I assume the role has been decommissioned successfully? If so, I want to kill the running decommission command so I can go ahead and delete the host from CM.
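For context, our automation follows the pattern described above: issue the decommission command, then wait for it to finish with a timeout before deleting the host. A minimal sketch of that wait-with-timeout step (the `check_done` callable stands in for the CM API command-status query; all names here are illustrative, not our exact script):

```python
import time

def wait_for_command(check_done, timeout_s, poll_interval_s=1.0):
    """Poll check_done() until it returns True or the timeout expires.

    Returns True if the command completed within the timeout, False
    otherwise. In the real script, check_done would query the CM API
    for the decommission command's completion status.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_done():
            return True
        time.sleep(poll_interval_s)
    return False

# Example with a fake status check that reports done on the third poll.
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for_command(fake_check, timeout_s=10, poll_interval_s=0.01))
```

The timeout is the important part: without it, a decommission that can never finish (see the answer below about single-replica blocks) would block the automation indefinitely.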
When CM decommissions a DataNode, the decommission itself happens quickly, but afterwards CM waits until the blocks in HDFS are replicated back up to their minimum replication level.
1. That replication can itself be a time-consuming process.
2. If some blocks on the currently decommissioning DataNode have only one replica, and that replica lives on the DataNode being decommissioned, they cannot be replicated anywhere else. This makes the process even longer, because CM has to time out waiting for a replication that can never finish.
If you want to be safe, wait until the command finishes before doing anything destructive to the host. If you later find that some data is missing from HDFS, it could still be there on that host.
Does this make sense?
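Before aborting, you can check whether single-replica or missing blocks are the likely cause by looking at the summary from `hdfs fsck /`. A rough sketch of pulling the relevant counts out of that summary (the label wording shown is an assumption based on typical fsck output and can vary between Hadoop versions):

```python
import re

def parse_fsck_summary(text):
    """Extract under-replicated and missing block counts from an
    `hdfs fsck /` summary. The exact labels are assumed from typical
    fsck output; treat this as illustrative."""
    counts = {}
    for label, key in (("Under-replicated blocks", "under_replicated"),
                       ("Missing blocks", "missing")):
        m = re.search(re.escape(label) + r":\s*(\d+)", text)
        counts[key] = int(m.group(1)) if m else None
    return counts

# Sample summary text in the assumed format.
sample = """
 Total blocks (validated):      1024
 Minimally replicated blocks:   1024 (100.0 %)
 Under-replicated blocks:       3 (0.29 %)
 Missing blocks:                0
"""
print(parse_fsck_summary(sample))  # {'under_replicated': 3, 'missing': 0}
```

If the under-replicated count never drops while the command runs, that is a sign the remaining replicas cannot be placed anywhere and the wait will only end at the timeout.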
I still think the decommission command hung for some reason. It took more than 24 hours to run and I had to abort it.
At the same time, I saw the NameNode report that node as decommissioned, and there were no missing blocks.
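To double-check the NameNode's view independently of CM, I look at `hdfs dfsadmin -report`. A sketch of extracting each node's decommission status from that report (the field layout shown is assumed from typical report output and may differ by Hadoop version):

```python
import re

def decommission_status(report_text):
    """Map each DataNode hostname to its 'Decommission Status' field in
    `hdfs dfsadmin -report` output. The layout is assumed from typical
    reports; treat this as illustrative."""
    statuses = {}
    host = None
    for line in report_text.splitlines():
        m = re.match(r"\s*Hostname:\s*(\S+)", line)
        if m:
            host = m.group(1)
        m = re.match(r"\s*Decommission Status\s*:\s*(\S+)", line)
        if m and host:
            statuses[host] = m.group(1)
    return statuses

# Sample report fragment in the assumed format; hostname is illustrative.
sample = """
Name: 10.0.0.42:50010 (node42.example.com)
Hostname: node42.example.com
Decommission Status : Decommissioned
Configured Capacity: 1000000000 (1 GB)
"""
print(decommission_status(sample))  # {'node42.example.com': 'Decommissioned'}
```

When the NameNode already shows "Decommissioned" and fsck reports no missing blocks, the data side looks safe; the open question is only why the CM command itself keeps running.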