I am running HDP 2.4 on an AWS cluster. I am trying to decommission a DataNode, but its status has been stuck at "Decommissioning" for a long time. If I try to delete the host, the error message below is displayed. Could you please help?
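If the replication factor is greater than or equal to the current number of DataNodes and you try to decommission a DataNode, it may get stuck in the Decommissioning state, because the remaining nodes cannot hold all the replicas of its blocks.
Example: the replication factor is 3 and there are currently 3 DataNodes. If you decommission a node now, only 2 nodes are left for 3 replicas, so it may get stuck in that state.
Make sure the number of DataNodes left after the decommission is at least the replication factor.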
Reference : https://issues.apache.org/jira/browse/HDFS-1590
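The check described above can be sketched in shell. This is only a rough sketch: `enough_datanodes` is a hypothetical helper (not part of Hadoop), and it assumes `dfs.replication` reflects the replication factor actually used by your files; individual files can be written with a higher factor.

```shell
#!/usr/bin/env bash
# enough_datanodes is a hypothetical helper, not a Hadoop command.
# It prints "ok" if one node can be removed and the remaining nodes
# can still hold every replica, and "stuck" otherwise.
enough_datanodes() {
    local replication="$1" live_nodes="$2"
    if [ $((live_nodes - 1)) -ge "$replication" ]; then
        echo "ok"
    else
        echo "stuck"
    fi
}

# Only query the cluster when the hdfs client is available on this host.
if command -v hdfs >/dev/null 2>&1; then
    # Default replication factor from the client configuration.
    replication=$(hdfs getconf -confKey dfs.replication)
    # Each live DataNode is listed with a line starting with "Name:".
    live_nodes=$(hdfs dfsadmin -report -live | grep -c '^Name:')
    echo "replication=$replication live=$live_nodes -> $(enough_datanodes "$replication" "$live_nodes")"
fi
```

If this reports "stuck", either add a DataNode first or lower the replication factor of the affected files before decommissioning.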
A DataNode is a slave component of the NameNode and shouldn't sit on the same host as the standby NameNode, which is a master process. With over 24 nodes you can afford to have at least 3 master nodes (Active and Standby NameNode, Active and Standby RM, and 3 ZooKeepers).
Can you confirm that the client components aren't running?
Get Datanode Info
$ hdfs dfsadmin -getDatanodeInfo datanodex:50020
Trigger a block report for the given datanode
$ hdfs dfsadmin -triggerBlockReport datanodex:50020
Can you update the exclude file on both the ResourceManager and the NameNode (master node) hosts? If it doesn't exist yet, create an exclude file on both machines.
$ vi excludes
Add the address of the DataNode/slave node to be decommissioned, e.g.
Run the following command on the ResourceManager host:
$ yarn rmadmin -refreshNodes
This command re-reads yarn-site.xml, processes the exclude property (yarn.resourcemanager.nodes.exclude-path), and decommissions the listed node from YARN. From then on, the ResourceManager will not assign any work to this node.
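The exclude-file steps above could be scripted roughly like this. It is a sketch under assumptions: `add_to_exclude` is a hypothetical helper, and the file paths and host name in the usage comments are placeholders. The real paths are whatever `dfs.hosts.exclude` (hdfs-site.xml) and `yarn.resourcemanager.nodes.exclude-path` (yarn-site.xml) point to on your cluster.

```shell
#!/usr/bin/env bash
# add_to_exclude is a hypothetical helper, not a Hadoop command.
# It appends a host to an exclude file only if it is not already listed.
add_to_exclude() {
    local file="$1" host="$2"
    # Create the file if it does not exist yet.
    touch "$file"
    # -x matches the whole line, -F treats the host name as a literal string.
    if ! grep -qxF "$host" "$file"; then
        echo "$host" >> "$file"
    fi
}

# Example usage (placeholder paths and host name, adjust to your cluster):
#   add_to_exclude /etc/hadoop/conf/dfs.exclude  datanodex
#   add_to_exclude /etc/hadoop/conf/yarn.exclude datanodex
#   hdfs dfsadmin -refreshNodes   # NameNode re-reads dfs.hosts.exclude
#   yarn rmadmin -refreshNodes    # ResourceManager re-reads its exclude path
```

Running the helper twice for the same host leaves a single entry, so it is safe to re-run.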
But the tricky part is that, even if none of the DataNode components are running, deleting the host would impact your standby NameNode.
I recommend you move the standby NameNode to another host, so that this host ONLY has a DataNode and NodeManager, or at most client software like ZK, Kerberos, etc.
Unfortunately, there is no FORCE command for decommissioning in Hadoop. Once you have the host in the excludes file and you run the yarn rmadmin -refreshNodes command (and hdfs dfsadmin -refreshNodes on the HDFS side), that should trigger the decommissioning.
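Since nothing can force it, the practical option is to watch the progress. A minimal sketch, assuming the `hdfs dfsadmin -report` output contains a line of the form `Decommission Status : Decommission in progress` for each node still being drained; `count_decommissioning` is a hypothetical helper and the 60-second interval is arbitrary.

```shell
#!/usr/bin/env bash
# count_decommissioning is a hypothetical helper, not a Hadoop command.
# It reads `hdfs dfsadmin -report` output on stdin and prints how many
# nodes are still reported as decommissioning.
count_decommissioning() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'Decommission Status : Decommission in progress' || true
}

# Only poll when the hdfs client is available on this host.
if command -v hdfs >/dev/null 2>&1; then
    while true; do
        remaining=$(hdfs dfsadmin -report | count_decommissioning)
        echo "$(date): $remaining node(s) still decommissioning"
        if [ "$remaining" -eq 0 ]; then
            break
        fi
        sleep 60
    done
fi
```

If the count never drops, check the replication factor against the number of remaining DataNodes before waiting any longer.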
Having a NameNode and a DataNode on the same host (master and slave/worker respectively) is neither recommended nor good architecture. With over 24 nodes you should have planned 3 to 5 master nodes and kept strictly DataNode, NodeManager and, e.g., a ZK client on the slave (worker) nodes.
Moving the NameNode to a new node and then running the decommissioning will make your work easier and isolate your master processes from the slaves. This is the ONLY solution I see left for you.