Created 02-15-2017 07:34 AM
On HDFS 0.20.2 (yes, it's old), two datanodes in our prod cluster can no longer start up.
The namenode says:
2017-02-15 09:24:52,861 FATAL org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
2017-02-15 09:24:52,862 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 9000, call register(DatanodeRegistration(cernsrchhadoop504.cernerasp.com:50010, storageID=DS-1574636665-44.128.6.253-50010-1461251397876, infoPort=50075, ipcPort=50020)) from 44.128.6.253:51326: error: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
The kicker, though, is that it's saying datanode cernsrchhadoop504 can't serve that storage because it's expected to be served by 44.128.6.253, which is actually cernsrchhadoop504.
From the namenode:
root@cernsrchhadoop388.cernerasp.com:~ ( cernsrchhadoop388.cernerasp.com ) 09:28:10 $ nslookup 44.128.6.253
Server:         127.0.0.1
Address:        127.0.0.1#53

Non-authoritative answer:
253.6.128.44.in-addr.arpa       name = cernsrchhadoop504.cernerasp.com.
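(For anyone comparing against their own cluster: it's worth running the forward lookup too, to confirm both directions resolve consistently; the hostname below is just our affected node.)

$ nslookup cernsrchhadoop504.cernerasp.com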
The datanode logs on 504 are saying something similar:
2017-02-15 09:24:52,866 ERROR datanode.DataNode (DataNode.java:main(1372)) - org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
So, the question: how can I get the namenode to realize that the node it expects to have that storage is the same node that's attempting to serve it?
Created 02-15-2017 07:43 AM
Also, to go over what we've attempted: we've cycled the datanode (or at least tried to), rebooted the node, and, since we found HDFS-1106 where someone hit the same issue, ran a refresh, but we still can't get it to start.
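For the record, the refresh was the namenode host-list refresh (assuming the standard 0.20-era admin CLI, this is the command we ran):

$ hadoop dfsadmin -refreshNodes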
Created 02-15-2017 11:57 AM
Turned out the nodes were in the exclude file, just not in one named host.exclude like we use in CDH5, so it was missed.
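For anyone who hits this later: the file isn't necessarily named host.exclude; it's whatever the dfs.hosts.exclude property points at in the namenode's config. A quick way to check (the config path below is an example for a typical layout, adjust to your install):

# Find the exclude file the namenode is actually using
$ grep -A1 dfs.hosts.exclude /etc/hadoop/conf/hdfs-site.xml

# Check whether the affected datanodes are listed in it
$ grep cernsrchhadoop504 /path/to/that/exclude/file

# After removing the entries, have the namenode re-read the host lists
$ hadoop dfsadmin -refreshNodes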