I'm seeing a sporadic problem in our environment, which runs Hadoop 2.0.0-cdh4.1.2.
Every so often, a data node (not always the same one) disappears from the "Live Nodes" tab in the name node web UI.
A quick inspection at the node level shows that the HDFS process is still running on the node. There are no hard warnings or errors in the HDFS daemon log, and no hardware/OS errors are reported in /var/log/messages either.
Restarting the HDFS process on the problematic node resolves the problem right away.
Does anyone know what might be causing this? Is this a bug or is this something that we can fix?
I apologize for the delay on this one. Have you tried running fsck on HDFS during a time when one of your nodes has gone into this state? Specifically this command:
sudo -u hdfs hadoop fsck /
I'm curious whether that shows any blocks as missing or under-replicated, and whether it reports any nodes as dead. Does your NN log show anything about these DNs around the time they disappear? I have not seen this before, so I'm just suggesting where I'd start.
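
If it helps, here is a rough sketch of how the fsck output could be trimmed down to the lines mentioned above. The helper name fsck_summary is made up for illustration, and the label strings it matches are an assumption about what the fsck summary prints on this version, so adjust them if your output is worded differently.

```shell
#!/bin/sh
# Hypothetical helper: reads fsck output on stdin and keeps only the
# summary lines about replication, corruption and data node count.
# The label strings below are assumed to match this Hadoop version's
# fsck summary; an empty result just means none of them matched.
fsck_summary() {
    grep -E 'Under-replicated blocks|Missing replicas|Corrupt blocks|Number of data-nodes|The filesystem under path' || true
}

# Intended use, while a data node is missing from "Live Nodes":
#   sudo -u hdfs hadoop fsck / | fsck_summary
#   sudo -u hdfs hdfs dfsadmin -report
```

Comparing the data node count from fsck and the node list from dfsadmin -report against the number of nodes you expect should show whether the name node itself considers the node dead, or whether only the web UI view is off.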