We have a cluster that we use for searches of older data. It is sometimes used in a production context by us, but it is not client facing and we don't consider it a full production cluster. Therefore the replication factor is 2, and the servers we use only contain 2 hard drives.
The cluster only runs HDFS, HBase and MRv1.
The hard drives are not RAIDed, so the setup is: the OS on one drive along with /data/disk1, while the second drive holds only /data/disk2, mounted at that directory on the first drive's filesystem.
Inevitably, we had a crash on an OS disk last week. Actually, the disk went read-only, which prevented us from decommissioning the node in the usual way. We had no choice but to just switch it off, replace the drive, and reinstall the OS.
Which is what we did. We then rejoined the node to the cluster (we had previously had to completely remove it from CM). So the data on disk2, which did not fail and did not have the OS installed, was untouched and theoretically available.
However, HDFS does not like that disk, and states that the node has a volume failure. Also, HDFS is reporting 6 corrupt blocks. By using the 'find ./ -type f -iname <blk_id>' command, I have found that 3 of the 6 blocks are indeed present on the healthy (but unrecognised) drive.
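For anyone wanting to repeat that search for several blocks at once, here is a minimal Python sketch of the same idea as the find command. It assumes the usual datanode naming, where a block's data file is called blk_<id> and its checksum file is blk_<id>_<genstamp>.meta; the function name find_block_files and the example path and IDs are my own placeholders, not anything HDFS provides. It only reads the directory tree and does not modify anything.

```python
import os
import re
from collections import defaultdict

# Assumed naming: data file "blk_<id>", checksum file "blk_<id>_<genstamp>.meta".
# Older block IDs can be negative, hence the optional minus sign.
BLOCK_FILE = re.compile(r"^(blk_-?\d+)(?:_\d+\.meta)?$")


def find_block_files(root, block_ids):
    """Walk root and return {block_id: sorted list of matching file paths}.

    Block IDs with no matching file under root are left out of the result.
    """
    wanted = set(block_ids)
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            match = BLOCK_FILE.match(name)
            # group(1) is the bare "blk_<id>" part, for data and .meta files alike
            if match and match.group(1) in wanted:
                found[match.group(1)].append(os.path.join(dirpath, name))
    return {blk: sorted(paths) for blk, paths in found.items()}
```

Calling something like find_block_files("/data/disk2", ["blk_123", "blk_456"]) (placeholder IDs) would show, for each corrupt block, whether both the data file and its .meta file survived on the healthy drive, which seems worth knowing before attempting any kind of restore.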
So I guess my question is two-fold: in this situation, is there any way to get HDFS to recognise the healthy drive as having once belonged to the cluster, and to process its blocks accordingly? If not, is there any way I can restore the blk_id files from that drive somewhere in HDFS, so that I can recover 50% of the missing blocks?
Given that every datanode in the cluster is set up the same way (i.e. with a single point of failure on the OS disk), I would like to have a process to minimise data loss when the inevitable happens.