We have an 'archive' cluster running HDFS/HBase/MR. It is used for searches of older data. It exists so we can keep our main production cluster separate and loaded with the recent data that clients need fast access to (via Solr).
The 'archive' cluster is not client facing and, due to the available hardware, is sub-optimal in a couple of ways. Firstly, the replication factor is 2. Secondly, the servers (Dell R210 II's) only contain two HDDs, and in our setup these are not RAIDed, so that we can store more data on each server.
The problem waiting to happen, of course, is that one of the disks holds both the OS and a load of HDFS data, so when that drive dies we lose that data too. We can't decommission the node tidily either, as Cloudera can no longer communicate with it.
This happened in the past week. Long story short, the 2nd (non-OS) disk in the machine was fine, so I was hoping that HDFS would incorporate its data back in once the OS drive was reinstalled (it's the same hostname, IP, etc.). However, HDFS now refuses to deal with the 2nd disk, precisely because it already had data on it when I re-added the host.
In addition, HDFS is reporting 6 corrupt blocks. I can see that 3 of them are available on the always-healthy 2nd drive of the repaired host. So, my question is twofold:
- Is there a way to get HDFS to recognise the 2nd disk as the one that was there before, and accept its contents, or
- Is there a way to move the relevant block files somewhere so that HDFS will incorporate at least those back into the cluster? That way I'd be down to just 3 corrupt blocks, which I guess I'll be deleting.
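For what it's worth, here is the kind of manual salvage I'm imagining for the second option. This is only a sketch: the block ID, block pool ID, and `/data/N/dfs/dn` paths below are placeholders based on a typical `dfs.datanode.data.dir` layout, and will differ on any real cluster.

```shell
# 1. List the files/blocks the namenode considers corrupt
hdfs fsck / -list-corruptfileblocks

# 2. On the repaired host, locate a surviving block file and its
#    .meta checksum sidecar on the healthy second disk.
#    (blk_1073741825 is a placeholder block ID from step 1.)
find /data/2/dfs/dn -name 'blk_1073741825*'

# 3. Copy both files into the matching 'finalized' subtree of a data
#    dir the datanode is actively using, preserving ownership/perms.
#    (The BP-... block pool directory name is cluster-specific.)
cp -p blk_1073741825 blk_1073741825_1001.meta \
   /data/1/dfs/dn/current/BP-XXXX/current/finalized/subdir0/subdir0/

# 4. Restart the datanode (or wait for its next full block report)
#    so the namenode learns about the recovered replicas.
service hadoop-hdfs-datanode restart

# 5. Re-check whether those blocks are still reported corrupt.
hdfs fsck / -list-corruptfileblocks
```

I don't know whether the datanode will accept block files dropped in by hand like this, or whether the placement within the `subdir` tree matters, which is partly what I'm asking.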
Given that all the datanode/regionserver hosts in this cluster have the same physical setup, I'd like to establish a process, if possible, to minimise loss through corrupt blocks whenever we lose an OS drive.
I've read the other entries on here that deal with corrupt blocks, why they occur, how to delete them etc, but I don't think there's anything that addresses this particular scenario.