We have a Hadoop cluster with 3 login nodes and 10 data nodes. We are running Hadoop 2.7.1 with HBase 0.94.23; both Hadoop and HBase run on ln02 (logging node 2). Recently we ran into a serious problem with the cluster: a large number of files in HDFS are in a corrupt state. We cannot figure out what caused this mass corruption or how to recover from it. HDFS holds 40 TB of data, and we are worried that we may have to rebuild the cluster from scratch because of these errors. The cluster had some file system issues recently. Below is the list of events that took place before that.
On Nov 30 and Dec 20, after the sysadmin rebooted the node, I was able to restart Hadoop and HBase without any issue, and everything worked as expected. But on Dec 21, when I restarted Hadoop, it automatically switched to safe mode, and the hadoop fsck command showed a lot of corrupt and missing files. The fsck output is below.
The HDFS web UI shows the message below.
Safe mode is ON. The reported blocks 391254 needs additional 412774 blocks to reach the threshold 0.9990 of total blocks 804832. The number of live datanodes 10 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
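The numbers in this message can be checked by hand, which also shows how far the cluster is from leaving safe mode: only about 48% of the blocks have been reported. The following is a minimal sketch, assuming the required block count is the total multiplied by the threshold and rounded up (the NameNode's exact rounding may differ slightly). The function names are hypothetical and the threshold is passed as an integer (0.9990 becomes 9990) so that plain shell arithmetic is enough.

```shell
#!/usr/bin/env bash
# Sanity check for the safe mode message, using integer arithmetic only.
# Assumption: required blocks = total * threshold, rounded up.

# additional_blocks TOTAL REPORTED THRESHOLD_E4
#   THRESHOLD_E4 is the threshold times 10000 (0.9990 -> 9990).
additional_blocks() {
  local total=$1 reported=$2 threshold_e4=$3
  # Ceiling division: blocks that must be reported to meet the threshold.
  local required=$(( (total * threshold_e4 + 9999) / 10000 ))
  local missing=$(( required - reported ))
  if [ "$missing" -lt 0 ]; then
    missing=0
  fi
  echo "$missing"
}

# reported_percent TOTAL REPORTED
#   Whole-number percentage of blocks reported so far (rounded down).
reported_percent() {
  local total=$1 reported=$2
  echo $(( reported * 100 / total ))
}

additional_blocks 804832 391254 9990   # prints 412774
reported_percent 804832 391254         # prints 48
```

With the values from the message, the result matches the 412774 additional blocks that the NameNode asks for, so roughly half of all blocks are currently not being reported by any data node.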
Some data nodes have also been showing intermittent Input/output errors.
Has anyone experienced a situation like this before? Any ideas on how to recover from it would be greatly appreciated.
It seems the blocks shown as missing were on the disks that went bad. Can you please provide the NameNode logs? Also, for one of the files that fsck reports as missing, can you please run the following:
hdfs fsck <path to the file> -files -blocks -locations -racks
Based on the above information, I think the first step is to check whether the NameNode (ln02) metadata is corrupt and, if it is, to try a NameNode recovery (see the referenced Stack Overflow link). Corrupt metadata implies data loss, which is also reported as corrupt blocks. The second step is to actually recover the corrupt or missing blocks: this post suggests a solution in which the last option is to permanently delete the files whose blocks are missing. After that, leaving safe mode is fine. I hope this helps somewhat.
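The steps above can be sketched as a small dry-run script. This is only an illustration under stated assumptions, not a tested recovery procedure: the helper names (run, diagnose, recover_metadata, last_resort) and the paths passed to them are hypothetical, and by default every command is printed rather than executed. One caveat: HDFS rejects namespace changes while safe mode is on, so in this sketch safe mode is left before the delete step.

```shell
#!/usr/bin/env bash
# Dry-run sketch: each command is only printed unless RUN=1 is set.
# Assumptions: hdfs is on PATH and the script runs as the HDFS superuser.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Read-only checks: safe mode state, corrupt file list, details for one file.
diagnose() {
  local file=$1   # hypothetical: one file that fsck reports as missing
  run hdfs dfsadmin -safemode get
  run hdfs fsck / -list-corruptfileblocks
  run hdfs fsck "$file" -files -blocks -locations -racks
}

# NameNode metadata recovery. Run only with the NameNode stopped and
# after backing up the metadata directories, since it can discard edits.
recover_metadata() {
  run hdfs namenode -recover
}

# Last resort, destructive: permanently deletes files with missing blocks.
last_resort() {
  local dir=$1    # hypothetical: directory to clean up
  run hdfs dfsadmin -safemode leave
  run hdfs fsck "$dir" -delete
}
```

Calling diagnose with a file path prints the three inspection commands so they can be reviewed first. Since the missing blocks seem to be on the disks that went bad, it is worth checking whether those disks can be brought back before running last_resort with RUN=1.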