How to recover from CORRUPT HDFS state

New Contributor

We have a Hadoop cluster with 3 login nodes and 10 data nodes, running Hadoop 2.7.1 with HBase 0.94.23; both Hadoop and HBase run on ln02 (login node 2). We have recently run into a terrible issue with this cluster: a lot of files in HDFS are in a corrupt state, and we are unable to figure out what caused this mass corruption or how to recover from it. HDFS holds 40 TB of data, and we are worried that we might have to rebuild the cluster from scratch because of these errors. The cluster had some file system issues recently; below is the list of events that took place before the corruption appeared.

  • Nov 30 - The SSD drives on the ln02 node died, which triggered a kernel panic and a reboot.
  • Dec 20 - The ln02 file system went read-only and both drives on ln02 died. The sys admin removed and reinstalled the SSD drives on ln02 and rebooted, and the node came back up. One data node was also down on the same day due to a disk failure.
  • Dec 21 - The same thing happened as on Dec 20 and ln02 was rebooted again. The sys admin replaced the failed SSD with another SSD. Another data node went down on the same day.

On Nov 30 and Dec 20, after the sys admin rebooted the node, I was able to restart Hadoop and HBase without any issue and everything worked as expected. But on Dec 21, when I restarted Hadoop, it automatically switched into safe mode, and the hadoop fsck command showed a lot of corrupt and missing files. The output of fsck is below.

10921-fsck.png
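
For anyone hitting the same state, the overall damage can be enumerated with something like the following (a rough sketch, assuming a stock Hadoop 2.7.x client on the namenode host):

hdfs fsck /
hdfs fsck / -list-corruptfileblocks

The first command prints the overall health summary; the second prints just the files that currently have corrupt or missing blocks.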

The HDFS web UI shows the message below.

Safe mode is ON. The reported blocks 391254 needs additional 412774 blocks to reach the threshold 0.9990 of total blocks 804832. The number of live datanodes 10 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

We experienced some data nodes showing Input/output errors intermittently as well.
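
For what it is worth, the datanode and safe mode state as seen from the namenode can be checked with something like this (a sketch, assuming the hdfs client is on the PATH of the namenode host):

hdfs dfsadmin -report
hdfs dfsadmin -safemode get

The -report output lists each datanode with its state and capacity, which helps confirm whether the intermittent I/O errors correspond to dead or failing volumes; -safemode get confirms whether the namenode is still in safe mode.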

Has anyone experienced such a situation before? Any idea on how to recover from this is greatly appreciated.

3 REPLIES

Cloudera Employee

It seems like the blocks that are shown as missing were on the disks that went bad. Can you please provide the Namenode logs? Also, for one of the files that shows up as missing in the fsck output, can you please run the following:

hdfs fsck <path to the file> -files -blocks -locations -racks
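
For example, taking one of the paths that fsck reports as missing (the path below is only a placeholder; substitute a real one from your fsck output):

hdfs fsck /hbase/table/region/file -files -blocks -locations -racks

That will show, for that one file, which blocks exist, which replicas (if any) are still reported, and which datanodes and racks they are on.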

New Contributor

Thank you. Here is the output of fsck.

10929-fsck2.png

The Namenode logs are attached.

hadoop-hadoop-namenode-moe-ln02out.txt

New Contributor

Based on the above information, I would first check whether the namenode (ln02) metadata is corrupt and, if it is, attempt namenode recovery (see the referenced SOF link). Corrupt parts of the metadata imply data loss, which is also reported as corrupt blocks. Then comes actually recovering the corrupt/missing blocks: this post suggests a solution where the last resort is to permanently delete the files whose blocks are missing. After that it is okay to leave safemode; a rough sketch of the commands is below. I hope this somewhat helps.
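
To make that sequence concrete, here is a rough sketch of the commands involved (options are from Hadoop 2.7.x; back up the namenode metadata directory before trying any of this):

hdfs namenode -recover
hdfs fsck / -list-corruptfileblocks
hdfs fsck / -move
hdfs fsck / -delete
hdfs dfsadmin -safemode leave

namenode -recover is run with the namenode stopped and walks through metadata recovery interactively. -move relocates files with missing blocks to /lost+found, while -delete permanently removes them and should only be used once you are sure the blocks cannot be recovered from the replaced disks. Finally, -safemode leave takes the namenode out of safe mode.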
