I'm getting a warning about blocks with corrupt replicas when using Ambari or the hdfs dfsadmin -report command.
$ hdfs dfsadmin -report
Configured Capacity: 89582347079680 (81.47 TB)
Present Capacity: 84363014526231 (76.73 TB)
DFS Remaining: 46423774937044 (42.22 TB)
DFS Used: 37939239589187 (34.51 TB)
DFS Used%: 44.97%
Under replicated blocks: 0
Blocks with corrupt replicas: 8
Missing blocks: 0
Missing blocks (with replication factor 1): 0
But when I try to locate them with the hdfs fsck / command, it doesn't find anything wrong.
Status: HEALTHY
Total size: 11520245257654 B (Total open files size: 912343310015 B)
Total dirs: 1269459
Total files: 1035091
Total symlinks: 0 (Files currently being written: 118)
Total blocks (validated): 1071649 (avg. block size 10750017 B) (Total open file blocks (not validated): 6893)
Minimally replicated blocks: 1071649 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Wed Jun 13 16:29:25 RET 2018 in 25763 milliseconds

The filesystem under path '/' is HEALTHY
How can I find those corrupt replicas and fix them?
The NameNode log also tells me everything is OK, but I've already run into issues with Spark jobs that read those files.
2018-06-13 16:30:25,870 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1660)) - Blocks chosen but could not be replicated = 8; of which 0 have no target, 0 have no source, 0 are UC, 0 are abandoned, 8 already have enough replicas.
Have you tried the commands below? Look through the output for missing or corrupt blocks (ignore under-replicated blocks for now).
$ hdfs fsck / | egrep -v '^\.+$' | grep -v replica
Once you find a file that is corrupt, inspect its blocks and their locations:
$ hdfs fsck /path/to/corrupt/file -locations -blocks -files
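If fsck does report something, the two steps above can be chained. This is only a rough sketch: corrupt_paths is a hypothetical helper name, and it assumes fsck prints problem lines in the form "<path>: CORRUPT ..." or "<path>: MISSING ...", which may differ between Hadoop versions.

```shell
# Hypothetical helper: read fsck output on stdin and print each affected path once.
# Assumption: problem lines look like "<path>: CORRUPT ..." or "<path>: MISSING ...".
corrupt_paths() {
    grep -E ': (CORRUPT|MISSING) ' \
        | sed -E 's/: (CORRUPT|MISSING) .*$//' \
        | sort -u
}

# Usage (commented out so the snippet can be sourced without a cluster):
# hdfs fsck / | corrupt_paths | while IFS= read -r f; do
#     hdfs fsck "$f" -locations -blocks -files
# done
```

If the helper prints nothing, fsck did not flag any file, which matches the HEALTHY report in the question.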
Yes, I already tried that, but with no results.
After some time, Hadoop ends up correcting the issue by itself, but I'm trying to understand why the two commands differ; they should return the same diagnosis.
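For what it's worth, the two counters probably don't measure the same thing. As far as I know, "Blocks with corrupt replicas" in dfsadmin -report counts blocks that have at least one bad replica even when enough healthy copies remain, while fsck only marks a block as corrupt when no healthy replica is left. That would fit the log line above saying the 8 blocks "already have enough replicas". A minimal sketch for pulling both counters so they can be watched side by side; counter_value is a hypothetical helper, not part of Hadoop:

```shell
# Hypothetical helper: print the first value of a "Label: value" line read from stdin.
# $1 is the exact label, e.g. "Blocks with corrupt replicas". Prints nothing if absent.
counter_value() {
    awk -F':[ \t]*' -v label="$1" '
        { key = $1; sub(/^[ \t]+/, "", key) }   # ignore leading indentation
        !found && key == label {
            split($2, parts, /[ \t]+/)           # keep only the first token of the value
            print parts[1]
            found = 1
        }
    '
}

# Usage (commented out so the snippet can be sourced without a cluster):
# hdfs dfsadmin -report | counter_value "Blocks with corrupt replicas"
# hdfs fsck /           | counter_value "Corrupt blocks"
```

If the first number is non-zero while the second stays at 0, the affected blocks still have good copies, and the NameNode should eventually drop or re-replicate the bad ones on its own.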