What action is needed when fsck reports both missing blocks and corrupt blocks?

We have an Ambari cluster running HDP version 2.6 (a production system).

We ran the following command to check which files have corrupted blocks:

hdfs fsck / |egrep -v '^\.+$' | grep -v replica | grep -v Replica

we get:

/localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet: CORRUPT blockpool BP-338831142- block blk_1097240348
/localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet: MISSING 1 blocks of total size 1192 B...........................................

/localF/STRZONEZone/intercept_by_type/2018/4/10/16/2018_4_10_16_45.parquet/part-00002-be0f80a9-2c7c-4c50-b18d-73be372acff.snappy.parquet: CORRUPT blockpool BP-338831142- block blk_1097240344
/localF/STRZONEZone/intercept_by_type/2018/4/10/16/2018_4_10_16_45.parquet/part-00002-be0f80a9-2c7c-4c50-b18d-73be372acff.snappy.parquet: MISSING 1 blocks of total size 1098 B...............................................

..................................Status: CORRUPT
 Total size:7072689634566 B (Total open files size: 293676105509 B)
 Total dirs:32330710
 Total files:910568034
 Total symlinks:0 (Files currently being written: 12)
 Total blocks (validated):10183608 (avg. block size 6254517 B) (Total open file blocks (not validated): 2200)
  UNDER MIN REPL'D BLOCKS:2 (1.9345605E-5 %)
 Corrupt blocks:2
 Number of data-nodes:35
 Number of racks:1
FSCK ended at Mon Apr 20 11:40:50 UTC 2018 in 241684 milliseconds

The filesystem under path '/' is CORRUPT

In this case, where we see both CORRUPT and MISSING blocks for the same files, what is the right action to take? Is there a solution for the corrupted blocks?




@Michael Bronson

It's important to determine how important the file is: can it simply be removed and copied back into place, or does it hold sensitive data that would need to be regenerated? If it's easy enough to replace the file, that's the route I would take.

HDFS will attempt to recover the situation automatically. By default there are three replicas of every block in the cluster, so if HDFS detects that one replica of a block has become corrupt or damaged, it creates a new replica from a known-good one and marks the damaged replica for deletion.

The known-good state is determined by checksums which are recorded alongside the block by each DataNode.

This will list the corrupt HDFS blocks:

hdfs fsck / -list-corruptfileblocks
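
If you already have plain fsck output like the one in the question, the affected file paths can also be pulled out of it. The following is a minimal sketch, assuming the lines look like "<path>: CORRUPT blockpool ..." and "<path>: MISSING 1 blocks ..." as shown above; the function name affected_files is made up for illustration.

```shell
# Print the unique file paths that fsck flags as CORRUPT or MISSING.
# Reads fsck output on stdin. Assumes each flagged line starts with an
# absolute path followed by ": CORRUPT " or ": MISSING ", as in the
# output pasted in the question.
affected_files() {
  sed -n -E 's/^(\/.*): (CORRUPT|MISSING) .*$/\1/p' | sort -u
}
```

It could be used as "hdfs fsck / | affected_files". Summary lines such as "Status: CORRUPT" are skipped because they do not start with a path.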

This will delete the files that contain corrupt blocks:

hdfs fsck / -delete

Once you find a file that is corrupt, inspect its blocks and their locations:

  hdfs fsck /path/to/corrupt/file -locations -blocks -files

Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.

You can use the reported block IDs to search the DataNode and NameNode logs for the machine or machines on which the blocks lived. Look for filesystem errors on those machines: missing mount points, a DataNode that is not running, or a file system that was reformatted or reprovisioned. If you find a problem that way and bring the block back online, the file will be healthy again.
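
The log search can be sketched roughly as below. This is only an illustration: the function name logs_mentioning_block is made up, and the default log directory /var/log/hadoop/hdfs is an assumption, so point it at wherever your DataNode and NameNode logs actually live.

```shell
# List log files under a directory that mention a given block ID.
# $1: block ID as reported by fsck, e.g. blk_1097240348
# $2: log directory (assumed default; adjust to your installation)
# Note: this is a plain substring match, so a longer ID that starts
# with the same digits would also match.
logs_mentioning_block() {
  block_id="$1"
  log_dir="${2:-/var/log/hadoop/hdfs}"
  grep -rlF -- "$block_id" "$log_dir" 2>/dev/null | sort
}
```

For example, "logs_mentioning_block blk_1097240348" run on each DataNode host shows which log files to read for that block.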

Lather, rinse, and repeat until all files are healthy or you have exhausted all alternatives in looking for the blocks.

Once you have determined what happened and cannot recover any more blocks, use the command below

  hdfs dfs -rm /path/to/file/with/permanently/missing/blocks

to get your HDFS filesystem back to a healthy state so you can start tracking new errors as they occur.
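
On a production system it may be worth previewing the removals first. Here is a minimal sketch, assuming the paths to remove arrive one per line on stdin; the function name remove_files and the DRY_RUN variable are made up for this example.

```shell
# Remove each HDFS path read from stdin, one path per line.
# DRY_RUN defaults to 1, in which case the commands are only printed.
# Set DRY_RUN=0 to actually run "hdfs dfs -rm".
remove_files() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue            # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf 'hdfs dfs -rm %s\n' "$path"
    else
      hdfs dfs -rm "$path" </dev/null     # keep hdfs from consuming stdin
    fi
  done
}
```

Running "remove_files < bad_files.txt" (a hypothetical file of paths) prints the commands for review, and "DRY_RUN=0 remove_files < bad_files.txt" executes them. If trash is enabled on the cluster, -rm moves files to the trash rather than deleting them immediately.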


@Michael Bronson

Yes, that should delete the files with corrupt blocks. Notice the space between the / and -delete. Alternatively, simply remove the file with -rm, as below:

hdfs dfs -rm /path/to/file/with/permanently/missing/blocks

For example, to delete the first file with a missing block in your output above (the cluster will rebalance over time, or you can run the balancer manually):

hdfs dfs -rm /localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet

Hope that clarifies

Just to summarize (this is a production system), the final steps are:

  1. hdfs fsck / -delete
  2. If step 1 does not fix the corrupted blocks, remove the file:
     hdfs dfs -rm /localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet

Is that correct?



@Michael Bronson

Yes, you can safely delete them.
