Created 04-30-2018 04:00 PM
We have an Ambari cluster with HDP version 2.6 (production system).
When we run the following command to identify which files have corrupt blocks:
hdfs fsck / |egrep -v '^\.+$' | grep -v replica | grep -v Replica
we get:
/localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet: CORRUPT blockpool BP-338831142-28.12.45.6-1508451686931 block blk_1097240348
/localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet: MISSING 1 blocks of total size 1192 B
/localF/STRZONEZone/intercept_by_type/2018/4/10/16/2018_4_10_16_45.parquet/part-00002-be0f80a9-2c7c-4c50-b18d-73be372acff.snappy.parquet: CORRUPT blockpool BP-338831142-28.12.45.6-1508451686931 block blk_1097240344
/localF/STRZONEZone/intercept_by_type/2018/4/10/16/2018_4_10_16_45.parquet/part-00002-be0f80a9-2c7c-4c50-b18d-73be372acff.snappy.parquet: MISSING 1 blocks of total size 1098 B
Status: CORRUPT
 Total size:    7072689634566 B (Total open files size: 293676105509 B)
 Total dirs:    32330710
 Total files:   910568034
 Total symlinks: 0 (Files currently being written: 12)
 Total blocks (validated): 10183608 (avg. block size 6254517 B) (Total open file blocks (not validated): 2200)
  ********************************
  UNDER MIN REPL'D BLOCKS: 2 (1.9345605E-5 %)
  CORRUPT FILES:  2
  MISSING BLOCKS: 2
  MISSING SIZE:   2290 B
  CORRUPT BLOCKS: 2
  ********************************
 Corrupt blocks: 2
 Number of data-nodes: 35
 Number of racks: 1
FSCK ended at Mon Apr 20 11:40:50 UTC 2018 in 241684 milliseconds

The filesystem under path '/' is CORRUPT
In this case we see:
CORRUPT FILES: 2, MISSING BLOCKS: 2
What is the right action to take? In other words, what is the solution for the corrupted blocks?
Created 04-30-2018 08:54 PM
It's important to determine how important the file is: can it simply be removed and copied back into place, or does it contain sensitive data that needs to be regenerated? If it's easy enough to just replace the file, that's the route I would take.
HDFS will attempt to recover the situation automatically. By default there are three replicas of every block in the cluster, so if HDFS detects that one replica of a block has become corrupt or damaged, it will create a new replica of that block from a known-good replica and mark the damaged one for deletion.
The known-good state is determined by checksums which are recorded alongside the block by each DataNode.
This will list the corrupt HDFS blocks:
hdfs fsck / -list-corruptfileblocks
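If you want just the file paths out of that listing (for example to feed them into further commands), a small helper can do the parsing. This is a hedged sketch, not from the thread: the exact output format of -list-corruptfileblocks can vary between Hadoop versions, and here it is assumed to be one entry per line of the form "blk_<id> /path/to/file".

```shell
# Hypothetical helper: keep only lines that start with a block id and
# print the last whitespace-separated field, which we assume is the path.
extract_corrupt_paths() {
    awk '/^blk_/ { print $NF }'
}

# On a live cluster you would pipe the real command through it:
#   hdfs fsck / -list-corruptfileblocks | extract_corrupt_paths
```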
This will delete the files containing corrupted or missing blocks (the data in those files is lost):
hdfs fsck / -delete
Once you find a file that is corrupt, run:
hdfs fsck /path/to/corrupt/file -locations -blocks -files
Use that output to determine where blocks might live. If the file is larger than your block size it might have multiple blocks.
You can use the reported block numbers to go around to the DataNodes and the NameNode logs searching for the machine or machines on which the blocks lived. Try looking for filesystem errors on those machines. Missing mount points, DataNode not running, file system reformatted/reprovisioned. If you can find a problem in that way and bring the block back online that file will be healthy again.
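The log search described above can be wrapped in a small function. A hedged sketch, assuming a typical HDP log location (the block ID and log path below are illustrative, not taken from your cluster):

```shell
# Hypothetical helper: search a log file for mentions of a block id,
# printing matching lines with their line numbers.
find_block_in_log() {
    local block_id="$1" log_file="$2"
    grep -n "$block_id" "$log_file"
}

# Example on a real DataNode (path is an assumption; HDP commonly logs
# under /var/log/hadoop/hdfs/):
#   find_block_in_log blk_1097240348 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log
```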
Lather rinse and repeat until all files are healthy or you exhaust all alternatives looking for the blocks.
Once you determine what happened and you cannot recover any more blocks, use the command below
hdfs dfs -rm /path/to/file/with/permanently/missing/blocks
to get your HDFS filesystem back to healthy, so you can start tracking new errors as they occur.
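After removing the unrecoverable files, you can confirm the filesystem reports HEALTHY again. A hedged sketch: the helper below parses fsck's summary line (the function name and approach are illustrative; on a live cluster you would run `hdfs fsck / | check_status`):

```shell
# Hypothetical helper: read fsck output on stdin and report whether the
# final status line says HEALTHY or not.
check_status() {
    if grep -q "The filesystem under path '/' is HEALTHY"; then
        echo HEALTHY
    else
        echo CORRUPT
    fi
}
```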
Created 05-01-2018 11:07 AM
Yes, that should delete the corrupt blocks; notice the space between the / and -delete. Alternatively, simply remove the affected file with the -rm option, see below:
hdfs dfs -rm /path/to/file/with/permanently/missing/blocks
To delete the first missing file from your output above (the cluster will re-balance over time, or you can run the balancer manually):
hdfs dfs -rm /localF/STRZONEZone/intercept_by_country/2018/4/10/16/2018_4_10_16_45.parquet/part-00003-8600d0e2-c6b6-49b7-89cd-ef2a2bc1dc5e.snappy.parquet
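If several files are affected, the removal can be scripted instead of typed one path at a time. A hedged sketch with a dry-run default so nothing is deleted by accident (the function name and DRY_RUN convention are assumptions, not part of HDFS):

```shell
# Hypothetical helper: read one HDFS path per line and remove each one.
# DRY_RUN defaults to 1, in which case paths are only echoed, not deleted.
remove_listed() {
    while IFS= read -r path; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would remove: $path"
        else
            hdfs dfs -rm "$path"
        fi
    done
}

# On a live cluster, after reviewing the dry-run output, you might run:
#   hdfs fsck / -list-corruptfileblocks | awk '/^blk_/{print $NF}' | DRY_RUN=0 remove_listed
```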
Hope that clarifies.
Created 05-01-2018 02:35 PM
Just to summarize (this is a production system),
the final steps are:
Is that correct?
Created 05-01-2018 07:39 PM
You can safely delete them.