One of my Hadoop data directories is full on every cluster instance (the same drive each time), at 100% usage.
I have deleted almost all data in HDFS with -skipTrash plus -expunge. I even tried rebooting all the boxes, but the directory is still full on every cluster member.
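For reference, the deletion was done along these lines (/some/path is a placeholder for the actual data):

>hdfs dfs -rm -r -skipTrash /some/path   # delete immediately, bypassing .Trash
>hdfs dfs -expunge                       # force-empty any remaining trash checkpoints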
When I dive into the directory structure, I can see that it is the HDFS blockpool area.
>hdfs dfs -du /
45641       /app-logs
247478401   /apps
92202       /ats
950726849   /hdp
0           /livy-recovery
0           /livy2-recovery
0           /mapred
0           /mr-history
0           /project
5922        /spark-history
0           /spark2-history
2           /system
98729320    /tmp
981081678   /user
0           /webhdfs

>hdfs dfs -df /
Filesystem      Size          Used          Available     Use%
hdfs://X:8020   412794792448  186773504950  149000060339  45%
If I go down into the data directory, I end up finding blockpool files that are not known when you try to fsck them by blockId, while others are.
>cd /hadoop/hdfs/data/current/BP-1356934633-X.X.X.X-1513618933915/current/finalized/subdir0/subdir150/
>ls
blk_1073780387             blk_1073780392             blk_1073780395             blk_1073780463             blk_1073780475
blk_1073780387_39569.meta  blk_1073780392_39574.meta  blk_1073780395_39577.meta  blk_1073780463_39645.meta  blk_1073780475_39657.meta
>hdfs fsck -locations -files -blockId blk_1073780463
Connecting to namenode via http://X.X.X.X:50070/fsck?ugi=hdfs&locations=1&files=1&blockId=blk_1073780463+&path=%2F
FSCK started by hdfs (auth:X) from /X.X.X.X at Mon Jan 22 14:30:02 GMT 2018
Block blk_1073780463 does not exist
>
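A minimal sketch to list which block files in one of these subdirectories are unknown to the namenode, using the path from above and simply grepping for the "does not exist" line shown in the fsck output (a diagnostic aid, not an authoritative orphan list):

#!/bin/bash
# Loop over the block files in one subdir and ask the namenode about each.
# Blocks reported as "does not exist" are candidates for the orphaned
# on-disk blocks described above; .meta files are skipped.
cd /hadoop/hdfs/data/current/BP-1356934633-X.X.X.X-1513618933915/current/finalized/subdir0/subdir150/
for blk in blk_*; do
  case "$blk" in *.meta) continue ;; esac
  if hdfs fsck -blockId "$blk" 2>/dev/null | grep -q "does not exist"; then
    echo "orphaned: $blk"
  fi
done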
Has anyone ever seen something like this? It sounds as if the file is deleted in the namenode but not on the filesystem. Is there a command to run to check that integrity, and/or can I delete any blk_nnnnn file that is not known when doing fsck?
Thanks in advance for your help.
In case it is useful for others:
The HDFS got corrupted at some stage. I ran an fsck -delete but ended up in an unstable situation: the given directory became totally full on every node. This is related to the block scanner, a facility that scans all blocks and performs the necessary verification.
By default this only occurs every 3 weeks (dfs.datanode.scan.period.hours defaults to 504), because the scan is disk- and IO-intensive.
So to reclaim those blockpool files you have to trigger the block scanner, which is not possible through the command line.
One option is to set dfs.datanode.scan.period.hours to 1.
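For example, in hdfs-site.xml (a sketch; a 1-hour scan period is expensive, so consider reverting to the default once the space is reclaimed):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>1</value>
  <!-- default is 504 (3 weeks); 1 makes the scanner rerun much sooner -->
</property>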
You may also consider deleting the scanner.cursor files (rm -rf `locate scanner.cursor`) and then restarting the datanode, as sketched below.
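Something along these lines (a sketch: find over the data dir avoids relying on a stale locate database, /hadoop/hdfs/data is the data dir from the question, and the restart command varies by distribution and management tooling):

# remove the block scanner cursor files so the scanner starts from scratch
>find /hadoop/hdfs/data -name 'scanner.cursor*' -delete
# restart the datanode (Hadoop 2.x style; adjust for your distro)
>su - hdfs -c "hadoop-daemon.sh stop datanode"
>su - hdfs -c "hadoop-daemon.sh start datanode"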