HDFS block count does not decrease after deleting data
Labels: HDFS
Created 05-30-2022 08:36 AM
Hi,
after having deleted terabytes of data from HDFS (about 1/4 of the total capacity), the block count on the DataNodes did not decrease as expected. It is still over the critical threshold.
How can this be resolved?
Thank you
Created 06-08-2022 01:04 AM
Please remember that one block is not necessarily 256 MB; it can be smaller. Also, not all files have a replication factor of 3; some might have only one replica, so it can be completely fine if those were all single-replica files.
600,000 * 256 MB = 153.6 TB as a maximum, but since blocks can be smaller than 256 MB, the 60 TB that was freed up is reasonable.
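If in doubt, the actual block count and total size can be cross-checked with an fsck summary (a rough sketch; the grep pattern assumes the standard fsck summary lines):
hdfs fsck / | grep -E 'Total size|Total blocks'
Dividing the reported total size by the block count gives the average block size, which in practice is usually well below the configured 256 MB maximum.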
Created 05-30-2022 10:37 AM
Hello @andrea_pretotto ,
This typically happens if you have snapshots on the system. Even though the "current" files are deleted from HDFS, they may still be held by one or more snapshots (which is exactly what makes snapshots useful against accidental data deletion, as you can recover data from them if needed).
Please check which HDFS directories are snapshottable:
hdfs lsSnapshottableDir
and then check how many snapshots you have under those directories:
hdfs dfs -ls /snapshottable_path/.snapshot
You can probably also verify it by checking the output of "du", which includes the snapshots' sizes:
hdfs dfs -du -h -v -s /snapshottable_path
versus the same command with snapshots excluded from the calculation:
hdfs dfs -du -x -h -v -s /snapshottable_path
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#du
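If stale snapshots are found and no longer needed, they can be removed so that the blocks they still reference become eligible for deletion (the path and snapshot name below are placeholders):
hdfs dfs -deleteSnapshot /snapshottable_path snapshot_name
Once no snapshot references the deleted files anymore, the NameNode can schedule their blocks for removal on the DataNodes.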
Best regards
Miklos
Customer Operations Engineer, Cloudera
Created 05-31-2022 12:38 AM
Hi Miklos,
thank you for the detailed answer.
I found that the parent of the directory I removed has snapshots enabled, but there are no snapshots.
The command:
hdfs dfs -du -x -h -v -s /snapshottable_path
returns no lines.
Also, the output of "du" is the same.
Should I disable snapshots on the parent directory? Are there other configurations I should apply?
Thank you again.
Created 05-31-2022 01:08 AM
Hi, the "hdfs dfs -du" for that path should return a summary of the disk usage (bytes, kilobytes, megabytes, etc.) for that given path. Are you sure there are "no lines returned"? Have you checked the "du" output for a smaller sub-path (one with fewer files underneath)? Does that return results?
Can you also clarify where you checked the block count before and after the deletion? ("the block count among data nodes did not decrease as expected")
Created 05-31-2022 02:09 AM
Hi Miklos,
sorry for the typo; I executed the command
hdfs dfs -ls /snapshottable_path/.snapshot
and got no output for that directory.
The "du" commands ("du -x -h" and "du -h") report the same size.
When I click on the block count alerts on the HDFS service, I can see the number of blocks, which does not decrease.
The DataNode has 8,743,931 blocks. Critical threshold: 8,000,000 block(s).
Thank you again.
Created 05-31-2022 07:33 AM
Hi Andrea,
Oh, I see, I did not consider that you are seeing this from the DataNodes' perspective. Was this cluster recently upgraded? Is the "Finalize Upgrade" step for HDFS still pending?
While an HDFS upgrade is not finalized, the DataNodes keep all the previous blocks (including blocks deleted after the upgrade) in case a "rollback" is needed.
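If an unfinalized upgrade is indeed the cause, the upgrade state can be queried and, once a rollback is no longer needed, finalized. Which of the two commands applies depends on whether it was a rolling upgrade or a regular one:
hdfs dfsadmin -rollingUpgrade query
hdfs dfsadmin -finalizeUpgrade
After finalization, the DataNodes can clean up the retained pre-upgrade block copies and the block count should drop.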
Created 05-31-2022 01:00 PM
Did you use the -skipTrash option during the deletion?
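If -skipTrash was not used, the removed files were only moved into the user's trash directory and keep holding their blocks until the trash interval expires. A quick check and cleanup could look like this (the username in the path is a placeholder):
hdfs dfs -du -s -h /user/<username>/.Trash
hdfs dfs -expunge
The expunge command deletes trash checkpoints older than the retention threshold, after which the blocks become eligible for removal.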
Created 06-01-2022 03:21 AM
A DN should keep only the files that are still managed and known by the NN. After a huge deletion event these "pending deletes" may of course take some time to be sent to the DNs (and for the DNs to delete them), but that usually does not take very long. Maybe check the "select pending_deletion_blocks" chart, if applicable.
So if the above do not apply, check it more deeply (a sketch of these checks follows below):
- collect a full hdfs fsck -files -blocks -locations output
- pick a DN which you think has more blocks than it should
- verify how many blocks the hdfs fsck report lists for that DN
- verify on the DN side how many block files it is storing - do those numbers match?
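A rough sketch of those checks; the DataNode hostname and the local data directory (dfs.datanode.data.dir) are placeholders for this cluster's actual values:
hdfs fsck / -files -blocks -locations > /tmp/fsck.out
grep -c '<datanode-host-or-ip>' /tmp/fsck.out
find /data/dfs/dn -name 'blk_*' ! -name '*.meta' | wc -l
The grep gives an approximate count of block replicas that fsck places on that DN, while the find counts the block files actually present on its disks (depending on the version, hdfs dfsadmin -report also prints a per-DataNode "Num of Blocks" line). If the local file count is much higher than what fsck reports, the extra block files are no longer known to the NameNode.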
Created 06-05-2022 11:11 PM
@andrea_pretotto, Has the reply helped resolve your issue? If so, can you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?
Regards,
Vidya Sargur, Community Manager