Created on 06-12-2018 02:04 PM - edited 09-16-2022 06:20 AM
I am working on a 16-node cluster and recently ran into an issue with non-DFS storage: the /hadoop/ mount I am using is being consumed by non-DFS data. When I looked, I found many files named like blk_12345, together with matching .meta files. Each block file is 128 MB and each .meta file about 1.1 MB; in total these files consume 1.7 TB of cluster storage. Please let me know whether I can remove these files, and what the impact would be if I do.
Also, why are they created in the first place?
Created on 06-21-2018 07:02 PM - edited 08-17-2019 07:23 PM
Hi @Karthik Chandrashekhar!
Sorry for my delay. Taking a look at your du outputs, it looks like HDFS is doing okay with the DFS total size.
If you sum the values under /hadoop/hadoop/hdfs/data across the 16 hosts, it comes to 1.7 TB.
Do you have a dedicated mount for /hadoop/hadoop/hdfs/data, or is it all under the / directory? And how many disks do you have?
E.g., in my case, I have a lab where everything sits under the / directory on one disk.
[root@c1123-node3 hadoop]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
rootfs                          1.2T  731G  423G  64% /
overlay                         1.2T  731G  423G  64% /
tmpfs                           126G     0  126G   0% /dev
tmpfs                           126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/vg01-vsr_lib_docker 1.2T  731G  423G  64% /etc/resolv.conf
/dev/mapper/vg01-vsr_lib_docker 1.2T  731G  423G  64% /etc/hostname
/dev/mapper/vg01-vsr_lib_docker 1.2T  731G  423G  64% /etc/hosts
shm                              64M   12K   64M   1% /dev/shm
overlay                         1.2T  731G  423G  64% /proc/meminfo
If I run du --max-depth=1 -h /hadoop/hdfs/data on every host and sum the results, I get my DFS usage.
And if I run du --max-depth=1 -h / on every host and subtract the HDFS directory's value, I get the total non-DFS usage.
So the math would be:
DFS Usage = Total DU on the HDFS Path
NON-DFS Usage = Total DU - (DFS Usage)
For each disk.
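To make the arithmetic above concrete, here is a minimal sketch. The numeric values are hypothetical placeholders standing in for what du would report on one of your disks:

```shell
#!/bin/sh
# Hypothetical per-disk figures in KB -- on a real host you would fill
# these in from du, e.g.:
#   total_du=$(du -s / 2>/dev/null | awk '{print $1}')
#   dfs_usage=$(du -s /hadoop/hadoop/hdfs/data | awk '{print $1}')
total_du=1000000
dfs_usage=700000

# NON-DFS Usage = Total DU - DFS Usage (per disk)
non_dfs=$((total_du - dfs_usage))
echo "DFS usage:     ${dfs_usage} KB"
echo "non-DFS usage: ${non_dfs} KB"   # prints: non-DFS usage: 300000 KB
```

Repeat the same subtraction per disk on each host and sum the results to get the cluster-wide figures.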
And answering your last question: the finalized folder is used by HDFS to hold blocks that have been fully written and closed. So deleting these files will most likely raise alerts from HDFS (a block missing a replica, or a corrupted block).
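As an aside, you can total up what those blk_* files (block data plus their .meta checksum companions) occupy on a DataNode with a small find/awk pipeline. The snippet below builds a tiny mock of the directory layout, with hypothetical names and sizes, so it runs without a real cluster; on a real node you would point find at your dfs.datanode.data.dir instead:

```shell
#!/bin/sh
# Build a tiny mock of a DataNode's finalized directory (the block pool
# name, subdir, and block id here are all illustrative).
mock=$(mktemp -d)
mkdir -p "$mock/current/BP-1/current/finalized/subdir0"
head -c 1024 /dev/zero > "$mock/current/BP-1/current/finalized/subdir0/blk_12345"
touch "$mock/current/BP-1/current/finalized/subdir0/blk_12345_1001.meta"

# Sum the bytes used by blk_* files, just as you would on a real node with:
#   find /hadoop/hadoop/hdfs/data -name 'blk_*' -type f -printf '%s\n' | awk ...
total=$(find "$mock" -name 'blk_*' -type f -printf '%s\n' | awk '{s+=$1} END {print s}')
echo "block files use $total bytes"   # prints: block files use 1024 bytes

rm -rf "$mock"
```

Note that -printf is a GNU find extension, which is fine on typical Linux cluster nodes.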
I completely understand your concern about your storage getting almost full, but if you aren't able to delete any data outside of HDFS, I'd try to: delete old and unused files from HDFS (using the hdfs dfs command!), compress any raw data, use file formats with compression enabled, or, as a last resort, lower the replication factor (kindly remember that changing it can cause problems). Just a friendly reminder: everything under dfs.datanode.data.dir is used internally by HDFS for block storage 🙂
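Those cleanup options could look roughly like the following sketch. The paths (/data/old, /data/archive) are placeholders, it assumes an hdfs client on the PATH, and a DRY_RUN guard only prints the commands so nothing is deleted by accident:

```shell
#!/bin/sh
DRY_RUN=true   # set to false to actually execute the hdfs commands

run() {
  if [ "$DRY_RUN" = true ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Remove old, unused data (-skipTrash frees the space immediately,
#    bypassing the HDFS trash directory)
run hdfs dfs -rm -r -skipTrash /data/old

# 2. As a last resort, lower replication on a cold dataset
#    (risky: fewer copies of each block survive a disk failure)
run hdfs dfs -setrep -w 2 /data/archive
```

The dry-run wrapper is just a safety habit; review the printed commands before flipping DRY_RUN to false.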
Hope this helps!
Created 06-19-2018 08:01 PM
Hey @Karthik Chandrashekhar!
Unfortunately, I'm not able to see the content of your attachment; could you upload it again, please?
And regarding your subdir, could you share the output of the following command:
du --max-depth=1 -h /hadoop/
Hope this helps!
Created on 06-20-2018 05:59 AM - edited 08-17-2019 07:23 PM
Created 06-21-2018 04:33 AM
Can you please let me know if I can delete the subdirNNN directories under the finalized folder?
And how can I permanently stop files from being stored in the finalized folder?
Created 06-26-2018 05:39 AM
Thank you very much @Vinicius Higa Murakami