Support Questions

Find answers, ask questions, and share your expertise

Non-DFS storage occupied in Hadoop mount in Linux server


I am working on a 16-node cluster and recently ran into an issue with non-DFS storage: the /hadoop/ mount I am using is being consumed by what is reported as non-DFS data. When I investigated, I found many files named blk_12345 (and similar) along with their .meta companions, where each block file is 128 MB and each .meta file is about 1.1 MB. Altogether these files consume 1.7 TB of cluster storage. Please let me know whether I can remove these files, and what the impact would be if I do.

Also, why are these files created in the first place?


14 REPLIES


Hey @Karthik Chandrashekhar!
Unfortunately, I'm not able to see the content of your attachment; could you upload it again, please?

And regarding your subdir question, could you share the output of the following command:

du --max-depth=1 -h /hadoop/

Hope this helps!


Hi @Vinicius Higa Murakami,

Please find the attachment again.

77768-lsblk.png


Hi @Vinicius Higa Murakami,

Can you please let me know whether I can delete the subdirNNN directories under the finalized folder?

And how can I permanently stop files from being stored in the finalized folder?


Hi @Karthik Chandrashekhar!
Sorry about my delay. Taking a look at your du outputs, it looks like HDFS is doing fine with the DFS total size.
If you sum the values from the 16 hosts under /hadoop/hadoop/hdfs/data, it equals 1.7 TB.
Do you have a dedicated disk mounted for /hadoop/hadoop/hdfs/data, or is it all under the / directory? And how many disks do you have?
E.g., in my case I have a lab where everything is under the / directory on one disk.

[root@c1123-node3 hadoop]# df -h
Filesystem            Size  Used Avail Use% Mounted on
rootfs                1.2T  731G  423G  64% /
overlay               1.2T  731G  423G  64% /
tmpfs                 126G     0  126G   0% /dev
tmpfs                 126G     0  126G   0% /sys/fs/cgroup
/dev/mapper/vg01-vsr_lib_docker
                      1.2T  731G  423G  64% /etc/resolv.conf
/dev/mapper/vg01-vsr_lib_docker
                      1.2T  731G  423G  64% /etc/hostname
/dev/mapper/vg01-vsr_lib_docker
                      1.2T  731G  423G  64% /etc/hosts
shm                    64M   12K   64M   1% /dev/shm
overlay               1.2T  731G  423G  64% /proc/meminfo

If I run du --max-depth=1 -h /hadoop/hdfs/data on every host and sum the results, I get my DFS usage.
And if I run du --max-depth=1 -h / on every host and subtract the value from the HDFS directory, I get the total non-DFS usage.
So the math, for each disk, would be:
DFS Usage = total du on the HDFS path
Non-DFS Usage = total du - DFS Usage
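As a rough sketch of that per-disk arithmetic, the non-DFS figure can be derived from df and du (the /hadoop/hdfs/data path below is just the example used in this thread; substitute your own dfs.datanode.data.dir value):

```shell
#!/bin/sh
# non_dfs_kb DIR: non-DFS usage (in KB) on the disk backing DIR,
# i.e. total used on that filesystem minus what DIR itself consumes.
non_dfs_kb() {
    # Total space used on the filesystem that holds DIR
    # (df -P gives POSIX-portable, non-wrapped output).
    total_kb=$(df -kP "$1" | awk 'NR==2 {print $3}')
    # DFS usage: everything under the DataNode data directory.
    dfs_kb=$(du -sk "$1" 2>/dev/null | awk '{print $1}')
    echo $((total_kb - dfs_kb))
}

# Example (path from this thread; use your own dfs.datanode.data.dir):
# non_dfs_kb /hadoop/hdfs/data
```

Run it against each DataNode data directory and you should see the same non-DFS number the NameNode UI reports for that host, give or take reserved space.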

77793-screen-shot-2018-06-21-at-145126.png

And answering your last question: the finalized folder is used by HDFS to store blocks that have been finalized (fully written). So deleting those files will likely trigger HDFS alerts (e.g., blocks missing a replica, or corrupted blocks).
I completely understand your concern about your storage getting almost full, but if you aren't able to delete any data outside of HDFS, I'd try to: delete old and unused files from HDFS (using the hdfs dfs command!), compress any raw data, use file formats with compression enabled, or, as a last resort, lower the replication factor (keep in mind that changing this may cause problems). Just a friendly reminder: everything under dfs.datanode.data.dir is used internally by HDFS for block storage 🙂
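Those cleanup options can be sketched as the commands below. This is a dry run by default (it only prints the commands); the paths and the replication factor are hypothetical examples, not taken from this thread, so adjust them before setting DRY_RUN=0 against a real cluster:

```shell
#!/bin/sh
# Dry-run sketch of the cleanup options described above.
DRY_RUN=1

hdfs_cmd() {
    if [ "$DRY_RUN" = 1 ]; then
        # Record and print the command instead of executing it.
        LAST_CMD="hdfs $*"
        echo "$LAST_CMD"
    else
        hdfs "$@"
    fi
}

# 1. See which HDFS directories are biggest, so old data is easy to spot.
hdfs_cmd dfs -du -h /

# 2. Remove old, unused data -- always through hdfs dfs, never by
#    deleting blk_* files from the DataNode disks directly.
hdfs_cmd dfs -rm -r /tmp/old-dataset

# 3. Last resort: lower the replication factor (e.g. from 3 to 2).
#    This frees space but reduces fault tolerance.
#    (After any change, hdfs fsck / reports missing or corrupt blocks.)
hdfs_cmd dfs -setrep -w 2 /data/archive
```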

Hope this helps!


Thank you very much, @Vinicius Higa Murakami!