Created 03-17-2017 11:41 AM
I am trying to create a report containing HDFS space usage per directory
The command I am using is hdfs dfs -du /
While this command gives an overview, it does not take the replication factor into account.
Requesting more info via the quota command hdfs dfs -count -q / shows raw space usage where quotas are set, but certainly not for all of our directories.
So for calculating the correct space usage these commands are of little use. Does anyone have a good approach to calculate the space usage correctly?
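If I remember correctly, newer Hadoop releases add a second column to -du that shows the space consumed across all replicas, but the version we run only prints the logical size per path:

# on newer releases this should print both the size and the replicated disk usage per path
hdfs dfs -du -h /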
Created 03-17-2017 02:54 PM
Have you used the dfsadmin command before? It will tell you the DFS used and remaining space:
hdfs dfsadmin -report
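For example, to pull just the cluster-wide summary lines out of the report (field names may differ slightly between versions):

# show only the cluster-level capacity and usage lines
hdfs dfsadmin -report | grep -E 'Configured Capacity|DFS Used|DFS Remaining'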
Created 03-17-2017 06:33 PM
Use the following command to get more details about live, dead and decommissioning DataNodes, along with their respective configured capacity and DFS/non-DFS usage:
hdfs dfsadmin -report -live -dead -decommissioning
Created 03-20-2017 08:08 AM
The dfsadmin -report command only returns data at the cluster and node level, which is not sufficient for me.
As an example: on our cluster we host several projects, each with a different replication factor, and we want to know how much these projects actually consume on disk.
Since these projects are located in different sub-directories, we can retrieve their usage with
hdfs dfs -du /, but the numbers are not correct, since they do not take the replication factor into account.
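One workaround I am looking at is to derive it from hdfs dfs -ls -R, which lists the replication factor (column 2) and the file size (column 5) for every file, and to sum size x replication per project directory. A rough sketch (the /projects path is just an example of our layout):

# list the project directories, then sum size * replication over every file below each one
for dir in $(hdfs dfs -ls /projects | awk 'NR > 1 {print $NF}'); do
  used=$(hdfs dfs -ls -R "$dir" | awk '$1 !~ /^d/ {sum += $2 * $5} END {printf "%.0f\n", sum}')
  echo "$dir: $used bytes on disk including replication"
done

It is slow on large trees because it lists every file, but it does give the raw bytes each project occupies on the DataNodes.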
Created 03-20-2017 06:25 PM
Please read this: https://community.hortonworks.com/questions/89641/disk-size-used-is-bigger-than-replication-number-m...
To know what is available, take the number of unused blocks and multiply it by the block size. If your replication factor is 3, then sum the unused blocks x block size across all your DataNodes and divide by three.
To know what is used, it is the same calculation: number of used blocks x block size, but keep in mind that most blocks are filled to less than 100% of the block size.
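If you can run fsck against the project directories, its summary already reports the pieces of that calculation: the total logical size and the average block replication, whose product approximates the raw bytes on disk for that subtree. For example (the path is illustrative):

# fsck's summary includes "Total size" and "Average block replication";
# their product approximates the raw DataNode usage for the subtree
hdfs fsck /projects/projectA | grep -E 'Total size|Average block replication'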
If this helped, please vote/accept as best answer.