I am trying to create a report containing HDFS space usage per directory.
The command I am using is hadoop fs -du /
While this command gives an overview, it does not take the replication factor into account.
Requesting more information via the quota command hadoop fs -count -q / shows raw space usage where quotas are set, but certainly not for all of our directories.
So these commands are of little use for calculating the actual space usage. Does anyone have a good approach to calculate the space usage correctly?
I am trying to set up a space usage report. At the moment I am using this command to show the actual space used:
hadoop fs -du /
However, this command does not take the replication factor of the data into account and therefore does not show the correct space usage. As a workaround, I tried combining it with this command:
hadoop fs -count -q /
While this shows raw space information for some directories, it certainly does not for all of them (namely those without a quota set). So to get a correct overview of which directories are consuming space, all these commands seem pretty useless.
Does anyone have experience with this?
Have you used the dfsadmin command before? This will tell you dfs remaining and used:
hdfs dfsadmin -report
Use the following command to get more details about live, dead, and decommissioning data nodes, along with their respective configured capacity, DFS and non-DFS usage, etc.
hdfs dfsadmin -report -live -dead -decommissioning
The dfsadmin -report command only returns data on cluster and node level, which is not sufficient for me.
As an example: on our cluster we host several projects, each with a different replication factor, and we want to know what these projects actually consume on disk.
Since these projects are located in different sub-directories, we can retrieve their usage with
hadoop fs -du /
But the result is not correct, since it does not take the replication factor into account.
To know what is available, take the number of unused blocks and multiply it by the block size. If your replication factor is 3, compute (unused blocks x block size) across all your data nodes and divide by three.
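As a rough sketch of the "available" calculation above: assuming the cluster-wide summary of hdfs dfsadmin -report contains a line of the form "DFS Remaining: <bytes> (...)", and that a single replication factor applies to the data you are about to write, the usable space is the raw remaining bytes divided by that factor. The function name usable_remaining_bytes and the default factor of 3 are my own choices for illustration.

```python
import re


def usable_remaining_bytes(report_text, replication=3):
    """Estimate how many bytes of new data fit, given dfsadmin report text.

    Assumption: the cluster-wide summary is printed before the per-datanode
    sections, so the first "DFS Remaining:" line is the cluster total.
    """
    match = re.search(r"^DFS Remaining:\s*(\d+)", report_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'DFS Remaining:' line found in report")
    raw_remaining = int(match.group(1))
    # Every byte written is stored `replication` times on disk.
    return raw_remaining // replication
```

With mixed replication factors across projects this is only an estimate, since the real figure depends on which project writes the data.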
To know what is used, it is the same: number of blocks used x block size. Note, however, that your blocks are most likely less than 100% full, so this overestimates the actual usage.
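For the per-directory "used" figure, one possible approach (a sketch, not a tested production script) is to skip block counting and work from the recursive listing, hadoop fs -ls -R /, which prints each file's replication factor and length. Assuming the usual eight-column layout (permissions, replication, owner, group, size, date, time, path), summing size x replication per directory gives the raw bytes each project occupies. The function name raw_usage_per_dir, the depth parameter and the sample listing below are hypothetical.

```python
from collections import defaultdict


def raw_usage_per_dir(ls_output, depth=1):
    """Sum size * replication per directory from `hadoop fs -ls -R` text.

    Assumed columns: permissions replication owner group size date time path
    """
    usage = defaultdict(int)
    for line in ls_output.splitlines():
        parts = line.split(None, 7)  # maxsplit keeps paths with spaces intact
        # Directories show "-" as replication; skip them and any other
        # line that does not look like a file entry.
        if len(parts) < 8 or not parts[1].isdigit():
            continue
        replication, size, path = int(parts[1]), int(parts[4]), parts[7]
        # Group by the first `depth` directory components, e.g. /projectA.
        dirs = path.strip("/").split("/")[:-1]
        key = "/" + "/".join(dirs[:depth])
        usage[key] += size * replication
    return dict(usage)


if __name__ == "__main__":
    # Hypothetical listing, for illustration only.
    sample = "\n".join([
        "drwxr-xr-x   - alice hadoop    0 2016-01-01 10:00 /projectA",
        "-rw-r--r--   3 alice hadoop 1000 2016-01-01 10:00 /projectA/part-0",
        "-rw-r--r--   3 alice hadoop  500 2016-01-01 10:00 /projectA/sub/part-1",
        "drwxr-xr-x   - bob hadoop      0 2016-01-01 10:00 /projectB",
        "-rw-r--r--   2 bob hadoop   2000 2016-01-01 10:00 /projectB/data.csv",
    ])
    print(raw_usage_per_dir(sample))
    # {'/projectA': 4500, '/projectB': 4000}
```

Because the sum uses each file's real length rather than whole blocks, partially filled last blocks are not over-counted. On a large namespace a full recursive listing can be slow, so treat this as a starting point only.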