Hadoop Space Calculation: How can I calculate the actual DFS storage used correctly?

New Contributor

I am trying to create a report of HDFS space usage per directory.

The command I am using is hdfs dfs -du /

While this command gives an overview, it does not take the replication factor into account.

Requesting more detail via the quota command hdfs dfs -count -q / shows raw space usage where quotas are set, but quotas are certainly not set on all of our directories.

So for calculating the actual space usage these commands are pretty useless. Does anyone have a good approach to calculate it correctly?


I am trying to set up a space usage report. At the moment I am using this command to show the space actually used:

hdfs dfs -du /

However, this command does not take the replication factor of the data into account and therefore does not show the correct space usage. As a workaround I tried this command:

hdfs dfs -count -q /

While this shows raw space information for some directories, it certainly does not for all of them (only where a quota is set). So to get a correct overview of which directories are consuming space, all these commands seem pretty useless.

Does anyone have experience with this?
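
For reference: on newer Hadoop releases (2.8 and later, if memory serves) the -du command itself already prints a second column with the disk space consumed across all replicas, which is the "raw" figure asked about here. A minimal sketch, where /projects is just a made-up example path:

hdfs dfs -du -h /projects
# first column: logical size of the data
# second column (Hadoop 2.8+): disk space consumed including replication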

4 REPLIES

Have you used the dfsadmin command before? It will tell you the DFS used and remaining capacity:

hdfs dfsadmin -report
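
If you only need the cluster-wide totals, the summary block at the top of the report can be filtered, for example (a rough sketch; the label strings below are from memory and may differ slightly between versions):

hdfs dfsadmin -report | grep -E 'Configured Capacity|DFS Used|DFS Remaining'
# prints the capacity/usage lines from the cluster summary and from each datanode section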

Cloudera Employee

@Guy Riems

Use the following command to get more details about live, dead, and decommissioning datanodes, along with their respective configured capacity, DFS/non-DFS usage, etc.

hdfs dfsadmin -report -live -dead -decommissioning

New Contributor

The dfsadmin -report command only returns data at the cluster and node level, which is not sufficient for me.

As an example: on our cluster we host several projects, each with a different replication factor, and we want to know what these projects actually consume on disk.

Since these projects are located in different sub-directories, we can retrieve the usage with

hdfs dfs -du /

but the numbers are not correct, since the command does not take the replication factor into account.
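
One possible workaround where -du only reports the logical size: multiply each file's size by its own replication factor, both of which appear in the -ls -R listing. A rough sketch, assuming the project directories live under a hypothetical /projects path:

# sum size (column 5) x replication factor (column 2) for every file below /projects;
# directories show "-" in the replication column and are skipped
hdfs dfs -ls -R /projects | awk '$2 != "-" { raw += $5 * $2 } END { printf "raw bytes: %.0f\n", raw }'

Running it once per project directory gives a per-project figure; note that it can be slow on trees with millions of files.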

Super Guru

@Guy Riems

Please read this: https://community.hortonworks.com/questions/89641/disk-size-used-is-bigger-than-replication-number-m...

To know what is available, take the number of unused blocks and multiply it by the block size. If your replication factor is 3, then take the number of unused blocks x size/block across all your data nodes and divide by three.

To know what is used, it is the same: number of used blocks x size/block, but keep in mind that your blocks are most likely filled to less than 100% ...
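
As a rough worked example (the numbers are made up): with a 128 MB block size and replication factor 3, 1,200 used blocks amount to about 1,200 x 128 MB = 150 GB of raw disk, which corresponds to at most 150 / 3 = 50 GB of logical data, and usually less because the last block of each file is rarely completely full.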

+++

If this helped, please vote/accept as best answer.