Created 10-24-2016 01:08 PM
Howdy,
Thanks for your question. It can be quite jarring to see two columns in your output when du normally only has one column, but fear not, there is an explanation for everything. 🙂
I found a similar post discussing this topic. From looking at it, it's clear that it took some digging to get to the bottom of this, but if you look towards the bottom, you'll see a link to the source code explaining the output format. Eureka!
Anywho, the source code states that the first two columns are formatted like this:
[size] [disk space consumed]
But what does this mean? The "size" field is the base size of the file or directory before replication. As we know, HDFS replicates files, so the second field (disk space consumed) is included to show you how much total disk space that file or directory takes up after it's been replicated. Under the default replication factor of three, the first two columns of a 1 MB file would theoretically look like this:
1 M  3 M
The fun part is that we can actually use this info to infer the replication factor HDFS is using for these particular files, or at least the amount of replication the file is currently at. If you look at the first line of your output, you'll see the initial size as 816 and the disk space usage as 1.6 K. Divide 1.6 K (about 1,638 bytes) by 816 bytes, and you get roughly 2, which would indicate a replication factor of two. You'll notice this math is consistent with the other entries in the output. Good times.
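If you'd rather not do that division by hand, here's a rough Python sketch of the same arithmetic. The helper names (to_bytes, estimate_replication) are made up for illustration and aren't part of any Hadoop tool, and I'm assuming the K/M/G suffixes are powers of 1024. Since the human-readable sizes are rounded, treat the result as an estimate rather than the exact replication setting.

```python
# Hypothetical helpers for illustration only; not part of Hadoop.
# Assumption: size suffixes are binary (1 K = 1024 bytes).
UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(text):
    """Convert a size such as '816', '1.6 K' or '1.6K' to a byte count."""
    text = text.strip().upper()
    unit = text[-1] if text[-1].isalpha() else ""
    number = text[:-1] if unit else text
    return float(number) * UNITS[unit]

def estimate_replication(size, consumed):
    """Divide disk space consumed by base size and round to a whole number."""
    return round(to_bytes(consumed) / to_bytes(size))

# First line of the output discussed above: size 816, disk space consumed 1.6 K
print(estimate_replication("816", "1.6 K"))  # prints 2
```

Feeding it the theoretical 1 MB example (a size of "1 M" and a disk space consumed of "3 M") gives 3, which matches the default replication factor.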
Armed with this knowledge, you can now use the du tool to its full potential, both for informational and troubleshooting purposes. Let me know if this info was helpful or if you have any other questions. 🙂
Cheers