For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script:

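#!/usr/bin/env bash

# Maximum directory depth to list; 5 is an arbitrary default, adjust as needed.
max_depth=5

# Root directories, largest first. $F[-1] is the path column whether
# `hdfs dfs -du` prints two or three columns.
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[-1] "')

printf "%15s  %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
    printf "%15.0f  %s\n" "$(hdfs dfs -du -s "$ld" | cut -d' ' -f1)" "$ld"
    all_dirs=$(hdfs dfs -ls -R "$ld" | grep -E '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )

    for d in $all_dirs; do
        line=$(hdfs dfs -du -s "$d")
        size=$(echo $line | cut -d' ' -f1)
        parent_dir=${d%/*}
        child=${d##*/}
        if [ -n "$parent_dir" ]; then
            # Replace the parent path with '|' plus dashes to draw the tree.
            leading_dirs=$(echo "$parent_dir" | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
            d=${leading_dirs}/$child
        fi
        printf "%15.0f  %s\n" "$size" "$d"
    done
done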

Sample output:

     bytes  directory
 480376973  /hdp
 480376973  |---/apps
 480376973  |--------/
  98340772  |---------------------/hive
 210320342  |---------------------/mapreduce
  97380893  |---------------------/pig
  15830286  |---------------------/sqoop
  58504680  |---------------------/tez
  24453973  /user
         0  |----/admin
   3629715  |----/ambari-qa
   3440200  |--------------/.staging
    653010  |-----------------------/job_1454293069490_0001
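
The indentation in the sample output follows a simple rule: the parent part of each path is replaced by a pipe and one dash per remaining character, and only the last path component is kept. As a minimal sketch of that rule in pure bash (no HDFS or Perl needed), the hypothetical helper `tree_label` below reproduces the labels. The function name is an illustration only and is not part of the original script.

```shell
#!/usr/bin/env bash

# tree_label PATH  (hypothetical helper, for illustration only)
# Print PATH as a tree label: the parent portion becomes '|' followed by
# dashes, so the label is exactly as wide as the full path.
tree_label() {
    local path=$1
    local parent=${path%/*}   # everything before the last '/'
    local child=${path##*/}   # last path component
    if [ -z "$parent" ]; then
        # Top-level directory such as /hdp: print it unchanged.
        printf '%s\n' "$path"
        return
    fi
    # One dash for every character of the parent except the first,
    # which is drawn as the pipe.
    local dashes
    dashes=$(printf '%*s' "$(( ${#parent} - 1 ))" '' | tr ' ' '-')
    printf '|%s/%s\n' "$dashes" "$child"
}

tree_label /hdp          # prints /hdp
tree_label /hdp/apps     # prints |---/apps
tree_label /user/admin   # prints |----/admin
```

Because the label is as wide as the full path, deeper directories are pushed further right, which is what produces the staircase effect in the listing.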

Great, thanks for sharing! This might also help

Thanks Vladimir!

I liked this so much I created a contrib folder in my Perl tools repo to save it.

You can find similarly useful tools for Hadoop, HBase, Hive, Ambari, etc. in the adjacent repos: (Perl; this is where I copied this script to, and there are also tools for HDFS file and snapshot age-out, Ambari FreeIPA automation, a Hive SQL recaser, etc.) (lots of PySpark, HDFS and HBase code, as well as an Ambari Blueprint CLI tool, data validators and converters) (350+ enterprise monitoring plugins for most of the Hadoop ecosystem, including integration with Ambari for everything it monitors, plus NoSQL datastores, message queues, CI and infrastructure)

Oh Jonas, that is excellent, thanks very much for the link. I've starred your repo and am going to link to it from my PyTools and Tools repos, which have a whole selection of related Hadoop tools, as I think people would be interested in it.