For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script:

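#!/usr/bin/env bash
# Print an indented tree of HDFS directory sizes, largest root directories first.
max_depth=5   # upper limit on the '/'-separated field count of each 'ls -R' line

# Root directories sorted by size, largest first. The path is the last field of
# 'hdfs dfs -du' output (newer Hadoop releases print an extra column before it).
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[-1] "')

printf "%15s  %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
    printf "%15.0f  %s\n" "$(hdfs dfs -du -s "$ld" | cut -d' ' -f1)" "$ld"
    # Keep directory entries only, and drop those nested deeper than max_depth allows.
    all_dirs=$(hdfs dfs -ls -R "$ld" | grep -E '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )

    for d in $all_dirs; do
        line=$(hdfs dfs -du -s "$d")
        size=$(echo "$line" | cut -d' ' -f1)
        parent_dir=${d%/*}
        child=${d##*/}
        if [ -n "$parent_dir" ]; then
            # Replace the parent path with a '|---' run of the same length.
            leading_dirs=$(echo "$parent_dir" | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
            d=${leading_dirs}/$child
        fi
        printf "%15.0f  %s\n" "$size" "$d"
    done
done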

Sample output:

     bytes  directory
 480376973  /hdp
 480376973  |---/apps
 480376973  |--------/2.3.4.0-3485
  98340772  |---------------------/hive
 210320342  |---------------------/mapreduce
  97380893  |---------------------/pig
  15830286  |---------------------/sqoop
  58504680  |---------------------/tez
  24453973  /user
         0  |----/admin
   3629715  |----/ambari-qa
   3440200  |--------------/.staging
    653010  |-----------------------/job_1454293069490_0001
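
The indentation in the sample output comes from replacing each directory's parent path with a run of dashes of the same length. As a rough sketch of that one step, runnable without an HDFS cluster, the hypothetical helper below (tree_prefix is an illustrative name, not part of the script above) does the same thing with plain bash parameter expansion instead of perl:

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only: turn a full path into the
# indented form used in the listing.
tree_prefix() {
    local path=$1
    local parent=${path%/*}    # everything before the last '/'
    local child=${path##*/}    # last path component
    if [ -z "$parent" ]; then
        # Top-level directory such as /hdp: print it unchanged.
        printf '%s\n' "$path"
        return
    fi
    # Build one '-' per character of the parent path ...
    local dashes
    dashes=$(printf '%*s' "${#parent}" '' | tr ' ' '-')
    # ... then swap the first one for '|'.
    printf '|%s/%s\n' "${dashes:1}" "$child"
}

tree_prefix /hdp          # /hdp
tree_prefix /hdp/apps     # |---/apps
tree_prefix /user/admin   # |----/admin
```

Because the dash run is as long as the parent path, each child name starts in the same column it would occupy in the full path, which is what makes the listing read as a tree.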
Comments

Great, thanks for sharing! This might also help: https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb

Thanks Vladimir!

I liked this so much I created a contrib folder in my Perl tools repo to save it.

You can find similarly useful tools for Hadoop, HBase, Hive, Ambari, etc. there and in adjacent repos:

https://github.com/HariSekhon/tools (Perl; this is where I copied this script. There are also tools for aging out HDFS files and snapshots, Ambari FreeIPA automation, a Hive SQL recaser, etc.)

https://github.com/HariSekhon/pytools (lots of PySpark, HDFS and HBase code, as well as an Ambari Blueprint CLI tool, data validators and converters)

https://github.com/HariSekhon/nagios-plugins (350+ enterprise monitoring plugins for most of the Hadoop ecosystem, including integration with Ambari for everything it monitors, plus NoSQL datastores, message queues, CI and infrastructure)

Oh Jonas, that is excellent, thanks very much for the link. I've starred your repo and am going to link to it from my PyTools and Tools repos, which have a whole selection of related Hadoop tools, as I think people would be interested in it.