Created on 02-14-2016 05:26 PM
For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage that requires no extra tools, use this script:
#!/usr/bin/env bash
max_depth=5

largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "')

printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
    printf "%15.0f %s\n" $(hdfs dfs -du -s $ld| cut -d' ' -f1) $ld
    all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"" )
    for d in $all_dirs; do
        line=$(hdfs dfs -du -s $d)
        size=$(echo $line | cut -d' ' -f1)
        parent_dir=${d%/*}
        child=${d##*/}
        if [ -n "$parent_dir" ]; then
            leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
            d=${leading_dirs}/$child
        fi
        printf "%15.0f %s\n" $size $d
    done
done
Sample output:
          bytes directory
      480376973 /hdp
      480376973 |---/apps
      480376973 |--------/2.3.4.0-3485
       98340772 |---------------------/hive
      210320342 |---------------------/mapreduce
       97380893 |---------------------/pig
       15830286 |---------------------/sqoop
       58504680 |---------------------/tez
       24453973 /user
              0 |----/admin
        3629715 |----/ambari-qa
        3440200 |--------------/.staging
         653010 |-----------------------/job_1454293069490_0001
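The indentation in the output comes from replacing every character of the parent path with a dash and then turning the first dash into a pipe. As a rough sketch of just that step, here is a pure-bash version that can be tried without a cluster. The function name tree_label is hypothetical and not part of the script above, and the sketch swaps the perl substitution for bash parameter expansion; it assumes absolute paths with no trailing slash.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the original script): print an absolute path
# in the indented form shown in the sample output.
tree_label() {
    local path=$1
    local parent=${path%/*}    # everything before the last "/"
    local child=${path##*/}    # last path component

    if [ -z "$parent" ]; then
        # Top-level directory such as /hdp: print it unchanged
        printf '%s\n' "$path"
        return
    fi

    # One dash per character of the parent path ...
    local dashes=${parent//?/-}
    # ... then drop the first dash and put "|" in its place
    printf '|%s/%s\n' "${dashes:1}" "$child"
}

tree_label /hdp
tree_label /hdp/apps
tree_label /hdp/apps/2.3.4.0-3485
```

Because the prefix is as long as the parent path, each child lines up under the end of its parent's name, which is what gives the output its tree-like shape.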
Created on 02-14-2016 06:14 PM
Great, thanks for sharing! This might also help: https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb
Created on 09-08-2017 09:19 AM
Thanks Vladimir!
I liked this so much I created a contrib folder in my Perl tools repo to save it.
You can find similarly useful tools for Hadoop, HBase, Hive, Ambari, etc. in the adjacent repos:
https://github.com/HariSekhon/tools (Perl; this is where I saved a copy of this script. There are also tools for aging out HDFS files and snapshots, Ambari FreeIPA automation, a Hive SQL recaser, etc.)
https://github.com/HariSekhon/pytools (lots of PySpark, HDFS and HBase code here as well as Ambari Blueprint CLI tool, data validators and converters)
https://github.com/HariSekhon/nagios-plugins (350+ enterprise monitoring plugins for most of the Hadoop ecosystem, including integration to Ambari for everything it monitors too, plus NoSQL datastores, message queues, CI and infrastructure)
Created on 09-08-2017 09:36 AM
Oh Jonas, that is excellent, thanks very much for the link. I've starred your repo and am going to link to it from my PyTools and Tools repos, which have a whole selection of related Hadoop tools, as I think people would be interested in it.