Created on 02-14-2016 05:26 PM
For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use this script:
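#!/usr/bin/env bash
# Print a tree of HDFS directory sizes, largest top-level directories first.
# Paths containing whitespace are not supported.
max_depth=5
# Top-level directories sorted by size, descending. The path is taken as the last
# field of the `hdfs dfs -du` output, so an extra column before it does no harm.
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[-1] "')
printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
  printf "%15.0f %s\n" "$(hdfs dfs -du -s "$ld" | cut -d' ' -f1)" "$ld"
  # Directories under $ld whose listing line splits into at most max_depth fields on '/'
  all_dirs=$(hdfs dfs -ls -R "$ld" | grep -E '^dr........' | perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"")
  for d in $all_dirs; do
    line=$(hdfs dfs -du -s "$d")
    size=$(echo $line | cut -d' ' -f1)
    parent_dir=${d%/*}
    child=${d##*/}
    if [ -n "$parent_dir" ]; then
      # Replace the parent path with "|" plus dashes of the same length to draw the tree
      leading_dirs=$(echo "$parent_dir" | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
      d=${leading_dirs}/$child
    fi
    printf "%15.0f %s\n" "$size" "$d"
  done
done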
Sample output:
          bytes directory
      480376973 /hdp
      480376973 |---/apps
      480376973 |--------/2.3.4.0-3485
       98340772 |---------------------/hive
      210320342 |---------------------/mapreduce
       97380893 |---------------------/pig
       15830286 |---------------------/sqoop
       58504680 |---------------------/tez
       24453973 /user
              0 |----/admin
        3629715 |----/ambari-qa
        3440200 |--------------/.staging
         653010 |-----------------------/job_1454293069490_0001
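The tree shape in the output comes from a single step: the parent path of each directory is replaced by a bar followed by dashes, keeping the same total length. Here is a minimal sketch of that step in pure bash, runnable without a cluster. The function name tree_label is hypothetical and is not part of the script above; it also swaps the perl substitution for bash parameter expansion, so treat it as an illustration rather than a drop-in replacement.

```shell
#!/usr/bin/env bash
# tree_label is a hypothetical helper name, used only for this illustration.
# It rewrites /a/b/c so that the parent path /a/b becomes "|" plus dashes
# with the same total length, which is what produces the tree indentation.
tree_label() {
  local path=$1
  local parent=${path%/*}   # everything before the last slash
  local child=${path##*/}   # last path component
  if [ -z "$parent" ]; then
    printf '%s\n' "$path"   # top-level directory: print unchanged
    return
  fi
  local dashes
  # Build a run of spaces one shorter than the parent, then turn them into dashes
  printf -v dashes '%*s' "$(( ${#parent} - 1 ))" ''
  printf '|%s/%s\n' "${dashes// /-}" "$child"
}

tree_label /hdp
tree_label /hdp/apps
tree_label /user/ambari-qa/.staging
```

The first two calls print /hdp and |---/apps, matching the corresponding rows of the sample output. Assumes absolute paths without whitespace, as the script itself does.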
Created on 02-14-2016 06:14 PM
Great, thanks for sharing! This might also help: https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb
Created on 09-08-2017 09:19 AM
Thanks Vladimir!
I liked this so much I created a contrib folder in my Perl tools repo to save it.
You can find similarly useful tools for Hadoop, HBase, Hive, Ambari, etc. there and in adjacent repos:
https://github.com/HariSekhon/tools (Perl; this is where I copied this script to. There are also tools for aging out HDFS files and snapshots, Ambari FreeIPA automation, a Hive SQL recaser, etc.)
https://github.com/HariSekhon/pytools (lots of PySpark, HDFS and HBase code here as well as Ambari Blueprint CLI tool, data validators and converters)
https://github.com/HariSekhon/nagios-plugins (350+ enterprise monitoring plugins for most of the Hadoop ecosystem, including integration to Ambari for everything it monitors too, plus NoSQL datastores, message queues, CI and infrastructure)
Created on 09-08-2017 09:36 AM
Oh Jonas, that is excellent, thanks very much for the link. I've starred your repo and am going to link to it from my PyTools and Tools repos, which have a whole selection of related Hadoop tools, as I think people would be interested in it.