Created on 02-14-2016 05:26 PM
For a UI showing the biggest consumers of space in HDFS, install and configure Twitter's HDFS-DU. For a quick visual representation of HDFS disk usage with no extra tools required, use the following script:
#!/usr/bin/env bash

# Only report directories up to this many path components deep
max_depth=5

# Top-level HDFS directories, largest first
# (field 2 of 'hdfs dfs -du -s' output is the path)
largest_root_dirs=$(hdfs dfs -du -s '/*' | sort -nr | perl -ane 'print "$F[1] "')

printf "%15s %s\n" "bytes" "directory"
for ld in $largest_root_dirs; do
    printf "%15.0f %s\n" $(hdfs dfs -du -s $ld | cut -d' ' -f1) $ld
    # All subdirectories of $ld no deeper than $max_depth
    # (field 8 of 'hdfs dfs -ls -R' output is the path)
    all_dirs=$(hdfs dfs -ls -R $ld | egrep '^dr........' | \
               perl -ane "scalar(split('/',\$_)) <= $max_depth && print \"\$F[7]\n\"")
    for d in $all_dirs; do
        line=$(hdfs dfs -du -s $d)
        size=$(echo $line | cut -d' ' -f1)
        parent_dir=${d%/*}
        child=${d##*/}
        if [ -n "$parent_dir" ]; then
            # Replace the parent path with dashes so the child
            # lines up as a branch of the tree
            leading_dirs=$(echo $parent_dir | perl -pe 's/./-/g; s/^.(.+)$/\|$1/')
            d=${leading_dirs}/$child
        fi
        printf "%15.0f %s\n" $size $d
    done
done
Sample output:
          bytes directory
      480376973 /hdp
      480376973 |---/apps
      480376973 |--------/2.3.4.0-3485
       98340772 |---------------------/hive
      210320342 |---------------------/mapreduce
       97380893 |---------------------/pig
       15830286 |---------------------/sqoop
       58504680 |---------------------/tez
       24453973 /user
              0 |----/admin
        3629715 |----/ambari-qa
        3440200 |--------------/.staging
         653010 |-----------------------/job_1454293069490_0001
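To try it, save the script to a file and run it as a user with read access to the directories being measured; it only needs the hdfs client and perl on the PATH. A short sketch (the filename hdfs_du_tree.sh is just an example):

chmod +x hdfs_du_tree.sh
./hdfs_du_tree.sh

# Raising max_depth descends further into the tree, at the cost of one
# 'hdfs dfs -du -s' call per directory, which can be slow on large namespaces.

# With no script at all, the stock client still gives per-root byte totals,
# largest first:
hdfs dfs -du -s '/*' | sort -nr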
Created on 02-14-2016 06:14 PM
Great, thanks for sharing! This might also help: https://github.com/mr-jstraub/HDFSQuota/blob/master/HDFSQuota.ipynb
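For anyone who wants the raw quota numbers that notebook works with, the stock CLI can report and manage them as well. A minimal sketch (the path /user/admin is just an example):

# Show name quota, remaining name quota, space quota, remaining space quota,
# then dir/file counts and bytes used:
hdfs dfs -count -q -h /user/admin

# Set a 10 GB space quota on the directory (requires HDFS superuser):
hdfs dfsadmin -setSpaceQuota 10g /user/admin

# Remove the space quota again:
hdfs dfsadmin -clrSpaceQuota /user/admin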
Created on 09-08-2017 09:19 AM
Thanks Vladimir!
I liked this so much that I created a contrib folder in my Perl tools repo to save it.
You can find similarly useful tools for Hadoop, HBase, Hive, Ambari, etc. there and in adjacent repos:
https://github.com/HariSekhon/tools (Perl; this is where I copied the script, and there is also tooling for aging out HDFS files and snapshots, Ambari FreeIPA automation, a Hive SQL recaser, etc.)
https://github.com/HariSekhon/pytools (lots of PySpark, HDFS and HBase code here, as well as an Ambari Blueprints CLI tool, data validators and converters)
https://github.com/HariSekhon/nagios-plugins (350+ enterprise monitoring plugins covering most of the Hadoop ecosystem, including integration with Ambari for everything it monitors, plus NoSQL datastores, message queues, CI and infrastructure)
Created on 09-08-2017 09:36 AM
Oh Jonas, that is excellent, thanks very much for the link. I've starred your repo and am going to put links to it on my PyTools and Tools repos, which have a whole selection of related Hadoop tools, as I think people would be interested in that.