Created 02-24-2016 01:12 PM
Do we have any script we can use to clean the /tmp/hive/ directory on HDFS frequently? It is consuming terabytes of space.
I have gone through the one below, but I am looking for a shell script.
https://github.com/nmilford/clean-hadoop-tmp/blob/master/clean-hadoop-tmp
Created 02-24-2016 01:19 PM
Create a file /scripts/myLogCleaner.sh ( or whatever )
and add the following command ( which deletes all files that have "log" in the name and are older than a day ):
find /tmp/hive -name "*log*" -mtime +1 -exec rm {} \;
and crontab it.
crontab -e
0 0 * * * /scripts/myLogCleaner.sh
This will start the cleaner every day at midnight.
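So the whole cleaner script is just ( a minimal sketch; the path and name pattern are only the examples from above ):
#!/bin/bash
# /scripts/myLogCleaner.sh -- deletes local files under /tmp/hive that
# have "log" in the name and are older than one day
find /tmp/hive -name "*log*" -mtime +1 -exec rm {} \;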
( obviously just one out of approximately 3 million different ways to do it 🙂 )
Edit: ah, not the logs of the Hive CLI but the scratch dir of Hive. That makes it a bit harder, since there is no hadoop find. Weird that it grows so big; it should clean up after itself unless the command line interface or task gets killed.
Created 02-24-2016 01:23 PM
This is on HDFS, Benjamin. I mean the same approach, just with HDFS commands instead of the local FS.
Created 02-24-2016 01:37 PM
@Benjamin Leonhardi : I can do it easily on the local filesystem, but I am looking at the HDFS /tmp/hive dir.
So do we have anything like this for HDFS?
Created 02-24-2016 01:43 PM
That would be the time when I start writing some Python magic to parse the timestamp from the hadoop fs -ls output. Or, to be faster, a small Java program doing the same with the FileSystem API.
Someone apparently already did the first approach with a shell script. Replace the echo with a hadoop fs -rm -r -f and you might be good. But I didn't test it, obviously ...
http://stackoverflow.com/questions/12613848/finding-directories-older-than-n-days-in-hdfs
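For example, an untested sketch of that adaptation ( assuming, like the linked script, that hadoop fs -ls prints the modification date in column 6 and the path in column 8 ):
#!/bin/bash
# Untested sketch: delete HDFS directories under /tmp/hive older than 30 days.
hadoop fs -ls /tmp/hive | grep "^d" | while read -r line; do
  dir_date=$(echo "$line" | awk '{print $6}')
  path=$(echo "$line" | awk '{print $8}')
  # Age in days, using GNU date to convert the listed date to epoch seconds.
  age=$(( ($(date +%s) - $(date -d "$dir_date" +%s)) / 86400 ))
  if [ "$age" -gt 30 ]; then
    hadoop fs -rm -r -f "$path"    # the linked answer uses echo here
  fi
done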
Created 02-24-2016 01:44 PM
@Benjamin Leonhardi yep, I did that a while ago with the Java HDFS API: look up the paths, identify the age of the files, delete.
Created 04-08-2016 03:05 PM
@Saurabh Kumar To add to this, you could investigate third-party dev projects such as https://github.com/nmilford/clean-hadoop-tmp
Created 08-30-2016 11:24 PM
You can do:
#!/bin/bash
usage="Usage: dir_diff.sh [days]"
if [ ! "$1" ]; then
  echo "$usage"
  exit 1
fi
now=$(date +%s)
# Column 6 of 'hadoop fs -ls -R' is the modification date, column 8 the path.
hadoop fs -ls -R /tmp/ | grep "^d" | while read -r f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
  # Remove directories older than the given number of days.
  if [ "$difference" -gt "$1" ]; then
    hadoop fs -rm -r "$(echo "$f" | awk '{print $8}')"
  fi
done
Adjust the directories or files you need to clean up as appropriate.
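For example, to remove everything under /tmp/ older than 30 days ( assuming the script is saved as dir_diff.sh ):
chmod +x dir_diff.sh
./dir_diff.sh 30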
Created 09-21-2016 10:16 AM
@Gurmukh Singh: I tried this script and am not getting anything, just the output below.
[user@server2~]$ ./cleanup.sh
Usage: dir_diff.sh [30]
I have the same thing in my script as what you mentioned.