Do we have any script that we can use to clean the /tmp/hive/ directory on HDFS frequently? It is consuming terabytes of space.

Guru

I have gone through the one below, but I am looking for a shell script.

https://github.com/nmilford/clean-hadoop-tmp/blob/master/clean-hadoop-tmp

1 ACCEPTED SOLUTION

Contributor

You can do:

#!/bin/bash
# Remove HDFS directories under /tmp/ that are older than the given number of days.
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]; then
  echo "$usage"
  exit 1
fi

now=$(date +%s)

# In "hadoop fs -ls -R" output, column 6 is the modification date and column 8 is the path;
# grep "^d" keeps only directories.
hadoop fs -ls -R /tmp/ | grep "^d" | while read f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    hadoop fs -rm -r $(echo "$f" | awk '{print $8}')
  fi
done

Adjust the directories or files in the script to whatever you need to clean up.
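
For example, assuming the script is saved as dir_diff.sh (the name its usage string suggests) and made executable, it could be run against directories older than 30 days and then scheduled from cron, along these lines (a sketch, not tested; the /scripts path is just an example):

chmod +x dir_diff.sh
./dir_diff.sh 30     # remove /tmp/ directories not modified in the last 30 days

# crontab -e entry to run it nightly at midnight (hypothetical install path):
0 0 * * * /scripts/dir_diff.sh 30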


13 REPLIES

Master Guru

Create a file /scripts/myLogCleaner.sh (or whatever name you prefer)

and add the following command (which deletes all files that have "log" in their name and are older than one day):

find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

Then crontab it:

crontab -e

0 0 * * * /scripts/myLogCleaner.sh

This will start the cleaner every day at midnight.

(obviously just one of approximately 3 million different ways to do it 🙂)
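
Put together, the cleaner script described above might look like this (just a sketch of the steps in this post; the path and name are the examples used here):

#!/bin/bash
# /scripts/myLogCleaner.sh
# Delete local files under /tmp/hive that have "log" in the name and are older than one day
find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

Remember to make it executable (chmod +x /scripts/myLogCleaner.sh) so the crontab entry can run it.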

Edit: ah, not the logs of the Hive CLI but the Hive scratch directory. That makes it a bit harder, since there is no hadoop find. It's weird that it grows so big; Hive should clean up after itself unless the CLI or the task gets killed.

Master Mentor

This is on HDFS, Benjamin. I mean the same approach, just with HDFS commands instead of the local filesystem.
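
For illustration, the rough mapping between the two would look something like this (a sketch; since, as noted later in the thread, there is no hadoop equivalent of find -mtime, the age check has to be done by parsing the listing, as in the accepted solution above):

# Local filesystem: filter by age and delete in one step
find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

# HDFS: list recursively, parse the modification date from column 6 of the output,
# then delete the directories that are too old (placeholder path below)
hadoop fs -ls -R /tmp/hive/
hadoop fs -rm -r /tmp/hive/<old-directory>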

Guru

@Benjamin Leonhardi: I can do it easily on the local filesystem, but I am looking at the /tmp/hive directory on HDFS.

So do we have anything like this for HDFS?

Master Guru

That would be the point where I'd start writing some Python magic to parse the timestamps from the hadoop fs -ls output. Or, to be faster, a small Java program doing the same thing with the FileSystem API.

Someone apparently already did the first approach with a shell script. Replace the echo with hadoop fs -rm -r -f and you might be good. But I didn't test it, obviously ...

http://stackoverflow.com/questions/12613848/finding-directories-older-than-n-days-in-hdfs

Master Mentor

@Benjamin Leonhardi Yep, I did that a while ago with the Java HDFS API: look up the paths, identify the age of the files, and delete.

Super Collaborator

Expert Contributor

@Saurabh Kumar To add to this, you could also investigate third-party projects such as https://github.com/nmilford/clean-hadoop-tmp

Guru

@Gurmukh Singh: I tried this script and I am not getting anything, just the output below.

[user@server2~]$ ./cleanup.sh

Usage: dir_diff.sh [30]

I have the same thing in the script that you mentioned.