Do we have any script that we can use to clean the /tmp/hive/ directory on HDFS frequently? It is consuming terabytes of space.

Guru

I have gone through the one below, but I am looking for a shell script.

https://github.com/nmilford/clean-hadoop-tmp/blob/master/clean-hadoop-tmp

1 ACCEPTED SOLUTION

Contributor

You can do:

#!/bin/bash
# Remove HDFS directories under /tmp/ that are older than the given number of days.
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]; then
  echo "$usage"
  exit 1
fi

now=$(date +%s)

# In "hadoop fs -ls -R" output, column 6 is the modification date and column 8 is the path;
# grep "^d" keeps only directories.
hadoop fs -ls -R /tmp/ | grep "^d" | while read f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    hadoop fs -rm -r $(echo "$f" | awk '{print $8}')
  fi
done

Adjust the directories or files in the script to whatever you need to clean up.
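
For example, assuming the script is saved as dir_diff.sh (the name its usage string suggests) and made executable, it could be run against directories older than 30 days and then scheduled from cron, along these lines (a sketch, not tested; the /scripts path is just an example):

chmod +x dir_diff.sh
./dir_diff.sh 30     # remove /tmp/ directories not modified in the last 30 days

# crontab -e entry to run it nightly at midnight (hypothetical install path):
0 0 * * * /scripts/dir_diff.sh 30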


13 REPLIES

Master Guru

Create a file /scripts/myLogCleaner.sh (or whatever name you prefer)

and add the following command (which deletes all files that have "log" in their name and are older than one day):

find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

Then crontab it:

crontab -e

0 0 * * * /scripts/myLogCleaner.sh

This will start the cleaner every day at midnight.

(obviously just one of approximately 3 million different ways to do it 🙂)
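
Put together, the cleaner script described above might look like this (just a sketch of the steps in this post; the path and name are the examples used here):

#!/bin/bash
# /scripts/myLogCleaner.sh
# Delete local files under /tmp/hive that have "log" in the name and are older than one day
find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

Remember to make it executable (chmod +x /scripts/myLogCleaner.sh) so the crontab entry can run it.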

Edit: ah, not the logs of the Hive CLI but the Hive scratch directory. That makes it a bit harder, since there is no hadoop find. It's weird that it grows so big; Hive should clean up after itself unless the CLI or the task gets killed.

Master Mentor

This is on HDFS, Benjamin. I mean the same approach, just with HDFS commands instead of the local filesystem.
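
For illustration, the rough mapping between the two would look something like this (a sketch; since, as noted later in the thread, there is no hadoop equivalent of find -mtime, the age check has to be done by parsing the listing, as in the accepted solution above):

# Local filesystem: filter by age and delete in one step
find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

# HDFS: list recursively, parse the modification date from column 6 of the output,
# then delete the directories that are too old (placeholder path below)
hadoop fs -ls -R /tmp/hive/
hadoop fs -rm -r /tmp/hive/<old-directory>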

Guru

@Benjamin Leonhardi: I can do it easily on the local filesystem, but I am looking at the /tmp/hive directory on HDFS.

So do we have anything like this for HDFS?

Master Guru

That would be the point where I'd start writing some Python magic to parse the timestamps from the hadoop fs -ls output. Or, to be faster, a small Java program doing the same thing with the FileSystem API.

Someone apparently already did the first approach with a shell script. Replace the echo with hadoop fs -rm -r -f and you might be good. But I didn't test it, obviously ...

http://stackoverflow.com/questions/12613848/finding-directories-older-than-n-days-in-hdfs

Master Mentor

@Benjamin Leonhardi Yep, I did that a while ago with the Java HDFS API: look up the paths, identify the age of the files, and delete.

Super Collaborator

Expert Contributor

@Saurabh Kumar To add to this, you could also investigate third-party projects such as https://github.com/nmilford/clean-hadoop-tmp

Guru

@Gurmukh Singh: I tried this script and I am not getting anything, just the output below.

[user@server2~]$ ./cleanup.sh

Usage: dir_diff.sh [30]

I have the same thing in the script that you mentioned.