
Is there a script we can use to clean the /tmp/hive/ directory on HDFS regularly? It is consuming terabytes of space.

Guru


I have gone through the one below, but I am looking for a shell script.

https://github.com/nmilford/clean-hadoop-tmp/blob/master/clean-hadoop-tmp

1 ACCEPTED SOLUTION

Accepted Solutions

Explorer

You can do:

#!/bin/bash
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]; then
  echo "$usage"
  exit 1
fi

now=$(date +%s)

hadoop fs -ls -R /tmp/ | grep "^d" | while read f; do
  dir_date=$(echo "$f" | awk '{print $6}')
  difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
  if [ "$difference" -gt "$1" ]; then
    hadoop fs -rm -r $(echo "$f" | awk '{print $8}')
  fi
done

Replace the directories or files you need to clean up appropriately.
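Saved as dir_diff.sh (the name used in its usage string) and made executable, the script would be run as, for example, ./dir_diff.sh 30 to remove directories older than 30 days. The age check it relies on is plain epoch arithmetic, which can be sanity-checked standalone with fixed dates (the dates here are just examples):

```shell
# Sanity check of the script's age arithmetic using two fixed dates;
# the real script uses now=$(date +%s) instead of a fixed "now".
now=$(date -d "2016-02-15" +%s)
dir_date="2016-01-01"
difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
echo "$difference"   # prints 45
```

Note that date -d is GNU date syntax; on non-GNU platforms (e.g. macOS) the date parsing would need adjusting.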


13 REPLIES

Create a file /scripts/myLogCleaner.sh ( or whatever )

add the following command (which deletes all files with "log" in the name that are older than one day):

find /tmp/hive -name '*log*' -mtime +1 -exec rm {} \;

and crontab it.

crontab -e

0 0 * * * /scripts/myLogCleaner.sh

This will start the cleaner every day at midnight.

( obviously just one out of approximately 3 million different ways to do it 🙂 )

Edit: ah, not the logs of the Hive CLI but the scratch directory of Hive. That makes it a bit harder, since there is no hadoop equivalent of find. It is odd that it grows so big; it should clean up after itself, unless the command line interface or task gets killed.

Mentor

This is on HDFS, Benjamin. I mean the same approach, just with HDFS commands instead of the local filesystem ones.

Guru

@Benjamin Leonhardi: I can do this easily locally, but I am looking at the HDFS /tmp/hive directory.

So do we have anything like this for HDFS?

That would be the point where I would start writing some Python magic to parse the timestamps from the hadoop fs -ls output. Or, to be faster, a small Java program doing the same with the FileSystem API.
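The timestamp-parsing idea can be sketched without a cluster by running a captured hadoop fs -ls -R output line through the same awk field extraction the accepted script uses (the sample line and path below are made up for illustration):

```shell
# A sample line in the format emitted by `hadoop fs -ls -R` (fields: perms,
# replication, owner, group, size, date, time, path). The path is hypothetical.
line="drwxr-xr-x   - hive hadoop          0 2016-01-01 12:00 /tmp/hive/hive/session1"

# Field 6 is the modification date, field 8 the path.
dir_date=$(echo "$line" | awk '{print $6}')
dir_path=$(echo "$line" | awk '{print $8}')

# Age in whole days relative to a fixed "now" (2016-01-31) so the result is
# reproducible; in a real script "now" would be $(date +%s).
now=$(date -d "2016-01-31" +%s)
age_days=$(( ( now - $(date -d "$dir_date" +%s) ) / 86400 ))

echo "$dir_path is $age_days days old"   # prints: /tmp/hive/hive/session1 is 30 days old
```

A real cleaner would pipe each such line through this logic and call hadoop fs -rm -r on paths past the threshold, as in the accepted script.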

Someone apparently already did the first approach as a shell script. Replace the echo with hadoop fs -rm -r -f and you might be good. But I didn't test it, obviously...

http://stackoverflow.com/questions/12613848/finding-directories-older-than-n-days-in-hdfs

Mentor

@Benjamin Leonhardi yep, I did that a while ago with the Java HDFS API. Look up the paths, identify the age of the files, delete.


Rising Star

@Saurabh Kumar To add to this, you could investigate third party dev projects such as https://github.com/nmilford/clean-hadoop-tmp


Guru

@Gurmukh Singh: I tried this script but I am not getting anything, just the output below.

[user@server2~]$ ./cleanup.sh

Usage: dir_diff.sh [30]

I have the same thing in my script as you mentioned.