Created 09-08-2016 11:51 AM
I recently realized that more than half of all our HDFS usage is under /tmp.
I wrote a script to find all the data, and it looks like the vast majority of it is under /tmp/hive/***, for example:
/tmp/hive/root
/tmp/hive/hdfs
/tmp/hive/my_user
These have tens of TB in each of them and quite a lot of it is very old.
Is it safe to delete this data? Say, anything older than 30 days? Would 14 days be safe?
Any best practices here?
It seems odd that there is nothing built-in to maintain this space...
Created 09-08-2016 04:18 PM
Yes, it is safe to remove these folders and clean up, as long as no query that is still running is using them. Cleanup scripts for this already exist. When a client runs a query through HiveServer2, Hive first creates these temporary folders to store intermediate/temporary data. For most queries, this data is cleaned up at the end of the query, but when a query runs into problems the files can be left behind and you have to clean them up manually. Check this link for more details.
The following link might also give you some ideas on how to clean up.
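To make the manual cleanup concrete, here is a minimal sketch in Python of an age-based sweep. It is not one of the cleanup scripts mentioned above, only an illustration under some assumptions: the `hdfs` command is on the PATH, `hdfs dfs -ls` prints its usual eight-column listing (permissions, replication, owner, group, size, date, time, path), and the 30-day cutoff and the /tmp/hive/my_user parent are just the values from the question. The function names `stale_paths` and `delete_stale` are made up for this example.

```python
import subprocess
from datetime import datetime, timedelta


def stale_paths(ls_output, now, max_age_days):
    """Return the paths in `hdfs dfs -ls` output last modified before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for line in ls_output.splitlines():
        # Assumed row layout: perms, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips the "Found N items" header and blank lines
        try:
            modified = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # not a listing row we understand
        if modified < cutoff:
            stale.append(fields[7])
    return stale


def delete_stale(parent="/tmp/hive/my_user", max_age_days=30, dry_run=True):
    """List `parent` and remove (or, by default, only print) entries older than the cutoff."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", parent],
        check=True, capture_output=True, text=True,
    ).stdout
    for path in stale_paths(listing, datetime.now(), max_age_days):
        if dry_run:
            print("would delete", path)
        else:
            # -skipTrash frees the space immediately instead of moving data to .Trash
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)
```

Calling `delete_stale()` with the default `dry_run=True` only prints what would be removed, which is worth reviewing before running with `dry_run=False`. The check uses the modification time of each entry directly under the parent, so a directory whose direct children changed recently is kept. Pick a cutoff comfortably longer than your longest-running query or session.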