Created on 01-12-2016 04:49 PM - edited 09-16-2022 02:56 AM
Recently I noticed that Cloudera Manager is showing more data in HDFS storage than I believed I was using.
So I investigated from the command line, starting with the following command:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /
This shows that the /tmp directory is several hundred GB (over a TB with replication), so I dug deeper:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/
The majority of this space is taken up by the /tmp/hive/ subdirectory, so looking into that:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/hive/
I see the following, which shows far more storage for two of the users than for everyone else:
351.8 G  1.0 T    /tmp/hive/admin
0        0        /tmp/hive/anonymous
195.7 G  587.1 G  /tmp/hive/cdh-oozie
0        0        /tmp/hive/csalas
0        0        /tmp/hive/hive
0        0        /tmp/hive/jculley
0        0        /tmp/hive/jfogarty
0        0        /tmp/hive/jjohnbosco
0        0        /tmp/hive/jkarmelek
0        0        /tmp/hive/jmasloski
0        0        /tmp/hive/pscott
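A quick way to rank these directories is to drop -h, so that the first column is raw bytes, and sort on it. This is only a minimal sketch: the helper name top_usage is made up for illustration, and it assumes the three-column du output shown here (size, size with replication, path).

```shell
# Rank `hadoop fs -du` output (raw bytes, no -h) by the first column, largest first.
# Reads du lines on stdin and prints the top N entries (default 5).
# top_usage is a hypothetical helper name, not part of Hadoop.
top_usage() {
    sort -k1,1nr | head -n "${1:-5}"
}

# Example against a cluster (requires an HDFS client on the PATH):
#   hadoop fs -du /tmp/hive/ | top_usage 3
```

Sorting on raw bytes avoids having to compare mixed units such as "351.8 G" and "1.0 T".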
The cdh-oozie user runs many HiveServer2 actions through Oozie, so it makes sense to me that it uses a lot of storage. It's a lot, but believable for Hive.
However, the admin user is the surprise and also the biggest hog. I kept digging into the /tmp/hive/admin/ subdirectories and found what look like sessions from six months ago. Below is where this finally led me (there are 638 items, but I show only the first 2). It looks to me like pieces of an old Hive query:
[hdfs@cdhcan01 ~]$ hadoop fs -ls /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001
Found 638 items
-rw-r--r-- 3 admin supergroup 441903776 2015-06-22 17:56 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000000_0
-rw-r--r-- 3 admin supergroup 448117217 2015-06-22 17:55 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000001_0
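To spot stale session directories like this one, the listing can be filtered on its modification-date column. This is a rough sketch, assuming the eight-column `hadoop fs -ls` layout shown in this post and paths without spaces. The helper name older_than is hypothetical, and the cutoff date is only an example.

```shell
# Print paths from `hadoop fs -ls` output whose modification date is before a cutoff.
# Usage: hadoop fs -ls <dir> | older_than YYYY-MM-DD
# Assumes columns: perms, replication, owner, group, size, date, time, path.
# older_than is a hypothetical helper name, not part of Hadoop.
older_than() {
    awk -v cutoff="$1" '
        # Skip the "Found N items" header and any short lines.
        NF >= 8 && ($6 "") < (cutoff "") { print $8 }
    '
}

# Example against a cluster (requires an HDFS client on the PATH):
#   hadoop fs -ls /tmp/hive/admin/ | older_than 2016-01-01
```

Because the dates are in YYYY-MM-DD form, a plain string comparison orders them correctly, so no date arithmetic is needed.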
I'd like to go through and clean up this /tmp/hive/admin/ directory, but I'm not really sure how it's getting populated.
Why wouldn't HDFS or Hive have cleaned this up on its own, especially when it looks clean for other users?
Can someone help me figure out whether I can safely go ahead and delete these items to free up space?
Finally, what generally populates the /tmp/hive/ subdirectories, and when do they get cleaned out?
Thanks for any help or insight into this!
Created 02-18-2016 12:59 AM
Created 03-20-2016 07:45 AM
Thanks. I ended up just removing these, as they were orphaned data sets from failed sessions.
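For anyone doing the same cleanup, here is a cautious sketch of how the removal could be scripted. It only prints the delete commands so they can be reviewed before anything is run. The helper name emit_rm_commands and the /tmp/hive/ guard are my own assumptions, not something from the thread. It does not check whether a session is still active, so that has to be confirmed separately.

```shell
# Turn a list of HDFS paths (one per line on stdin) into delete commands.
# The commands are printed, not executed, so they can be reviewed first.
# emit_rm_commands is a hypothetical helper name, not part of Hadoop.
emit_rm_commands() {
    while IFS= read -r path; do
        # Safety guard (an assumption): ignore blank lines and anything
        # that is not underneath /tmp/hive/.
        case "$path" in
            /tmp/hive/?*) printf "hadoop fs -rm -r -skipTrash '%s'\n" "$path" ;;
        esac
    done
}

# Example:
#   printf '%s\n' /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005 | emit_rm_commands
```

The -skipTrash flag is used because, when HDFS trash is enabled, a plain -rm moves the data into the user's .Trash directory and the space is not freed until the trash is emptied. Leave the flag out if you would rather keep a recovery window.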
Created 08-03-2018 12:10 PM