Why does /tmp/hive/admin/ take up so much space?
Created on 01-12-2016 04:49 PM - edited 09-16-2022 02:56 AM
Recently I noticed that Cloudera Manager is showing more data in HDFS storage than I believed I was using.
So I investigated from the command line, starting with the following command:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /
I saw that the /tmp directory is several hundred GB (over a TB with replication), so I dug deeper:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/
Most of that space is taken up by the /tmp/hive/ subdirectory, so I looked into that:
[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/hive/
The output shows far more storage for two of the users than for everyone else:
351.8 G  1.0 T    /tmp/hive/admin
0        0        /tmp/hive/anonymous
195.7 G  587.1 G  /tmp/hive/cdh-oozie
0        0        /tmp/hive/csalas
0        0        /tmp/hive/hive
0        0        /tmp/hive/jculley
0        0        /tmp/hive/jfogarty
0        0        /tmp/hive/jjohnbosco
0        0        /tmp/hive/jkarmelek
0        0        /tmp/hive/jmasloski
0        0        /tmp/hive/pscott
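With many user directories, the human-readable sizes are awkward to sort. A minimal sketch for ranking them, assuming `hadoop fs -du` is run without `-h` so the first column is a raw byte count (`rank_by_size` is a hypothetical helper name, not part of Hadoop):

```shell
# Rank the entries of `hadoop fs -du` output by size, largest first.
# Reads the listing on stdin; the first column must be raw bytes (no -h),
# and the path is expected in the last column.
rank_by_size() {
  sort -k1,1nr
}
```

Usage would look like `hadoop fs -du /tmp/hive/ | rank_by_size | head -n 5`.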
The cdh-oozie user runs many HiveServer2 actions through Oozie, so it is believable that it uses a lot of Hive scratch space.
However, the admin user is the surprise and also the biggest consumer. I kept digging into the /tmp/hive/admin/ subdirectories and found what look like sessions from six months ago. Below is where this finally led me (there are 638 items, but I show only the first 2); it looks to me like pieces of an old Hive query:
[hdfs@cdhcan01 ~]$ hadoop fs -ls /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001
Found 638 items
-rw-r--r-- 3 admin supergroup 441903776 2015-06-22 17:56 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000000_0
-rw-r--r-- 3 admin supergroup 448117217 2015-06-22 17:55 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000001_0
I'd like to go through and clean up this /tmp/hive/admin/ directory, but I'm not really sure how it's getting populated.
Why wouldn't HDFS or Hive have cleaned this up on its own, especially when it looks clean for other users?
Can someone point me in the right direction for figuring out whether I can go ahead and start deleting these items to free up space?
Finally, what generally populates the /tmp/hive/ subdirectories, and when do they get cleaned out?
Thanks for any help or insight into this!
Created 02-18-2016 12:59 AM
HS2 cleans up a query's temporary files once the session that ran the query has terminated.
With Hue, especially on versions prior to CDH 5.2.0, you may have a situation where the admin user's sessions were never closed or terminated, so HS2 continues to hold references to the queries that user ran in the past. The other users are likely ending their Hue-backed sessions correctly (this depends on how they work in Hue).
If you have CDH 5.2.0 or above, consider setting the various idle server-side timeouts under CM -> Hive -> Configuration (search "idle").
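To see what is currently set, one option is to look for the idle-related properties in the HiveServer2 configuration. This is only a sketch: it assumes the settings in question correspond to the Hive properties `hive.server2.idle.session.timeout`, `hive.server2.idle.operation.timeout`, and `hive.server2.session.check.interval`, and that each `<value>` sits on the line right after its `<name>`. `show_idle_settings` is a hypothetical helper name.

```shell
# Print the idle-timeout related properties found in a hive-site.xml file.
# Assumption: <value> is on the line immediately following <name>.
# Usage: show_idle_settings /path/to/hive-site.xml
show_idle_settings() {
  conf_file=$1
  grep -A1 -E '<name>hive\.server2\.(idle\.(session|operation)\.timeout|session\.check\.interval)</name>' "$conf_file"
}
```

For example, `show_idle_settings /etc/hive/conf/hive-site.xml`, where the path is an assumption; on a CM-managed cluster the configuration HS2 actually runs with may live elsewhere, so the values shown in CM remain the authoritative source. No output means none of these properties is set explicitly in that file.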
Created 03-20-2016 07:45 AM
Thanks, I ended up just removing these as they are orphaned data sets from failed sessions.
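For anyone doing a similar cleanup, a rough sketch of how the stale session directories could be listed for review before deleting anything. It assumes the usual eight-column `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path) and paths without spaces; `old_session_dirs` is a hypothetical helper name.

```shell
# List directories whose modification date is earlier than a cutoff.
# Reads `hadoop fs -ls` output on stdin; the cutoff is YYYY-MM-DD.
# The "Found N items" header is skipped because it has fewer than 8 fields.
old_session_dirs() {
  cutoff=$1
  awk -v cutoff="$cutoff" '
    NF >= 8 && $1 ~ /^d/ && ($6 "") < (cutoff "") { print $8 }
  '
}
```

Usage would look like `hadoop fs -ls /tmp/hive/admin | old_session_dirs 2016-01-01`. Each path printed can then be checked and removed with `hadoop fs -rm -r <path>` (add `-skipTrash` only if the space must be freed immediately). Pick a cutoff well in the past, since directories belonging to sessions that are still open should be left alone.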
Created 08-03-2018 12:10 PM
I am on Cloudera 5.14 and I am also seeing this issue. Is this still an issue from version 5.2.x through the current version? Do Hue or Cloudera Manager have a configuration setting to take care of it?
