Why does /tmp/hive/admin/ take up so much space?

Explorer

Recently I noticed that Cloudera Manager is showing more data in HDFS storage than I believed I was using.

So I started investigating from the command line with the following command:

 

[hdfs@cdhcan01 ~]$ hadoop fs -du -h /

 

The /tmp directory is several hundred GB (over a TB with replication), so I dug deeper:

 

[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/

 

The majority of this space is taken up by the /tmp/hive/ subdirectory, so I looked into that:

 

[hdfs@cdhcan01 ~]$ hadoop fs -du -h /tmp/hive/

 

 

The output below shows far more storage for 2 of the users than for everyone else:

 

351.8 G  1.0 T    /tmp/hive/admin
0        0        /tmp/hive/anonymous
195.7 G  587.1 G  /tmp/hive/cdh-oozie
0        0        /tmp/hive/csalas
0        0        /tmp/hive/hive
0        0        /tmp/hive/jculley
0        0        /tmp/hive/jfogarty
0        0        /tmp/hive/jjohnbosco
0        0        /tmp/hive/jkarmelek
0        0        /tmp/hive/jmasloski
0        0        /tmp/hive/pscott
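With more users than this, the big consumers are easier to spot if the listing is sorted. The following is a minimal sketch, not a tested tool: `rank_du` is a hypothetical helper name, and it assumes the `-du -h` output has the same shape as the listing above (size, optional unit, replicated size, optional unit, path). It ranks by the first column, which is the size before replication.

```shell
# Rank `hadoop fs -du -h` output by the first (pre-replication) size column.
# rank_du is a hypothetical helper; it only reads text from stdin.
rank_du() {
  awk '
    # Convert a human-readable size such as "351.8 G" to bytes.
    function to_bytes(num, unit) {
      if (unit == "K") return num * 1024
      if (unit == "M") return num * 1024 ^ 2
      if (unit == "G") return num * 1024 ^ 3
      if (unit == "T") return num * 1024 ^ 4
      if (unit == "P") return num * 1024 ^ 5
      return num
    }
    {
      # A zero size prints as a bare "0" with no unit, so field 2 is a unit
      # only when it is a single letter. The path is always the last field.
      unit = ($2 ~ /^[KMGTP]$/) ? $2 : ""
      printf "%.0f\t%s\n", to_bytes($1, unit), $NF
    }
  ' | sort -t "$(printf '\t')" -k1,1nr
}

# Usage (requires a Hadoop client, so it is left commented out here):
# hadoop fs -du -h /tmp/hive/ | rank_du | head
```

The output is one line per directory, approximate bytes first and path second, largest first. Paths containing spaces would be truncated to their last word, which is an accepted limitation of this sketch.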

The cdh-oozie user runs many HiveServer2 actions through Oozie, so it makes sense to me that it uses a lot of storage. It's a lot, but it is believable that it would use that much space for Hive.

 

However, the admin user is the surprise and also the biggest hog. I kept digging into the /tmp/hive/admin/ subdirectories and found what look like sessions from six months ago. Below is where this finally led me (there are 638 items, but I show only the first 2); it looks to me like pieces of an old Hive query:

 

[hdfs@cdhcan01 ~]$ hadoop fs -ls /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001
Found 638 items
-rw-r--r--   3 admin supergroup  441903776 2015-06-22 17:56 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000000_0
-rw-r--r--   3 admin supergroup  448117217 2015-06-22 17:55 /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005/hive_2015-06-22_17-33-05_894_2084771740530219258-4/-mr-10000/.hive-staging_hive_2015-06-22_17-33-05_894_2084771740530219258-4/-ext-10001/000001_0
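To see how many entries are this old without reading the listing by eye, the `-ls` output can be filtered on its date column. This is a sketch under assumptions: `older_than` is a hypothetical helper name, the cutoff date is whatever you consider stale, and the listing is assumed to use the eight-column layout shown above with paths that contain no spaces. It only lists paths and deletes nothing.

```shell
# Print paths from `hadoop fs -ls` output whose modification date is
# earlier than the cutoff date given as YYYY-MM-DD.
# older_than is a hypothetical helper; it only reads text from stdin.
older_than() {
  awk -v cutoff="$1" '
    # Entry lines have 8 fields: permissions, replication, owner, group,
    # size, date, time, path. The "Found N items" header has fewer fields
    # and is skipped. Appending "" forces a string comparison, which is
    # correct for ISO-style dates.
    NF >= 8 && ($6 "") < (cutoff "") { print $8 }
  '
}

# Usage (requires a Hadoop client, so it is left commented out here):
# hadoop fs -ls /tmp/hive/admin | older_than 2015-12-01
```

Running it against the per-session directories directly under /tmp/hive/admin gives a list of candidate sessions to review before deciding what, if anything, to remove.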

 

 

 

I'd like to go through and clean up this /tmp/hive/admin/ directory, but I'm not really sure how it's getting populated.

Why wouldn't HDFS or Hive have cleaned this up on its own, especially when it looks clean for other users?

Can someone help me figure out whether I can go ahead and start deleting these items to free up space?

Finally, what generally populates the /tmp/hive/ subdirectories, and when do they get cleaned out?

 

Thanks for any help or insight into this!

1 ACCEPTED SOLUTION

Mentor
The "admin" user is usually one used within Hue (it could of course be a valid user in your environment, but this is the only assumption I can draw).

HS2 cleans up the temporary files when the session holding the query that created them has terminated.

With Hue, especially on versions prior to CDH 5.2.0, you may end up in a situation where the admin user's sessions have never been closed or terminated, so HS2 continues to hold references to the queries that user ran in the past. The other users are likely ending their Hue-backed sessions correctly (this depends on how they work in Hue).

If you have CDH 5.2.0 or above, consider setting the various server-side idle timeouts under CM -> Hive -> Configuration (search for "idle").
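For reference, here is a rough sketch of what such settings look like as raw properties. It assumes the CM fields map to the HiveServer2 properties `hive.server2.idle.session.timeout` and `hive.server2.idle.operation.timeout`, with values in milliseconds; the 12 and 6 hour figures are purely illustrative, not recommendations, and `idle_timeout_property` is a hypothetical helper name.

```shell
# Emit one hive-site.xml <property> element for a timeout given in hours,
# converted to milliseconds.
# idle_timeout_property is a hypothetical helper; it only prints text.
idle_timeout_property() {
  name=$1
  hours=$2
  millis=$(( hours * 60 * 60 * 1000 ))
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$name" "$millis"
}

# Illustrative values only: close idle sessions after 12 hours and
# idle operations after 6 hours.
idle_timeout_property hive.server2.idle.session.timeout 12
idle_timeout_property hive.server2.idle.operation.timeout 6
```

In a CM-managed cluster the dedicated configuration fields are the normal place to set these; the XML form is shown only to make the property names and units explicit.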


Explorer

Thanks, I ended up just removing these as they are orphaned data sets from failed sessions.
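For anyone doing the same cleanup, a cautious approach is to generate the delete commands first and review them before running anything. The sketch below is a dry run only: `plan_removals` is a hypothetical helper name, and it assumes the paths to remove are session directories somewhere under /tmp/hive/<user>/, supplied one per line. It refuses anything that is not below a user's directory.

```shell
# Print (do not run) one delete command per path read from stdin.
# plan_removals is a hypothetical helper; it never calls hadoop itself.
plan_removals() {
  while IFS= read -r path; do
    case "$path" in
      # Accept only paths below a user's directory, never the user
      # directory itself or anything outside /tmp/hive.
      /tmp/hive/*/?*) printf 'hadoop fs -rm -r "%s"\n' "$path" ;;
      *) printf 'skipping unexpected path: %s\n' "$path" >&2 ;;
    esac
  done
}

# Usage: review the printed commands, then run them by hand if they look right.
printf '%s\n' /tmp/hive/admin/8c933b36-60e5-412b-8039-408f2eb75005 | plan_removals
```

If HDFS trash is enabled, `hadoop fs -rm -r` moves the data to the user's trash rather than freeing the space immediately; adding `-skipTrash` bypasses that, at the cost of making the delete unrecoverable.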

Explorer
Hi Harsh,

I am on Cloudera 5.14 and I am also seeing this issue. Is this still an issue from version 5.2.x through the current version? Do Hue or Cloudera Manager have a configuration to take care of it?