
How to clean up temporary Hive

Contributor

Hi,

We have an HDFS Capacity Utilization alert in Ambari. The alert says that disk usage is at 81%. After checking HDFS, we found that most of the space is used by the following folders:
- /tmp

- /user/hive/checkpoints_tmp

Could you please give us a clear procedure to clean up those folders without losing any data?

Thank you 

 

Environment info:

HDP-3.0.1.0

HDFS 3.1.0
YARN 3.1.0
MapReduce2 3.0.0.3.0
Hive 3.0.0.3.0
HBase 2.0.0.3.0
ZooKeeper 3.4.9.3.0
Ambari Metrics 0.1.0
Atlas 0.7.0.3.0
Kafka 1.0.0.3.0
Knox 0.5.0.3.0
Ranger 1.0.0.3.0
Kerberos 1.10.3-30

 

1 ACCEPTED SOLUTION


Cloudera Employee

Hi,

Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query finishes. However, if the Hive client terminates abnormally, some data may be left behind.

The configuration details are as follows:

- On the HDFS cluster, this is set to /tmp/hive- by default and is controlled by the configuration variable hive.exec.scratchdir.

- On the client machine, this is hardcoded to /tmp/.

Note that when writing data to a table/partition, Hive first writes to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then moves the data to the target table.

This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.
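To confirm which scratch directory your cluster actually uses, you can read hive.exec.scratchdir from hive-site.xml. The following is only a minimal sketch: the function name read_hive_property is made up for illustration, and it assumes the usual Hadoop-style layout of property/name/value elements.

```python
import xml.etree.ElementTree as ET


def read_hive_property(xml_text, name, default=None):
    """Return the value of one property from hive-site.xml content.

    Assumes the usual Hadoop layout:
    <configuration><property><name>..</name><value>..</value></property></configuration>
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    # Property not set in this file: the caller decides what the fallback is.
    return default
```

Pass it the text of your hive-site.xml (its location depends on the installation) and the name hive.exec.scratchdir. If the property is not set in the file, the function returns the default you supply rather than guessing Hive's built-in value.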


Reference - https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration#AdminManualConfiguration-...


In addition, each session creates a session directory at /tmp/<session-id>_resources. To find out which of the sessions under /tmp are still in use, cross-reference the session ID in the /tmp/<session-id>_resources name with the HiveServer2 (HS2) log.
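The cross-reference can be partly automated. The sketch below is an illustration only: it assumes session IDs are UUID-shaped and appear verbatim in the HS2 log text, and the function name split_by_log_presence is invented here. It only tells you which IDs the log mentions; you still need to read the matching log lines to see whether the session was closed.

```python
import re

# Assumption: the directory name is "<session-id>_resources" and the
# session ID looks like a lowercase UUID.
SESSION_DIR = re.compile(
    r"^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})_resources$"
)


def split_by_log_presence(dir_names, log_text):
    """Split session directories into (mentioned_in_log, not_mentioned)."""
    mentioned, not_mentioned = [], []
    for name in dir_names:
        match = SESSION_DIR.match(name)
        if not match:
            continue  # not a session resources directory, leave it alone
        if match.group(1) in log_text:
            mentioned.append(name)
        else:
            not_mentioned.append(name)
    return mentioned, not_mentioned
```

Directories whose ID never shows up in the retained logs are usually the oldest ones, but log rotation can hide a live session, so treat the result as a hint and not as proof.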

Session directories whose timestamps are older than both hive.server2.idle.session.timeout and hive.server2.idle.operation.timeout can therefore be deleted.


Take the higher of the two values; anything older than that should be safe to delete.
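As a rough sketch of that rule, the snippet below converts both timeout settings to milliseconds and derives the cutoff timestamp from the larger one. The helper names are hypothetical, and the accepted suffixes (ms, s, m, h, d, or a bare number read as milliseconds) are an assumption about how the values are written in your configuration. Negative values are not handled.

```python
import re
from datetime import datetime, timedelta

# Milliseconds per supported unit suffix (assumed notation, see lead-in).
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def duration_to_ms(value):
    """Convert '7d', '6h', '30m', '45s', '500ms' or a bare number to ms."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h|d)?\s*", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_MS[unit or "ms"]


def deletion_cutoff(session_timeout, operation_timeout, now):
    """Return the time before which a session directory counts as stale."""
    longest_ms = max(duration_to_ms(session_timeout),
                     duration_to_ms(operation_timeout))
    return now - timedelta(milliseconds=longest_ms)
```

For example, with a session timeout of 7d and an operation timeout of 5d, the cutoff is seven days before the current time, and only directories last modified before that point are candidates for deletion.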


You can clean the temporary location with a manual script or a scheduled job that runs at regular intervals, for example a cron-driven shell script that deletes data older than 30 or 60 days.
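One possible shape for such a job is sketched below. It is not a tested production script: the function names are invented for illustration, it assumes the standard eight-column output of hdfs dfs -ls, and it defaults to a dry run so that nothing is deleted until you have reviewed the list.

```python
import subprocess
from datetime import datetime, timedelta


def stale_paths(ls_output, cutoff):
    """Return paths from `hdfs dfs -ls` output last modified before cutoff."""
    stale = []
    for line in ls_output.splitlines():
        # Columns: perms, replication, owner, group, size, date, time, path.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # e.g. the "Found N items" header
        try:
            modified = datetime.strptime(f"{fields[5]} {fields[6]}",
                                         "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # not a listing entry
        if modified < cutoff:
            stale.append(fields[7])
    return stale


def clean(directory, max_age_days, dry_run=True):
    """List one HDFS directory and remove entries older than max_age_days."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        check=True, capture_output=True, text=True,
    ).stdout
    cutoff = datetime.now() - timedelta(days=max_age_days)
    for path in stale_paths(listing, cutoff):
        if dry_run:
            print("would delete", path)
        else:
            # -skipTrash frees the space at once instead of moving the data
            # to the HDFS trash; drop the flag if you want a safety net.
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path],
                           check=True)
```

A call such as clean("/tmp/hive", 30) prints the candidates, where the path is only an example; pass dry_run=False once the list looks right. Run it as a user with permission on the directory, and on a Kerberized cluster make sure the job has a valid ticket first.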
