
Hive staging directory not getting cleaned up

Contributor

In CDH 5.8.0, when inserting data with spark-sql, many .hive-staging directories pile up and are never deleted or removed, even though the inserts themselves complete successfully.

 

Please let me know the reason for this behaviour and how I can get rid of the .hive-staging directories. Is there a property we need to set?
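
For reference, a minimal sketch (table names hypothetical) of the kind of spark-sql insert that shows this behaviour; while the job runs, Hive creates a .hive-staging-* scratch directory under the target table's location:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
hiveContext.sql("INSERT INTO TABLE target_table SELECT * FROM source_table")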

8 REPLIES

Re: Hive staging directory not getting cleaned up

Guru

Hi Jais,

 

Can you please let me know where you run your Hive query? Do you run it through Hue?

 

If you run it through Hue, in most cases the staging directory will be left over even after the query finishes. This is because Hue holds the query handle open so that users can get back to the results, and the cleanup of staging directories is only triggered when the query handle is closed.

 

So the first thing I would like to check is where you run your Hive query.

 

Thanks

Re: Hive staging directory not getting cleaned up

New Contributor

I'm also having this issue on 5.7 while executing a Spark action through Oozie. Any thoughts on where to start looking?

Re: Hive staging directory not getting cleaned up

Contributor

Can you please provide additional details on what the use case is? Are you using the Oozie Hive 1 action or the Hive 2 action? Are these jobs failing? Please provide us with a brief reproducer if you can. Thank you.

Re: Hive staging directory not getting cleaned up

Contributor

Hi,

 

We run Hive queries using a Beeline action through an Oozie workflow.


Re: Hive staging directory not getting cleaned up

I just start a daemon thread with scheduleAtFixedRate that cleans up these "empty" staging directories (the ones that contain a _SUCCESS file), and another thread that runs the Hive command ALTER TABLE ... CONCATENATE:

 

import java.util.Date
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// true if the directory already contains a _SUCCESS marker file
def directoryHasSuccess(path: Path, fs: FileSystem): Boolean =
  fs.listStatus(path).exists(_.getPath.getName == "_SUCCESS")

Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    val status = fs.listStatus(new Path(s"hdfs://nameservice/user/xxx/warehouse/$tableName/"))
    status.foreach { stat =>
      // only touch near-empty .hive-staging directories (< 1 KB consumed)
      if (stat.isDirectory && stat.getPath.getName.contains("hive-staging") &&
          fs.getContentSummary(stat.getPath).getSpaceConsumed < 1024) {
        println("empty path: " + stat.getPath)
        // delete right away if the job has already written its _SUCCESS marker
        if (directoryHasSuccess(stat.getPath, fs)) {
          fs.delete(stat.getPath, true)
        }
        // otherwise delete directories untouched for more than 5 minutes
        val now = new Date().getTime
        if (now - stat.getModificationTime > 5 * 60 * 1000 &&
            now - stat.getAccessTime > 5 * 60 * 1000) {
          println("delete path " + stat.getPath)
          fs.delete(stat.getPath, true)
        }
      }
    }
  }
}, 5, interval, TimeUnit.SECONDS)
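
The second thread mentioned above could look like this; a minimal sketch, assuming hiveContext is an existing HiveContext (Spark 1.x passes such native commands through to Hive), the table is a non-partitioned ORC table (CONCATENATE only works on ORC, and partitioned tables need a PARTITION spec), and tableName/interval are the same values as above:

Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    // merge the table's small files left behind by frequent inserts
    hiveContext.sql(s"ALTER TABLE $tableName CONCATENATE")
  }
}, 5, interval, TimeUnit.SECONDS)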

 

 

Re: Hive staging directory not getting cleaned up

New Contributor

Hi everybody,

 

I experience the same issue on CDH 5.5 (Spark 1.6.0) with my Spark Streaming job. Data is read from a Kafka broker and then inserted into a Hive table, partitioned by year/month/day/hour. All of the data is present in the table after the insertInto() call, but the 'hive-staging....' directory created during the batch is still there, and empty...
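
For context, a minimal sketch of the write path described above (Spark 1.6 API; the DataFrame batchDF and table events are hypothetical, and the table's partition columns year/month/day/hour are assumed to be the last columns of batchDF):

import org.apache.spark.sql.SaveMode

// allow each micro-batch to create year/month/day/hour partitions dynamically
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
batchDF.write.mode(SaveMode.Append).insertInto("events")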

 

The resources are allocated by YARN, and there are no error logs about file creation/deletion in the executor logs. I have tested a lot of settings without any success (regarding log persistence etc.).

 

The micro-batch runs every 10 seconds, so the job produces a lot of useless empty directories.

Re: Hive staging directory not getting cleaned up

New Contributor
Still have the problem on CDH 5.7.

Re: Hive staging directory not getting cleaned up

I have the same problem on CDH 5.4.7, after a streaming job with HiveContext.
