Hive staging directory not getting cleaned up

avatar
Expert Contributor

In CDH 5.8.0, when inserting data with spark-sql, many .hive-staging directories pile up and are never deleted or removed, even though the inserts complete successfully.

 

Please let me know the reason for this behaviour and how I can get rid of the .hive-staging directories. Is there any property we need to set?
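
Something along these lines is what I have in mind (a rough sketch only, assuming hive.exec.stagingdir is honoured by this Spark/Hive combination, which I have not verified; the application name, scratch path and table names below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("staging-dir-example"))
val hiveContext = new HiveContext(sc)

// Assumption: point the staging prefix at a scratch location instead of the
// table directory, so any leftover .hive-staging directories end up in one
// place that can be purged separately.
hiveContext.setConf("hive.exec.stagingdir", "/tmp/hive-staging/.hive-staging")

// The insert itself is unchanged.
hiveContext.sql("INSERT INTO TABLE my_db.my_table SELECT * FROM my_db.source_table")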

11 REPLIES

avatar
Super Guru

Hi Jais,

 

Can you please let me know where you run your Hive query? Do you run it through Hue?

 

If you run it through Hue, in most cases the staging directory will be left over even after the query finishes. This is because Hue holds the query handle open so that users can get back to it, and the cleanup of staging directories is only triggered when the query handle is closed.
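
The same idea applies when a query is submitted programmatically to HiveServer2 over JDBC: the staging directories can only be cleaned up once the statement and session are closed. A rough sketch (host, credentials and table names are placeholders):

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
// Placeholders for host, user and database.
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "user", "")
try {
  val stmt = conn.createStatement()
  try {
    stmt.execute("INSERT INTO TABLE my_table SELECT * FROM source_table")
  } finally {
    stmt.close()   // operation handle closed, so its staging directory can be removed
  }
} finally {
  conn.close()     // session closed, so session-level scratch directories are removed
}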

 

So the first thing I would like to check is where you run your Hive query.

 

Thanks

avatar
New Contributor

I'm also having this issue on 5.7 while executing a Spark action through Oozie. Any thoughts on where to start looking?

avatar
Rising Star

Can you please provide additional details on what the use case is? Are you using the Oozie hive1 action or the hive2 action? Are these jobs failing? Please provide a brief reproducer if you can. Thank you.

avatar
Expert Contributor

Hi,

 

We run Hive queries using a Beeline action through an Oozie workflow.

avatar

I just start a daemon thread with scheduleAtFixedRate that cleans up these staging directories when they are "empty" or contain only a _SUCCESS file, and another thread that runs the Hive command "alter table xxx concatenate".

 

import java.util.Date
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// tableName, interval and directoryHasSuccess (checks for a _SUCCESS marker file)
// are defined elsewhere in the job.
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    val status = fs.listStatus(new Path(s"hdfs://nameservice/user/xxx/warehouse/$tableName/"))
    status.foreach { stat =>
      // Only consider small (effectively empty) hive-staging directories.
      if (stat.isDirectory && stat.getPath.getName.contains("hive-staging") &&
          fs.getContentSummary(stat.getPath).getSpaceConsumed < 1024) {
        println("empty path : " + stat.getPath)
        // Delete right away if the directory only holds a _SUCCESS marker.
        if (directoryHasSuccess(stat.getPath, fs)) {
          fs.delete(stat.getPath, true)
        }
        // Otherwise delete it once it has been untouched for 5 minutes.
        val now = new Date().getTime
        if (now - stat.getModificationTime > 5 * 60 * 1000 &&
            now - stat.getAccessTime > 5 * 60 * 1000) {
          println("delete path " + stat.getPath)
          fs.delete(stat.getPath, true)
        }
      }
    }
  }
}, 5, interval, TimeUnit.SECONDS)

 

 

avatar
New Contributor

Hi, does anyone have a solid answer for this? How can we get rid of this issue? We have thousands of such folders created.

avatar
Guru

avatar
New Contributor

Hi everybody,

 

I experience the same issue on CDH 5.5 (Spark 1.6.0) with my Spark Streaming job. Data is read from a Kafka broker and then inserted into a Hive table, partitioned by year/month/day/hour. All the data is present in the table after the insertInto() call, but the '.hive-staging...' directory created during the batch is still there, and empty...

 

The resources are allocated by YARN, and there are no error messages about file creation/deletion in the executor logs. I have tested a lot of settings without any success (regarding log persistence etc.).

 

The micro-batch runs every 10 seconds... so the job produces a lot of useless empty directories.
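
For reference, the write path looks roughly like this (Kafka setup omitted; table, column names and partition values are placeholders). Each insertInto() call per micro-batch creates its own .hive-staging directory under the table path, which is what keeps piling up:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

def writeBatches(events: DStream[(String, String)], hiveContext: HiveContext): Unit = {
  import hiveContext.implicits._

  // Needed for the dynamic partition insert (year/month/day/hour are the
  // trailing columns of the DataFrame).
  hiveContext.setConf("hive.exec.dynamic.partition", "true")
  hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

  events.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // Placeholder partition values; the real job derives them from the event time.
      val df = rdd.map { case (k, v) => (k, v, 2016, 10, 21, 7) }
        .toDF("key", "value", "year", "month", "day", "hour")
      df.write
        .mode(SaveMode.Append)
        .insertInto("my_db.events_by_hour")
      // Data is visible in the table after this call, but the per-batch
      // .hive-staging-* directory may be left behind, as described above.
    }
  }
}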

avatar
New Contributor

Still have the problem on CDH 5.7.