Created on 08-15-2016 02:04 PM - edited 09-16-2022 03:34 AM
In CDH 5.8.0, when inserting data with spark-sql, many .hive-staging directories pile up and are never deleted, even though the inserts complete successfully.
Please let me know the reason for this behaviour and how I can get rid of the .hive-staging directories. Is there a property we need to set?
Created 08-27-2016 12:40 AM
Hi Jais,
Can you please let me know where you run your Hive query? Do you run it through Hue?
If you run through Hue, in most cases the staging directory will be left over even after the query finishes. This is because Hue holds the query handle open so that users can get back to the results, and the cleanup of staging directories is only triggered when the query handle is closed.
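For reference, outside of Hue the cleanup happens as soon as the client closes its handle. Here is a minimal sketch over a HiveServer2 JDBC connection (the URL and table names are placeholders, not from your setup):

import java.sql.DriverManager

// Placeholder URL and tables. Closing the statement and connection releases
// the query handle, which is what triggers HiveServer2 to remove the
// .hive-staging directory created for the query.
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default")
val stmt = conn.createStatement()
try {
  stmt.execute("INSERT INTO TABLE target_table SELECT * FROM source_table")
} finally {
  stmt.close()
  conn.close()
}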
So first thing I would like to check is where you run your Hive query.
Thanks
Created 08-31-2016 08:18 AM
I'm also having this issue on 5.7 while executing a Spark action through Oozie. Any thoughts on where to start looking?
Created 09-02-2016 07:25 AM
Can you please provide additional details on what the use case is? Are you using the Oozie hive action or the hive2 action? Are these jobs failing? Please provide us a brief reproducer if you can. Thank you.
Created 02-13-2017 04:24 PM
Hi,
We run Hive queries using a beeline action through an Oozie workflow.
Created 02-13-2017 07:14 PM
I just start a daemon thread via scheduleAtFixedRate that cleans up these "empty" staging directories (the ones holding at most a _SUCCESS marker file), and another thread that runs the Hive command "alter table xxx concatenate":
import java.util.Date
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// tableName and interval (seconds between runs) come from the enclosing job.
// Returns true if the directory contains a _SUCCESS marker file.
def directoryHasSuccess(dir: Path, fs: FileSystem): Boolean =
  fs.exists(new Path(dir, "_SUCCESS"))

Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    val status = fs.listStatus(new Path(s"hdfs://nameservice/user/xxx/warehouse/$tableName/"))
    status.foreach { stat =>
      // Only look at near-empty .hive-staging directories (< 1 KB on disk).
      if (stat.isDirectory && stat.getPath.getName.contains("hive-staging") &&
          fs.getContentSummary(stat.getPath).getSpaceConsumed < 1024) {
        println("empty path: " + stat.getPath)
        if (directoryHasSuccess(stat.getPath, fs)) {
          // The write completed, so the leftover staging directory is safe to drop.
          fs.delete(stat.getPath, true)
        } else {
          // Otherwise only drop staging directories untouched for 5+ minutes.
          val now = new Date().getTime
          if (now - stat.getModificationTime > 5 * 60 * 1000 &&
              now - stat.getAccessTime > 5 * 60 * 1000) {
            println("delete path " + stat.getPath)
            fs.delete(stat.getPath, true)
          }
        }
      }
    }
  }
}, 5, interval, TimeUnit.SECONDS)
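And a rough sketch of the second thread I mentioned (assuming a HiveServer2 JDBC connection; the host/port are placeholders, and note that CONCATENATE only applies to ORC and RCFile tables):

import java.sql.DriverManager
import java.util.concurrent.{Executors, TimeUnit}

// Periodically merges the small files that frequent inserts produce.
// The JDBC URL is a placeholder; tableName and interval are as above.
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default")
    try {
      conn.createStatement().execute(s"ALTER TABLE $tableName CONCATENATE")
    } finally {
      conn.close()
    }
  }
}, 5, interval, TimeUnit.SECONDS)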
Created 03-05-2021 07:29 AM
Hi, does anyone have a solid answer for this? How can we get rid of this issue? We have thousands of such folders created.
Created 03-12-2021 02:39 AM
Follow https://community.cloudera.com/t5/Support-Questions/Hive-staging-directory-not-getting-cleaned-up/td... if you are running a large number of queries through Hue.
Created 11-02-2016 12:36 PM
Hi everybody,
I'm experiencing the same issue on CDH 5.5 (Spark 1.6.0) with my Spark Streaming job. Data is read from a Kafka broker and then inserted into a Hive table, partitioned by year/month/day/hour. All the data is present in the table after the insertInto() call, but the 'hive-staging....' directory created during the batch is still there, and empty...
The resources are allocated by YARN, and there are no error logs about file creation/deletion in the executor logs. I have tested a lot of settings without any success (regarding log persistence, etc.).
The micro-batch runs every 10 seconds, so the job produces a lot of useless empty directories.
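For context, the write path looks roughly like this (a minimal sketch using the Spark 1.6 API; the table names are placeholders, and batchDf stands in for the DataFrame my job actually builds from each Kafka micro-batch):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("streaming-insert-sketch"))
val hiveContext = new HiveContext(sc)
// Needed for the dynamic partitioning by year/month/day/hour.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Stand-in for the per-batch DataFrame; it must end with the year/month/day/hour
// columns so they map onto the target table's partition layout.
val batchDf = hiveContext.table("mydb.events_source")
// Each call like this leaves one .hive-staging directory under the table path.
batchDf.write.mode("append").insertInto("mydb.events")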
Created 11-07-2016 07:49 AM