Created on 08-15-2016 02:04 PM - edited 09-16-2022 03:34 AM
In CDH 5.8.0, when inserting data with spark-sql, many .hive-staging directories pile up and are never deleted, even though the inserts complete successfully.
Please let me know the reason for this behaviour and how I can get rid of the .hive-staging directories. Is there a property we need to set?
Created 08-27-2016 12:40 AM
Hi Jais,
Can you please let me know where you run your Hive query? Do you run it through Hue?
If you run through Hue, in most cases the staging directory will be left over even after the query finishes. This is because Hue holds the query handle open so that users can get back to the results, and the cleanup of staging directories is only triggered when the query handle is closed.
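For reference, outside of Hue the cleanup happens as soon as the client closes its handle. Here is a minimal sketch over a HiveServer2 JDBC connection (the URL and table names are placeholders, not from your setup):

import java.sql.DriverManager

// Placeholder URL and tables. Closing the statement and connection releases
// the query handle, which is what triggers HiveServer2 to remove the
// .hive-staging directory created for the query.
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default")
val stmt = conn.createStatement()
try {
  stmt.execute("INSERT INTO TABLE target_table SELECT * FROM source_table")
} finally {
  stmt.close()
  conn.close()
}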
So first thing I would like to check is where you run your Hive query.
Thanks
Created 08-31-2016 08:18 AM
I'm also having this issue on 5.7 while executing a Spark action through Oozie. Any thoughts on where to start looking?
Created 09-02-2016 07:25 AM
Can you please provide additional details on what the use case is? Are you using the Oozie hive action or the hive2 action? Are these jobs failing? Please provide us a brief reproducer if you can. Thank you.
Created 02-13-2017 04:24 PM
Hi,
We run Hive queries using a beeline action through an Oozie workflow.
Created 02-13-2017 07:14 PM
I just start a daemon thread via scheduleAtFixedRate that cleans up these "empty" staging directories (the ones holding at most a _SUCCESS marker file), and another thread that runs the Hive command "alter table xxx concatenate":
import java.util.Date
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// tableName and interval (seconds between runs) come from the enclosing job.
// Returns true if the directory contains a _SUCCESS marker file.
def directoryHasSuccess(dir: Path, fs: FileSystem): Boolean =
  fs.exists(new Path(dir, "_SUCCESS"))

Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    val status = fs.listStatus(new Path(s"hdfs://nameservice/user/xxx/warehouse/$tableName/"))
    status.foreach { stat =>
      // Only look at near-empty .hive-staging directories (< 1 KB on disk).
      if (stat.isDirectory && stat.getPath.getName.contains("hive-staging") &&
          fs.getContentSummary(stat.getPath).getSpaceConsumed < 1024) {
        println("empty path: " + stat.getPath)
        if (directoryHasSuccess(stat.getPath, fs)) {
          // The write completed, so the leftover staging directory is safe to drop.
          fs.delete(stat.getPath, true)
        } else {
          // Otherwise only drop staging directories untouched for 5+ minutes.
          val now = new Date().getTime
          if (now - stat.getModificationTime > 5 * 60 * 1000 &&
              now - stat.getAccessTime > 5 * 60 * 1000) {
            println("delete path " + stat.getPath)
            fs.delete(stat.getPath, true)
          }
        }
      }
    }
  }
}, 5, interval, TimeUnit.SECONDS)
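And a rough sketch of the second thread I mentioned (assuming a HiveServer2 JDBC connection; the host/port are placeholders, and note that CONCATENATE only applies to ORC and RCFile tables):

import java.sql.DriverManager
import java.util.concurrent.{Executors, TimeUnit}

// Periodically merges the small files that frequent inserts produce.
// The JDBC URL is a placeholder; tableName and interval are as above.
Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default")
    try {
      conn.createStatement().execute(s"ALTER TABLE $tableName CONCATENATE")
    } finally {
      conn.close()
    }
  }
}, 5, interval, TimeUnit.SECONDS)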
Created 03-05-2021 07:29 AM
Hi, does anyone have a solid answer for this? How can we get rid of this issue? We have thousands of such folders created.
Created 03-12-2021 02:39 AM
Follow https://community.cloudera.com/t5/Support-Questions/Hive-staging-directory-not-getting-cleaned-up/td... if you are running a large number of queries through Hue.
Created 11-02-2016 12:36 PM
Hi everybody,
I'm experiencing the same issue on CDH 5.5 (Spark 1.6.0) with my Spark Streaming job. Data is read from a Kafka broker and then inserted into a Hive table, partitioned by year/month/day/hour. All the data is present in the table after the insertInto() call, but the 'hive-staging....' directory created during the batch is still there, and empty...
The resources are allocated by YARN, and there are no error logs about file creation/deletion in the executor logs. I have tested a lot of settings without any success (regarding log persistence, etc.).
The micro-batch runs every 10 seconds, so the job produces a lot of useless empty directories.
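For context, the write path looks roughly like this (a minimal sketch using the Spark 1.6 API; the table names are placeholders, and batchDf stands in for the DataFrame my job actually builds from each Kafka micro-batch):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("streaming-insert-sketch"))
val hiveContext = new HiveContext(sc)
// Needed for the dynamic partitioning by year/month/day/hour.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Stand-in for the per-batch DataFrame; it must end with the year/month/day/hour
// columns so they map onto the target table's partition layout.
val batchDf = hiveContext.table("mydb.events_source")
// Each call like this leaves one .hive-staging directory under the table path.
batchDf.write.mode("append").insertInto("mydb.events")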
Created 11-07-2016 07:49 AM