Deleting Directory in HDFS using Spark

Contributor

Can someone provide the code snippet to delete a directory in HDFS using Spark/Spark-Streaming?

I am using Spark Streaming to process some incoming data. Because it works in micro-batches, it leaves empty directories behind in HDFS, so I want a cleanup job that can delete them.

Any other suggestions are welcome as well. The solution needs to be in Java.

1 ACCEPTED SOLUTION

Have you tried avoiding the folders with empty files in the first place?

As an idea, instead of using

<DStream>
.saveAsTextFiles("/tmp/results/ts", "json");

(which creates folders with empty files if nothing gets streamed from the source), I tried

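<DStream>
.foreachRDD(rdd => {
  try {
    rdd.first() // throws UnsupportedOperationException for empty RDDs
    // Non-empty batch: write it to its own timestamped output path
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    // Catch only the empty-RDD case so that real write failures are not swallowed
    case _: UnsupportedOperationException => println("empty rdd")
  }
})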

It seems to work for me: no more folders with empty files.

Expert Contributor

If you are writing Java code, you can delete the HDFS path with the Hadoop FileSystem class: hdfs.delete(new org.apache.hadoop.fs.Path(output), true)

In Spark you could try the approach from this thread, though I haven't tried it myself: https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAHUQ+_ZwpDpfs1DaFW9zFFzJVW1PKTQ...

Contributor

@nyadav I found that already. Any suggestions on how to delete only the directories that have no data in them and leave the ones with data in place?

Super Guru

@Gautam Marya

Can you try this?

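// Point the URI at your NameNode; its RPC port is usually 8020 (8030 is normally the YARN scheduler port)
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://sandbox.hortonworks.com:8020"), sc.hadoopConfiguration)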

fs.delete(new org.apache.hadoop.fs.Path("/tmp/xyz"), true) // recursive = true
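
Since the question asks for Java, the Scala snippet above translates roughly as follows. This is only a sketch: the names HdfsDelete and deleteRecursively are made up for illustration, and it assumes the Hadoop client libraries are on the classpath. The file system URI, configuration, and directory are passed in rather than hard-coded.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class HdfsDelete {
    private HdfsDelete() {}

    /**
     * Deletes dir and everything below it on the file system identified by fsUri.
     * Returns the result of FileSystem.delete (true if the path was deleted).
     */
    public static boolean deleteRecursively(URI fsUri, Configuration conf, String dir)
            throws IOException {
        // FileSystem.get returns a cached, shared instance, so it is not closed here.
        FileSystem fs = FileSystem.get(fsUri, conf);
        // Second argument is the "recursive" flag.
        return fs.delete(new Path(dir), true);
    }
}
```

In a Spark job you could pass the Hadoop configuration from your JavaSparkContext (its hadoopConfiguration() method) instead of creating a new Configuration. Keep in mind that with the recursive flag set, the directory is removed whether or not it holds data.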

Contributor

Does this delete only the directories that have no data in them and leave the directories with data in place? The point is to remove only the directories that have no data.
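
A recursive fs.delete removes the given directory whether or not it holds data, so the emptiness check has to be done separately. Below is a rough Java sketch of one way to do that. It assumes that "no data" means the files under a directory add up to zero bytes, which also covers directories that hold only empty part files and a _SUCCESS marker. EmptyDirCleaner and deleteEmptyDirs are hypothetical names, and only the immediate subdirectories of the given parent are checked.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class EmptyDirCleaner {
    private EmptyDirCleaner() {}

    /**
     * Deletes every immediate subdirectory of parent whose files total zero bytes.
     * Returns the number of directories that were deleted.
     */
    public static int deleteEmptyDirs(FileSystem fs, Path parent) throws IOException {
        int deleted = 0;
        for (FileStatus status : fs.listStatus(parent)) {
            if (!status.isDirectory()) {
                continue; // plain files directly under parent are left alone
            }
            // Total size in bytes of all files below this directory.
            long bytes = fs.getContentSummary(status.getPath()).getLength();
            if (bytes == 0 && fs.delete(status.getPath(), true)) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

A directory that a running batch is still writing to may also look empty at that moment, so it is probably safer to run this only against directories that are older than a few batch intervals.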

avatar

Sorry, the solution I posted is Scala code, but the Java version should work similarly.
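
A Java version of the same idea might look like the sketch below. It checks isEmpty() instead of catching the exception from first(), which is a different but similar test for an empty batch. It assumes Spark 1.6 or later with Java 8 lambdas, and NonEmptyBatchWriter, saveNonEmptyBatches and batchPath are hypothetical names.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public final class NonEmptyBatchWriter {
    private NonEmptyBatchWriter() {}

    /** Builds the output path for one batch, e.g. "/tmp/results/ts-<millis>.json". */
    static String batchPath(String prefix, long millis) {
        return prefix + "-" + millis + ".json";
    }

    /** Registers an output action that writes only the non-empty micro-batches. */
    public static void saveNonEmptyBatches(JavaDStream<String> stream, String prefix) {
        stream.foreachRDD((JavaRDD<String> rdd) -> {
            // Skip empty batches so that no directory with empty files is created.
            if (!rdd.isEmpty()) {
                rdd.saveAsTextFile(batchPath(prefix, System.currentTimeMillis()));
            }
        });
    }
}
```

It would be called once while setting up the stream, for example NonEmptyBatchWriter.saveNonEmptyBatches(stream, "/tmp/results/ts"), before starting the streaming context.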

Contributor

@Bernhard Walter Thanks man, it worked. I wrote a similar thing in Java 🙂