
Deleting Directory in HDFS using Spark

Explorer

Can someone provide the code snippet to delete a directory in HDFS using Spark/Spark-Streaming?

I am using Spark Streaming to process some incoming data. Because it works on micro-batches, it leaves behind blank directories in HDFS, so I want a cleanup job that can delete the empty directories.

Please provide any other suggestions as well; the solution needs to be in Java.


7 REPLIES

Rising Star

If you are using Java, the Hadoop FileSystem class can delete an HDFS path: hdfs.delete(new org.apache.hadoop.fs.Path(output), true)

In Spark you may try the approach below; I haven't tried it myself though. https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAHUQ+_ZwpDpfs1DaFW9zFFzJVW1PKTQ...
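
Roughly, in Java it would be something like this (an untested sketch; the NameNode URI hdfs://namenode:8020 and the /tmp/results path are placeholders for your cluster):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDelete {
  public static void main(String[] args) throws Exception {
    // Placeholder NameNode URI; use your cluster's fs.defaultFS.
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
    // true = recursive, i.e. delete the directory and everything under it.
    boolean deleted = fs.delete(new Path("/tmp/results"), true);
    System.out.println("Deleted: " + deleted);
    fs.close();
  }
}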

Explorer

@nyadav I found that already. Any suggestions on how to delete only the directories that have no data in them and leave the ones with data behind?

@Gautam Marya

Can you try this?

val fs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://sandbox.hortonworks.com:8030"),
  sc.hadoopConfiguration)

fs.delete(new org.apache.hadoop.fs.Path("/tmp/xyz"), true) // recursive = true

Explorer

Does this delete the directories that have no data in them and leave the directories with data? The point is to remove only the directories that have no data.
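
For reference, the cleanup job I have in mind would look roughly like this in Java (an untested sketch; the root path /tmp/results and NameNode URI hdfs://namenode:8020 are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyDirCleanup {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
    for (FileStatus status : fs.listStatus(new Path("/tmp/results"))) {
      if (!status.isDirectory()) continue;
      // Treat a directory as empty if it holds zero bytes of data
      // (e.g. only zero-length part files and _SUCCESS markers).
      if (fs.getContentSummary(status.getPath()).getLength() == 0) {
        fs.delete(status.getPath(), true); // recursive
      }
    }
    fs.close();
  }
}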

ACCEPTED SOLUTION

Have you tried to avoid folders with empty files?

As an idea, instead of using

<DStream>
.saveAsTextFiles("/tmp/results/ts", "json");

(which creates folders with empty files if nothing gets streamed from the source), I tried

<DStream>
.foreachRDD(rdd => {
  try {
    rdd.first() // throws for empty RDDs, so nothing gets written
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    case e: Exception => println("empty rdd")
  }
})

It seems to work for me. No folders with empty files any more.

Sorry, it's Scala code, but Java should work similarly.

Explorer

@Bernhard Walter Thanks man, it worked. I wrote a similar thing in Java 🙂
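
For anyone who lands here later, the Java version looks roughly like this (a sketch; the `stream` parameter stands for the JavaDStream<String> built upstream, and rdd.isEmpty() replaces the try/catch around first()):

import org.apache.spark.streaming.api.java.JavaDStream;

public class StreamSink {
  // Hypothetical helper: pass in the JavaDStream<String> built upstream.
  public static void saveNonEmptyBatches(JavaDStream<String> stream, String basePath) {
    stream.foreachRDD(rdd -> {
      // Skip empty micro-batches so no folders with empty files are written.
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(basePath + "/ts-" + System.currentTimeMillis() + ".json");
      }
    });
  }
}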
