Deleting Directory in HDFS using Spark
Labels: Apache Hadoop, Apache Spark
Created 07-18-2016 07:45 AM
Can someone provide a code snippet to delete a directory in HDFS using Spark / Spark Streaming?
I am using Spark Streaming to process some incoming data, and because it works on micro-batching it leaves blank directories in HDFS, so I want a clean-up job that can delete these empty directories.
Please provide any other suggestions as well; the solution needs to be in Java.
Created 07-18-2016 08:00 AM
If you are using Java code, the Hadoop FileSystem class can delete an HDFS path:
hdfs.delete(new org.apache.hadoop.fs.Path(output), true)
In Spark you may try the approach below; I haven't tried it myself though. https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAHUQ+_ZwpDpfs1DaFW9zFFzJVW1PKTQ...
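For reference, a minimal self-contained sketch of that Java call (the NameNode URI and the output path are placeholders for this example, not values from the thread):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteHdfsPath {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and output path; adjust for your cluster
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
        String output = "/tmp/results";
        hdfs.delete(new Path(output), true); // true = delete recursively
        hdfs.close();
    }
}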
Created 07-18-2016 08:02 AM
@nyadav I found that already. Any suggestions on how to delete only the directories that have no data in them and keep the ones that do?
Created 07-18-2016 08:14 AM
Can you try this?
val fs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://sandbox.hortonworks.com:8020"), // NameNode RPC port (8020 on the sandbox, not 8030)
  sc.hadoopConfiguration)
fs.delete(new org.apache.hadoop.fs.Path("/tmp/xyz"), true) // recursive = true
Created 07-18-2016 08:25 AM
Does this delete only the directories that have no data in them and leave the directories with data untouched? The point is to remove only directories that have no data.
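For reference, a minimal sketch of such a clean-up pass in Java (the NameNode URI and root path are placeholders; "no data" is taken here to mean a total content length of zero via getContentSummary, which also catches directories holding only zero-byte part files):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyDirCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and root path; adjust for your cluster
        FileSystem fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com:8020"), new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/tmp/results"))) {
            // Delete a subdirectory only if it contains zero bytes of data
            if (status.isDirectory() && fs.getContentSummary(status.getPath()).getLength() == 0) {
                fs.delete(status.getPath(), true); // true = recursive
            }
        }
        fs.close();
    }
}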
Created 07-18-2016 09:27 AM
Have you tried avoiding the folders with empty files in the first place?
As an idea, instead of using
<DStream>.saveAsTextFiles("/tmp/results/ts", "json")
(which creates folders with empty files if nothing gets streamed from the source), I tried
<DStream>.foreachRDD(rdd => {
  try {
    val f = rdd.first() // fails for empty RDDs
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    case e: Exception => println("empty rdd")
  }
})
It seems to work for me: no folders with empty files any more.
Created 07-18-2016 09:30 AM
Sorry, it's Scala code, but Java should work similarly.
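A rough sketch of the same idea on the Java side (dstream stands for your JavaDStream<String>; rdd.isEmpty(), available since Spark 1.3, replaces the try/catch around first()):

// dstream is assumed to be a JavaDStream<String> from your streaming job
dstream.foreachRDD(rdd -> {
    // Skip empty micro-batches instead of catching the exception from first()
    if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/tmp/results/ts-" + System.currentTimeMillis() + ".json");
    }
});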
Created 07-19-2016 07:41 AM
@Bernhard Walter Thanks man, it worked. I wrote a similar thing in Java 🙂
