Created 07-18-2016 07:45 AM
Can someone provide a code snippet to delete a directory in HDFS using Spark/Spark Streaming?
I am using Spark Streaming to process incoming data. Because it works on micro-batches, it leaves empty directories behind in HDFS, so I want a cleanup job that can delete those empty directories.
Please provide any other suggestions as well; the solution needs to be in Java.
Created 07-18-2016 08:00 AM
If you are using Java, the Hadoop FileSystem class can delete an HDFS path: fs.delete(new org.apache.hadoop.fs.Path(output), true)
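A minimal sketch of that call (assuming the default Configuration picks up your cluster settings from the classpath; the path below is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml from the classpath
FileSystem fs = FileSystem.get(conf);
String output = "/tmp/results";            // example path, replace with your directory
fs.delete(new Path(output), true);         // true = recursive: removes the directory and its contents
fs.close();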
In Spark you may try the approach in the link below; I haven't tried it myself though. https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAHUQ+_ZwpDpfs1DaFW9zFFzJVW1PKTQ...
Created 07-18-2016 08:02 AM
@nyadav I found that already, any suggestions on how to delete the directories that have no data in them and leave the ones behind with data?
Created 07-18-2016 08:14 AM
Can you try this?
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://sandbox.hortonworks.com:8020"), sc.hadoopConfiguration)
fs.delete(new org.apache.hadoop.fs.Path("/tmp/xyz"), true) // recursive = true
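Note that fs.delete removes the given path unconditionally. If you only want to clear out directories that hold no data, one option (an untested Java sketch; the parent directory /tmp/results is an assumption) is to check each directory's total content size first:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
// Assumption: the streaming job writes its output directories under /tmp/results
for (FileStatus entry : fs.listStatus(new Path("/tmp/results"))) {
    // getContentSummary().getLength() is 0 when the directory contains no bytes,
    // which also covers directories holding only zero-byte part files
    if (entry.isDirectory() && fs.getContentSummary(entry.getPath()).getLength() == 0) {
        fs.delete(entry.getPath(), true);
    }
}
fs.close();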
Created 07-18-2016 08:25 AM
Does this delete the directories that have no data in them and leave the directories with data in them? The point is to only remove directories that have no data.
Created 07-18-2016 09:27 AM
Have you tried to avoid folders with empty files?
As an idea, instead of using
<DStream>.saveAsTextFiles("/tmp/results/ts", "json")
(which creates folders with empty files if nothing gets streamed from the source), I tried
<DStream>.foreachRDD(rdd => {
  try {
    val f = rdd.first() // fails for empty RDDs
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    case e: Exception => println("empty rdd")
  }
})
It seems to work for me. No folders with empty files any more.
Created 07-18-2016 09:30 AM
Sorry, it's Scala code, but Java should work similarly.
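For reference, here is a rough Java translation of the same idea (a sketch, assuming a JavaDStream<String> named stream and Java 8 lambdas):

import org.apache.spark.streaming.api.java.JavaDStream;

// Assumption: "stream" is the JavaDStream<String> produced by your streaming job
stream.foreachRDD(rdd -> {
    try {
        rdd.first(); // throws for empty RDDs, like the Scala version
        rdd.saveAsTextFile("/tmp/results/ts-" + System.currentTimeMillis() + ".json");
    } catch (Exception e) {
        System.out.println("empty rdd");
    }
});

As in the Scala version, catching the exception from first() on an empty RDD is what skips writing an output directory for empty micro-batches.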
Created 07-19-2016 07:41 AM
@Bernhard Walter Thanks man, it worked. I wrote a similar thing in Java 🙂