Deleting Directory in HDFS using Spark
Labels: Apache Hadoop, Apache Spark
Created 07-18-2016 07:45 AM
Can someone provide a code snippet to delete a directory in HDFS using Spark / Spark Streaming?
I am using Spark Streaming to process some incoming data, and because it works on micro-batching it leaves blank directories in HDFS, so I want a clean-up job that can delete these empty directories.
Please provide any other suggestions as well; the solution needs to be in Java.
Created 07-18-2016 08:00 AM
If you are using Java code, the Hadoop FileSystem class can delete an HDFS path:
hdfs.delete(new org.apache.hadoop.fs.Path(output), true)
In Spark you may try the approach below; I haven't tried it myself though. https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAHUQ+_ZwpDpfs1DaFW9zFFzJVW1PKTQ...
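For reference, a minimal self-contained sketch of that Java call (the NameNode URI and the output path are placeholders for this example, not values from the thread):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteHdfsPath {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and output path; adjust for your cluster
        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration());
        String output = "/tmp/results";
        hdfs.delete(new Path(output), true); // true = delete recursively
        hdfs.close();
    }
}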
Created 07-18-2016 08:02 AM
@nyadav I found that already. Any suggestions on how to delete only the directories that have no data in them and keep the ones that do?
Created 07-18-2016 08:14 AM
Can you try this?
val fs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://sandbox.hortonworks.com:8020"), // NameNode RPC port (8020 on the sandbox, not 8030)
  sc.hadoopConfiguration)
fs.delete(new org.apache.hadoop.fs.Path("/tmp/xyz"), true) // recursive = true
Created 07-18-2016 08:25 AM
Does this delete only the directories that have no data in them and leave the directories with data untouched? The point is to remove only directories that have no data.
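For reference, a minimal sketch of such a clean-up pass in Java (the NameNode URI and root path are placeholders; "no data" is taken here to mean a total content length of zero via getContentSummary, which also catches directories holding only zero-byte part files):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EmptyDirCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode URI and root path; adjust for your cluster
        FileSystem fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com:8020"), new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/tmp/results"))) {
            // Delete a subdirectory only if it contains zero bytes of data
            if (status.isDirectory() && fs.getContentSummary(status.getPath()).getLength() == 0) {
                fs.delete(status.getPath(), true); // true = recursive
            }
        }
        fs.close();
    }
}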
Created 07-18-2016 09:27 AM
Have you tried avoiding the folders with empty files in the first place?
As an idea, instead of using
<DStream>.saveAsTextFiles("/tmp/results/ts", "json")
(which creates folders with empty files if nothing gets streamed from the source), I tried
<DStream>.foreachRDD(rdd => {
  try {
    val f = rdd.first() // fails for empty RDDs
    rdd.saveAsTextFile(s"/tmp/results/ts-${System.currentTimeMillis}.json")
  } catch {
    case e: Exception => println("empty rdd")
  }
})
It seems to work for me: no folders with empty files any more.
Created 07-18-2016 09:30 AM
Sorry, it's Scala code, but Java should work similarly.
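A rough sketch of the same idea on the Java side (dstream stands for your JavaDStream<String>; rdd.isEmpty(), available since Spark 1.3, replaces the try/catch around first()):

// dstream is assumed to be a JavaDStream<String> from your streaming job
dstream.foreachRDD(rdd -> {
    // Skip empty micro-batches instead of catching the exception from first()
    if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/tmp/results/ts-" + System.currentTimeMillis() + ".json");
    }
});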
Created 07-19-2016 07:41 AM
@Bernhard Walter Thanks man, it worked. I wrote a similar thing in Java 🙂
