Hi
After a bit of searching, I found that I can write each DStream RDD to a specified path using the saveAsTextFile method within the foreachRDD action. The problem is that this writes one file per partition of the RDD to that location. If the RDD has 3 partitions, you will have something like:
- part-0000
- part-0001
- part-0002
and these would be overwritten when the next batch starts, meaning that if the following batch has 1 partition, files 0001 and 0002 will be deleted and 0000 will be overwritten with the new data. I have seen that people have written code to merge these files. As I wanted to keep the data for each batch and did not want to lose any of it, I specified the path as follows:
fileIDs.foreachRDD(rdd => rdd.saveAsTextFile("/home/arsalan/SparkRDDData/" + ssc.sparkContext.applicationId + "/" + System.currentTimeMillis()))
This way it creates a new folder for each batch. Later I can get the data for each batch and don't have to worry about finding ways to avoid overwriting the files.
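
In case it helps anyone, here is a slightly fuller sketch of the same idea. It uses the two-argument form of foreachRDD, which also passes the batch Time, so the folder name is the batch time rather than the wall-clock time at which the save happens to run. It also skips empty batches. The socket source, the local[2] master, the 10 second batch interval, the object name and the batchOutputPath helper are my own assumptions for illustration; they are not from the setup described above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object PerBatchOutput {
  // Hypothetical helper: builds one output directory per batch,
  // in the form <baseDir>/<applicationId>/<batch time in ms>.
  def batchOutputPath(baseDir: String, appId: String, batchTimeMs: Long): String =
    s"$baseDir/$appId/$batchTimeMs"

  def main(args: Array[String]): Unit = {
    // Assumed configuration, for local experimentation only.
    val conf = new SparkConf().setAppName("PerBatchOutput").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Assumed source: the post does not say where fileIDs comes from.
    val fileIDs = ssc.socketTextStream("localhost", 9999)

    val baseDir = "/home/arsalan/SparkRDDData"
    val appId = ssc.sparkContext.applicationId

    // The (RDD, Time) overload of foreachRDD exposes the batch time.
    fileIDs.foreachRDD { (rdd: RDD[String], time: Time) =>
      // Skip empty batches so no folder of empty part files is written.
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(batchOutputPath(baseDir, appId, time.milliseconds))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If you do not need control over the folder layout, DStream also has a built-in saveAsTextFiles(prefix, suffix) output operation that, as far as I understand from the documentation, names each batch's output "prefix-TIME_IN_MS.suffix", which should give a similar one-folder-per-batch result.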