02-08-2018 03:35 PM
For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful: in older Spark versions (1.x and early 2.x), checkpointing works only if the code has not changed, i.e. if you change your code, you have to delete the checkpoint directory before re-submitting the application with the new code (which means that you will lose data, exactly as you are experiencing now).

For 2, you have to do that on your own. In your Spark application you can collect the names of the files you have processed and then delete them. Be careful to delete them only when you are sure that Spark has actually processed them.
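
For 2, here is a rough sketch of the "delete only what you are sure about" idea in plain Python. It is only an illustration, not something Spark provides: the function name `delete_processed_files` and the `is_fully_processed` callback are made up for this example, and it assumes the files sit on a local filesystem (on HDFS you would do the same thing through the Hadoop FileSystem API instead of `os.remove`). How you decide that a file is really done, for example after the batch output has been written successfully, is up to your application.

```python
import os
from typing import Callable, Iterable, List


def delete_processed_files(
    candidates: Iterable[str],
    is_fully_processed: Callable[[str], bool],
) -> List[str]:
    """Delete only the files confirmed as processed; return the deleted paths.

    Both names are hypothetical: `candidates` is the list of file names your
    application collected, and `is_fully_processed` is whatever check you
    trust to confirm that a file has really been handled.
    """
    deleted = []
    for path in candidates:
        # Skip anything that cannot be confirmed: keeping an extra file is
        # harmless, deleting an unprocessed one loses data.
        if not is_fully_processed(path):
            continue
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone (e.g. removed by an earlier run); nothing to do.
            continue
        deleted.append(path)
    return deleted
```

You would call this at the end of a batch, passing the names you collected and a check backed by your own bookkeeping (a set of committed file names, a marker table, etc.). Files that fail the check are simply left in place for the next run.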