We are running a Spark streaming job that retrieves files from a directory (using textFileStream). One concern we are having is the case where the job is down but files are still being added to the directory. Once the job starts up again, those files are not being picked up (since they are not new or changed while the job is running) but we would like them to be processed.
1) Is there a solution for that? Is there a way to keep track what files have been processed and can we "force" older files to be picked up?
2) Is there a way to delete the processed files?
For 1, you can enable checkpointing: https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing. Be careful since in older spark versions (1.X and early 2.X), checkpointing works only if the code is not changed: ie. if you change your code, before re-submitting the application with the new code you have to delete the checkpoint directory (which means that you will loose data exactly as you are experiencing now).
For 2, you have to do that on your own. In your Spark application you can collect the names of the files you have processed and then you can delete them. Be careful to delete them only when you are sure that Spark has actually processed them.