I have a lot of text files in a repository that I want to filter using Spark. After filtering, I want the same number of filtered files as output (for example, if I give 1000 files as input, I want 1000 corresponding filtered files as output). I also want each output file to retain the order of lines from its input file.
I want to do it in the fastest way possible.
From what I understand, if I break the files into lines and process each line in a mapper, I then run into the problem of combining the lines back, sorting them, and grouping them by file in the reducer step. I am wondering if this is the right approach.
I am new to Spark, so I am not sure of the best way to do this. Any ideas?
It actually depends on the size of your files. If they are not excessively large, you might consider doing:
sc.wholeTextFiles("/path/to/your/files").map(f => /* your filter logic here, f._2 is the content of a file */)
This way, every file is read as a single element of your RDD, and you can manipulate it and do whatever you want, without needing to combine lines back together or reorder them.
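A minimal sketch of the whole pipeline, assuming the input and output paths and the `keepLine` predicate are placeholders you would replace with your own (for HDFS output you would use the Hadoop `FileSystem` API on the executors instead of `java.nio.file`):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import java.nio.file.{Files, Paths}

object FilterFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-files"))

    // Hypothetical filter predicate -- substitute your real logic.
    def keepLine(line: String): Boolean = !line.contains("DEBUG")

    sc.wholeTextFiles("/path/to/input")        // RDD[(filePath, fileContent)]
      .map { case (path, content) =>
        val name = path.split('/').last
        // Filtering the lines of one file in place preserves their order.
        val filtered = content.split("\n").filter(keepLine).mkString("\n")
        (name, filtered)
      }
      .foreach { case (name, filtered) =>
        // One output file per input file, same name. This runs on the
        // executors, so "/path/to/output" must be a shared filesystem.
        Files.write(Paths.get("/path/to/output", name),
                    filtered.getBytes("UTF-8"))
      }

    sc.stop()
  }
}
```

Because each element of the RDD is an entire file, the "same number of files, same line order" requirement falls out naturally: no shuffle or reduce step ever splits a file apart. The caveat is that `wholeTextFiles` loads each file fully into memory on one executor, which is why it is only advisable when individual files are not too large.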