I have a lot of text files in a repository which I want to filter using Spark. After filtering, I want the same number of filtered files as output (for example, if I give 1000 files as input, I want the corresponding 1000 filtered files as output). I also want each output file to retain the order of lines as it was in the input.
I want to do it in the fastest way possible.
From what I understand, if I break the files into lines and process each line in a mapper, then I run into the problem of combining the lines back per file and sorting them into their original order in the reducer step. I am wondering if this is the right approach.
I am new to Spark, so I am not sure of the best way to do this. Any ideas?
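To make the question concrete, here is roughly what I have in mind: treating each file as a single record with `wholeTextFiles` so the per-file grouping and line order never get lost. The keyword filter, paths, and the `write_one` helper are just placeholders I made up for illustration, and the write step assumes a filesystem visible to the executors (e.g. local mode). Is something like this a reasonable approach?

```python
import os


def filter_lines(content, keyword):
    # Keep only the lines containing the keyword. Order within the file
    # is preserved because the whole file is processed as one string.
    return "\n".join(line for line in content.splitlines() if keyword in line)


def write_one(path, content, output_dir):
    # Hypothetical helper: write the filtered content of one input file
    # to a same-named file under output_dir.
    out_path = os.path.join(output_dir, os.path.basename(path))
    with open(out_path, "w") as f:
        f.write(content)


def run(sc, input_dir, output_dir, keyword):
    # wholeTextFiles yields (path, full_content) pairs, one per file,
    # so each file stays a single record instead of being split into lines.
    files = sc.wholeTextFiles(input_dir)
    filtered = files.mapValues(lambda content: filter_lines(content, keyword))
    # Write one output file per input file on the executors.
    filtered.foreach(lambda kv: write_one(kv[0], kv[1], output_dir))
```

I realize `wholeTextFiles` loads each file fully into memory, so maybe this only works if the individual files are small enough, but I'm not sure what the alternative would be.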