Support Questions
Find answers, ask questions, and share your expertise

Filtering text files using spark


I have a lot of text files in a repository that I want to filter using Spark. After filtering, I want the same number of filtered files as output (for example, if I give 1000 files as input, I want the corresponding 1000 filtered files as output). I also want the output to retain the order of lines as it was in the input.

I want to do it in the fastest way possible.

From what I understand, if I break the files into lines and process each line in a mapper, I then run into the problem of combining the lines back into files, sorting them, and clustering them in the reducer step. I am wondering if this is the right approach.

I am new to Spark, so I am not sure of the best way to do this. Any ideas?


Rising Star


It actually depends on the size of your files. If the sizes are not excessive, you might consider doing:

sc.wholeTextFiles("/path/to/your/files").map(f => /* your filter logic here, f._2 is the content of a file */)

This way, every file is read as a separate element of your RDD; you can manipulate it and do whatever you want, without needing to combine lines back together or reorder them.
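To make this concrete, here is a minimal sketch of the per-file transform you would pass to `map`. The predicate (keeping lines that contain "ERROR") and the object name are placeholders for your own filter logic, and the Spark usage is shown only in comments:

```scala
// Sketch of the per-file transform to use after sc.wholeTextFiles.
// keepLine is a hypothetical predicate; substitute your own filter logic.
object FilterLogic {
  def keepLine(line: String): Boolean = line.contains("ERROR")

  // Takes a (path, content) pair as produced by sc.wholeTextFiles and
  // returns the same pair with non-matching lines removed. Because each
  // file is processed as one unit, the original line order is preserved.
  def filterFile(path: String, content: String): (String, String) =
    (path, content.split("\n").filter(keepLine).mkString("\n"))
}

// Usage inside Spark (assumes a SparkContext named sc):
//   sc.wholeTextFiles("/path/to/your/files")
//     .map { case (path, content) => FilterLogic.filterFile(path, content) }
//   // then write each (path, filteredContent) pair back out, one file per input
```

Note that `wholeTextFiles` loads each file entirely into memory as a single string, which is why this approach is only advisable when individual files are not too large.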
