
Filtering text files using Spark

Explorer

I have a lot of text files in a repository that I want to filter using Spark. After filtering, I want the same number of filtered files as output (for example, if I give 1000 files as input, I want the corresponding 1000 filtered files as output). I also want each output file to retain the order of lines as it was in the input.

I want to do it in the fastest way possible.

From what I understand, if I break the files into lines and process each line in its own mapper, then I run into the problem of combining the lines, sorting them, and clustering them back into files in the reducer step. I am wondering if this is the right approach.

I am new to Spark, so I am not sure of the best way to do this. Any ideas?

1 REPLY

Rising Star

Hi,

It actually depends on the size of your files. If the size is not excessive, you might consider doing something like:

sc.wholeTextFiles("/path/to/your/files")  // pair RDD: (file path, whole file content)
  .mapValues(content => content /* your filter logic on the full file content goes here */)

This way, every file is read as a separate element of your RDD, and you can manipulate it however you want, without needing to combine lines back together or reorder them.
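
If you also need to write one filtered file per input file, here is a minimal sketch of how that could look end to end. It assumes Spark running in local mode (or executors that share a filesystem with the output directory), a hypothetical keep() predicate that retains lines containing "ERROR", and made-up input/output paths; adapt all of these to your setup. On HDFS you would write through the Hadoop FileSystem API instead of java.nio.

import java.nio.file.{Files, Paths}
import org.apache.spark.{SparkConf, SparkContext}

object FilterFiles {
  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for this sketch; point this at your cluster as needed
    val sc = new SparkContext(new SparkConf().setAppName("filter-files").setMaster("local[*]"))

    // Hypothetical filter: keep only lines containing "ERROR"; swap in your own predicate
    def keep(line: String): Boolean = line.contains("ERROR")

    val outDir = "/path/to/output"            // made-up output directory
    Files.createDirectories(Paths.get(outDir))

    sc.wholeTextFiles("/path/to/your/files")                 // (file path, whole file content)
      .mapValues(_.split("\n").filter(keep).mkString("\n"))  // filter lines, order preserved
      .foreach { case (path, filtered) =>
        // One output file per input file, named like the input file
        val name = path.substring(path.lastIndexOf('/') + 1)
        Files.write(Paths.get(outDir, name), filtered.getBytes("UTF-8"))
      }

    sc.stop()
  }
}

With this layout each task handles exactly one whole file, so the line order inside every output file is preserved automatically and no shuffle or reducer step is needed; the trade-off, as noted above, is that each file must fit comfortably in an executor's memory.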
