Support Questions
Find answers, ask questions, and share your expertise

Filtering text files using spark



New Contributor

I have a lot of text files in a repository that I want to filter using Spark. After filtering, I want the same number of filtered files as output (for example, if I give 1000 files as input, I want 1000 corresponding filtered files as output). I also want the output to retain the order of lines from the input.

I want to do it in the fastest way possible.

From what I understand, if I break the files into lines and process each line in a mapper, I run into the problem of combining, sorting, and clustering the lines back into files in the reducer step. I am wondering if this is the right approach.

I am new to Spark, so I am not sure of the best way to do this. Any ideas?


Re: Filtering text files using spark

Rising Star


It actually depends on the size of your files. If the sizes are not excessive, you might consider doing:

sc.wholeTextFiles("/path/to/your/files").map(f => /* your filter logic here, f._2 is the content of a file */)

This way, every file is read as a single element of your RDD, so you can manipulate it and do whatever you want, without needing to combine lines back together or reorder them.
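To make the idea concrete, here is a small sketch. The per-file helper and the sample predicate (keeping lines that contain "ERROR") are assumptions for illustration, not part of the original answer; the point is that filtering a file's content as one string preserves line order with no reducer-side reassembly.

```scala
// Filter one file's content, keeping line order. The predicate is a
// hypothetical example; substitute your own filter logic.
def filterContent(content: String, keep: String => Boolean): String =
  content.split("\n").filter(keep).mkString("\n")

// Sketch of how this plugs into wholeTextFiles (requires a SparkContext `sc`):
// each RDD element is a (path, content) pair, so one task filters one whole
// file and the output naturally maps one-to-one with the input files.
//
//   sc.wholeTextFiles("/path/to/your/files")
//     .map { case (path, content) =>
//       (path, filterContent(content, _.contains("ERROR")))
//     }
//     // then write each filtered content out under a name derived from `path`

val sample = "ok line\nERROR one\nanother ok\nERROR two"
val filtered = filterContent(sample, _.contains("ERROR"))
println(filtered)
```

Note that `wholeTextFiles` loads each file entirely into memory as a single string, which is why the answer hedges on file size: very large individual files would be better handled with `textFile` plus per-file output logic.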