How to filter files whose names contain specific text using Spark


I need to check each file name and decide whether to pick the file for processing based on whether the name contains a specific word.

E.g., in mydirectory I have two files:

file1: sample1.txt_processed

file2: sample2.txt

I need to check the file names for the "_processed" keyword and pick only the files whose names do not contain it.

Can anyone help me with this scenario?



@Chaitanya D

This is possible with a combination of Unix commands and Spark.

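hadoop fs -ls /filedirectory/*.txt

The glob above matches only names ending in ".txt", so it lists sample2.txt and skips sample1.txt_processed. (The pattern *txt_processed would list the already processed files instead.) Then pass the result to Spark and process the files as needed.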

Alternatively, you can run the same listing from within Spark (for example in spark-shell):

import scala.sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs:///filedirectory/*.txt").!!
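The listing comes back as one string, so the paths still have to be extracted before Spark can read them. The following is a minimal sketch of that step, not a definitive implementation: the names LsFilter, parsePaths, unprocessed and sampleListing are made up for illustration, and it assumes the usual column layout of hadoop fs -ls output (path in the last column) and paths without spaces.

```scala
object LsFilter {
  // Extract the path (last column) from each regular-file line of the listing.
  // Lines for files start with "-" (e.g. "-rw-r--r--"); directory lines start
  // with "d" and the "Found N items" header starts with neither.
  def parsePaths(lsOutput: String): List[String] =
    lsOutput.split("\\r?\\n").toList
      .map(_.trim)
      .filter(_.startsWith("-"))
      .map(_.split("\\s+").last)

  // Keep only paths whose file name does not contain the marker.
  def unprocessed(paths: List[String], marker: String = "_processed"): List[String] =
    paths.filterNot(p => p.substring(p.lastIndexOf('/') + 1).contains(marker))

  // Hypothetical listing, shaped like typical `hadoop fs -ls` output.
  val sampleListing: String =
    """Found 2 items
      |-rw-r--r--   3 hdfs hdfs   120 2018-05-01 10:15 /filedirectory/sample1.txt_processed
      |-rw-r--r--   3 hdfs hdfs    98 2018-05-01 10:20 /filedirectory/sample2.txt""".stripMargin
}
```

With the real listing, LsFilter.unprocessed(LsFilter.parsePaths(lsResult)) should give the paths that still need processing; joining them with commas, as in sc.textFile(paths.mkString(",")), is one way to hand them to Spark.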

Hope it helps!


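This would also work for files on the local file system (java.io.File does not read HDFS).

import java.io.File
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
val files = getListOfFiles(new File("/tmp"))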
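To apply the "_processed" check from the question to such a local listing, a filter on the file name can be added. This is only a sketch under the same local-file-system assumption; LocalFilter and unprocessedFiles are hypothetical names, and the Option wrapper is there because listFiles returns null when the path is not a readable directory.

```scala
import java.io.File

object LocalFilter {
  // List regular files in `dir` whose names do not contain `marker`.
  // An unreadable or missing directory yields an empty list instead of a
  // NullPointerException.
  def unprocessedFiles(dir: File, marker: String = "_processed"): List[File] =
    Option(dir.listFiles).map(_.toList).getOrElse(Nil)
      .filter(_.isFile)
      .filterNot(_.getName.contains(marker))
}
```

For the two files in the question, LocalFilter.unprocessedFiles(new File("mydirectory")) would keep sample2.txt and drop sample1.txt_processed.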