
How to filter files containing specific text in the file name using Spark

My scenario is to check each file name and decide, based on whether the name contains a specific word, if that file should be picked for processing.

E.g., in mydirectory I have two files:

file1: sample1.txt_processed

file2: sample2.txt

Now I need to check which file names contain the "_processed" keyword and pick only the files without "_processed" in the name.

Can anyone help me with this scenario?


@Chaitanya D

This is possible with a combination of Unix and Spark.

hadoop fs -ls /filedirectory/*.txt

The unprocessed files end in ".txt" while the processed ones end in ".txt_processed", so the glob above matches only the files you still need. Then pass the result to Spark and process the files as required.

Alternatively, you can run the same listing from inside Spark (Scala):

import sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs:///filedirectory/*.txt").!!
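The `-ls` output is plain text (a "Found N items" header followed by one line per file, with the path in the last column), so the paths still have to be pulled out before Spark can read them. Here is a minimal sketch of that step; `unprocessedPaths` and `sampleOutput` are made-up names and data for illustration, and the parsing assumes paths contain no spaces.

```scala
// Hypothetical helper: extract the paths from `hadoop fs -ls` output and
// keep only those whose file name does not contain the marker.
def unprocessedPaths(lsOutput: String, marker: String = "_processed"): List[String] =
  lsOutput.split("\n").toList
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("Found "))
    .map(_.split("\\s+").last) // the path is the last whitespace-separated column
    .filterNot(path => path.substring(path.lastIndexOf('/') + 1).contains(marker))

// Illustrative listing only, not real cluster output.
val sampleOutput =
  """Found 2 items
    |-rw-r--r--   3 hdfs hdfs   120 2018-05-01 10:15 /filedirectory/sample1.txt_processed
    |-rw-r--r--   3 hdfs hdfs   340 2018-05-01 10:16 /filedirectory/sample2.txt""".stripMargin

val toProcess = unprocessedPaths(sampleOutput)
```

Assuming a SparkContext named `sc`, the resulting list could then be read with something like `sc.textFile(toProcess.mkString(","))`, since `textFile` accepts a comma-separated list of paths.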

Hope it helps!

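This would also work if the files are on the local filesystem:

import java.io.File
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
val files = getListOfFiles(new File("/tmp")).filterNot(_.getName.contains("_processed"))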
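Since `java.io.File` only sees the local filesystem, a similar listing for files stored in HDFS could go through the Hadoop FileSystem API, which is on the classpath in a Spark shell. This is a rough sketch under that assumption; `listUnprocessed` is a hypothetical helper name and "/filedirectory" is the directory from the question.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: list the files directly under `dir` whose names
// do not contain `marker`, using the Hadoop FileSystem API.
def listUnprocessed(dir: String, conf: Configuration, marker: String = "_processed"): Seq[String] = {
  val path = new Path(dir)
  val fs = path.getFileSystem(conf) // picks HDFS, local, etc. from the path and config
  fs.listStatus(path).toSeq
    .filter(_.isFile)                        // skip subdirectories
    .map(_.getPath)
    .filterNot(_.getName.contains(marker))   // drop e.g. sample1.txt_processed
    .map(_.toString)
}

// Possible usage in spark-shell, where `sc` is the provided SparkContext:
// val inputs = listUnprocessed("/filedirectory", sc.hadoopConfiguration)
// val lines  = sc.textFile(inputs.mkString(","))
```

Filtering on the file name rather than the full path keeps a directory that happens to contain "_processed" in its own name from hiding every file beneath it.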