Support Questions
Find answers, ask questions, and share your expertise

How to filter files containing specific text in the file name using Spark


New Contributor

My scenario is to check the file name and, depending on whether it contains a specific word, pick that file for processing.

For example, in my directory I have two files:

file1: sample1.txt_processed

file2: sample2.txt

Now I need to check the file names for the "_processed" keyword and pick only the files whose names do not contain "_processed".

Can anyone help me with this scenario?

2 REPLIES

Re: How to filter files containing specific text in the file name using Spark

@Chaitanya D

This is possible with a combination of Unix commands and Spark.

hadoop fs -ls /filedirectory | grep -v "_processed"

The command above lists only the files whose names do not contain "_processed". Pass that result to Spark and process the files as you need.
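
For example, with the two files from the question the listing keeps only sample2.txt, which you can then read straight into Spark. A minimal sketch for spark-shell (the path below is a placeholder):

val df = spark.read.text("/filedirectory/sample2.txt")
// process the DataFrame as needed, for example:
df.show(5)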

Alternatively, you can run the same listing from Spark (for example in spark-shell) using the command below.

import scala.sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "/filedirectory").!!
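
The !! call returns the whole listing as a single string, so you still need to split it into lines, drop the "_processed" entries, and pull out the paths before handing them to Spark. A rough sketch of that step, continuing from lsResult above (run in spark-shell; /filedirectory is a placeholder):

// Keep the listing lines that contain a path but not "_processed",
// then take the last whitespace-separated field, which is the file path
val unprocessedPaths = lsResult.split("\n")
  .filter(_.contains("/"))
  .filterNot(_.contains("_processed"))
  .map(_.split("\\s+").last)

// Read just those files into a DataFrame for further processing
val df = spark.read.text(unprocessedPaths: _*)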

Hope it helps!

Re: How to filter files containing specific text in the file name using Spark

This would also work (for files on the local filesystem):

import java.io.File

// List the regular files in a directory on the local filesystem
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList

val files = getListOfFiles(new File("/tmp"))
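
To finish the scenario from the question, you can then drop the "_processed" names and hand the remaining paths to Spark. A minimal sketch, assuming the files live on the local filesystem (java.io.File cannot see HDFS) and that this runs in spark-shell:

// Keep only the files whose names do not contain "_processed"
val unprocessed = files.filterNot(_.getName.contains("_processed"))

// Read those files into a DataFrame using their absolute local paths
val df = spark.read.text(unprocessed.map(_.getAbsolutePath): _*)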