My scenario is to check each file name and decide whether to pick the file for processing depending on whether the name contains a specific word.
E.g., in mydirectory I have two filenames:
Now I need to check the file names for the "_processed" keyword and pick only the files whose names do not contain "_processed".
Can anyone help me with this scenario?
This is possible with a combination of Unix commands and Spark.
hadoop fs -ls /filedirectory/*txt_processed
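The command above lists the files whose names end in "txt_processed". To get the files without "_processed" instead, list the directory and drop the matching entries, for example by piping the listing through grep -v "_processed". Then pass the result to Spark and process the files as you need.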
Alternatively, in Spark you can run the same listing with the command below.
import scala.sys.process._  // provides the .!! method on Seq[String]
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!
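Since the goal is to keep only the files without "_processed" in the name, the listing text can be filtered in Scala before anything is handed to Spark. The following is only a sketch: the helper name unprocessedPaths and the sample file names are made up for illustration, and it assumes the usual "hadoop fs -ls" layout where the path is the last whitespace-separated field and contains no spaces.

```scala
// Hypothetical helper: extract the paths of unprocessed files from `hadoop fs -ls` text.
// Assumes the path is the last whitespace-separated field and has no spaces in it.
def unprocessedPaths(lsOutput: String, marker: String = "_processed"): List[String] =
  lsOutput.split("\n").toList
    .map(_.trim)
    // Skip blank lines and the "Found N items" header line.
    .filter(line => line.nonEmpty && !line.startsWith("Found "))
    .map(_.split("\\s+").last)
    // Test only the file name, not the parent directories.
    .filter(path => !path.substring(path.lastIndexOf('/') + 1).contains(marker))

// Sample text in the usual listing layout (file names are made up).
val sample =
  """Found 2 items
    |-rw-r--r--   3 user group  10 2020-01-01 12:00 /filedirectory/a.txt
    |-rw-r--r--   3 user group  10 2020-01-01 12:00 /filedirectory/b.txt_processed""".stripMargin

val toProcess = unprocessedPaths(sample)
```

In a Spark shell, the string returned by the "hadoop fs -ls" call on the whole directory could be passed to this helper in place of sample, and the resulting paths could then be read with something like spark.read.textFile(toProcess: _*).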
Hope it helps!
This would also work.
import java.io.File
// Return the regular files (not subdirectories) directly under dir
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
val files = getListOfFiles(new File("/tmp"))
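The listing above returns every regular file, so the "_processed" check still has to be applied to the names. A minimal sketch of that filter follows; unprocessedFiles is a hypothetical name, the demo file names are made up, and java.io.File only sees the local file system, so this would not list an HDFS directory.

```scala
import java.io.File
import java.nio.file.Files

// Hypothetical helper: regular files directly under dir whose names do not contain marker.
def unprocessedFiles(dir: File, marker: String = "_processed"): List[File] =
  Option(dir.listFiles)   // listFiles returns null when dir is not a readable directory
    .map(_.toList)
    .getOrElse(Nil)
    .filter(f => f.isFile && !f.getName.contains(marker))

// Demo in a throwaway temp directory with made-up file names.
val tmpDir = Files.createTempDirectory("demo").toFile
new File(tmpDir, "a.txt").createNewFile()
new File(tmpDir, "b.txt_processed").createNewFile()

val pending = unprocessedFiles(tmpDir).map(_.getName)
```

Here pending holds only the name of the file that has not been marked as processed, and the same call with new File("/tmp") would apply the filter to the directory used above.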