How to identify certain keywords in a flat file, row by row

Expert Contributor

I have a flat file that contains certain sensitive keywords, and I would like to identify them row by row. The keywords could appear in any column of the file.

I'd appreciate any help.

Either Hive or Pig is fine.

1 ACCEPTED SOLUTION

Master Guru

@Reddy In my opinion, the easiest way to do this is with NiFi. Ingest your file via NiFi and use SplitText to create a flow file for each line in the file. Load your sensitive keywords into the NiFi distributed map cache (DMC). Look up each value in the row against the DMC. If any field matches a sensitive keyword, you can route on text and do whatever you wish with the record, e.g. store it in an HDFS location. Instead of storing individual records (the ones that contain sensitive keywords) on HDFS, you can also use MergeContent to merge x records into one file and then store that file on HDFS.
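
To make the per-row logic concrete, here is a minimal Python sketch of the same idea outside NiFi: split the file into rows, look up each field value in a keyword set, keep the rows that match, and group them into batches. This is not NiFi code. The keyword list, the comma delimiter, the sample rows, and the function names are all assumptions for illustration, and the match is a whole-field, case-insensitive comparison.

```python
import csv
import io

# Hypothetical keyword list; in the NiFi flow this would be loaded into the DMC.
SENSITIVE_KEYWORDS = {"ssn", "password", "salary"}

def find_sensitive_rows(lines, keywords, delimiter=","):
    """Yield (row_number, matched_keywords, fields) for each row with a match."""
    lookup = {k.lower() for k in keywords}
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, fields in enumerate(reader, start=1):
        # Whole-field comparison: a field must equal a keyword to count as a hit.
        matched = sorted({f.strip().lower() for f in fields} & lookup)
        if matched:
            yield row_number, matched, fields

def batch(records, size):
    """Group records into lists of at most `size` (similar in spirit to merging x records)."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Made-up sample data standing in for the flat file.
sample = io.StringIO("1,alice,ssn\n2,bob,ok\n3,password,carol\n")
hits = list(find_sensitive_rows(sample, SENSITIVE_KEYWORDS))
for row_number, matched, fields in hits:
    print(row_number, matched, fields)
```

If the keywords can appear inside longer values rather than as whole fields, the set intersection would need to be replaced with a substring or regular-expression check.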


2 REPLIES


For Pig and Hive implementations, I'd suggest you create a UDF. If this is new territory for you, here are some quick blog posts on creating (simple) UDFs for Pig and Hive: https://martin.atlassian.net/wiki/x/C4BRAQ and https://martin.atlassian.net/wiki/x/GoBRAQ. Good luck!
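
As a rough illustration of what such row-level logic might look like, here is a sketch written as a Python streaming script rather than a compiled UDF. That is a swapped-in technique, not what the linked posts describe. Hive can pipe rows through an external script with SELECT TRANSFORM(...) USING, passing columns as tab-separated text on standard input. The keyword list, function names, and the extra output column are assumptions made for the example.

```python
import io

# Hypothetical keyword list, stored in lowercase; replace with the real sensitive terms.
SENSITIVE_KEYWORDS = frozenset({"ssn", "password", "salary"})

def matched_keywords(fields, keywords=SENSITIVE_KEYWORDS):
    """Return the sorted keywords that equal any field (case-insensitive)."""
    return sorted({f.strip().lower() for f in fields} & keywords)

def process(stream_in, stream_out, delimiter="\t"):
    """Copy each row to stream_out with an extra column listing matched keywords."""
    for line in stream_in:
        line = line.rstrip("\n")
        hits = matched_keywords(line.split(delimiter))
        stream_out.write(line + delimiter + ",".join(hits) + "\n")

# As a streaming script, the call would be: process(sys.stdin, sys.stdout)
# Here the function is exercised on made-up in-memory rows instead.
demo_out = io.StringIO()
process(io.StringIO("1\talice\tssn\n2\tbob\tok\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Rows without a match get an empty last column, so a downstream query can filter on that column being non-empty.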