how to identify certain keywords from a flat file, row by row

I have a flat file that contains certain sensitive keywords, and I would like to identify these keywords row by row. They could appear in any column of the file.

Appreciate any help.

A solution in either Hive or Pig is fine.

1 ACCEPTED SOLUTION

Re: how to identify certain keywords from a flat file, row by row

@Reddy The easiest way to do this, in my opinion, is via NiFi. Ingest your file with NiFi and do a split text, essentially creating a flow file for each line in the file. Load your sensitive keywords into the NiFi distributed map cache (DMC). For each value in the row, do a lookup against the DMC. If any of the fields match a sensitive keyword, you can route on text and do whatever you wish with the record, e.g. store it in an HDFS location. Instead of storing the individual matching records on HDFS, you can also use MergeContent to merge x number of records into a file and then store that file on HDFS.
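
To make the logic of that flow concrete, here is a minimal Python sketch of the same steps outside NiFi: split the file into rows, look up each field in a keyword set (a stand-in for the distributed map cache), route the matching rows, and merge them into batches. It is not NiFi configuration. The function names, the comma delimiter, the case-insensitive comparison, and the sample data are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def load_keywords(lines: Iterable[str]) -> Set[str]:
    """Stand-in for the map cache: one keyword per line, lower-cased."""
    return {line.strip().lower() for line in lines if line.strip()}


def split_and_route(
    rows: Iterable[str], keywords: Set[str], delimiter: str = ","
) -> Iterator[Tuple[str, bool]]:
    """Yield (row, is_sensitive) for each row, one row at a time."""
    for row in rows:
        row = row.rstrip("\n")
        # Exact, case-insensitive match of each whole field against the set.
        fields = (f.strip().lower() for f in row.split(delimiter))
        yield row, any(f in keywords for f in fields)


def merge_batches(rows: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group rows into lists of at most batch_size (the "merge x records" step)."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    # Hypothetical keywords and rows, for illustration only.
    keywords = load_keywords(["ssn", "password"])
    data = ["1,alice,ssn", "2,bob,ok", "3,carol,Password", "4,dave,fine"]
    flagged = [row for row, hit in split_and_route(data, keywords) if hit]
    for batch in merge_batches(flagged, batch_size=2):
        print(batch)
```

Note that this sketch only flags a row when an entire field equals a keyword, which mirrors a key lookup in a cache. Keywords embedded in longer text would need a substring or pattern match instead.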

Re: how to identify certain keywords from a flat file, row by row

For Pig and Hive implementations, I'd suggest you create a UDF. If this is new territory for you, here are some quick blog posts on creating (simple) UDFs for Pig and Hive: https://martin.atlassian.net/wiki/x/C4BRAQ and https://martin.atlassian.net/wiki/x/GoBRAQ. Good luck!
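
As a rough idea of what the body of such a UDF might do, here is a small Python sketch of a per-row function that takes any number of column values and returns the sensitive keywords found in them. It is plain Python rather than an actual Pig or Hive UDF, so the registration and wrapper code from the linked posts would still be needed. The names `compile_keywords` and `find_sensitive`, the whole-word matching rule, and the sample keywords are assumptions for illustration.

```python
import re
from typing import List, Optional, Sequence


def compile_keywords(keywords: Sequence[str]) -> "re.Pattern[str]":
    """Build one case-insensitive pattern matching any keyword as a whole word.

    Assumes each keyword starts and ends with a word character, since the
    pattern relies on \\b word boundaries.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Longest first, so overlapping keywords prefer the longer match.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def find_sensitive(pattern: "re.Pattern[str]", *columns: Optional[object]) -> List[str]:
    """Return the distinct keywords found in any column of one row.

    Matches are lower-cased and sorted; NULL (None) columns are skipped.
    """
    found = set()
    for value in columns:
        if value is None:
            continue
        found.update(m.lower() for m in pattern.findall(str(value)))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical keywords and row values, for illustration only.
    pattern = compile_keywords(["ssn", "credit card"])
    print(find_sensitive(pattern, 42, "Her SSN is on file", None))
```

Returning the list of matched keywords, rather than a simple true/false flag, lets the calling query both filter rows (non-empty result) and report which keyword triggered the match.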
