Created 02-08-2017 05:47 AM
In a flat file, I have certain keywords which are sensitive, and I would like to identify these sensitive keywords row by row. The keywords could appear in any column of the flat file.
Appreciate any help.
Either Hive or Pig is fine.
Created 02-26-2017 03:52 AM
@Reddy The easiest way, in my opinion, is via NiFi. Ingest your file, then use SplitText to create a flow file for each line. Load your sensitive keywords into the NiFi DistributedMapCache (DMC). Look up each value in the row against the DMC, which stores your sensitive keywords. If any field matches a sensitive keyword, you can route on that match and do whatever you wish, e.g. store that record in an HDFS location. Instead of storing individual records (the ones containing sensitive keywords) on HDFS, you can also use MergeContent to merge x records into a file and then store that on HDFS.
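The per-row check the flow performs can be sketched outside NiFi as well. Here is a minimal, hedged Java sketch of that matching logic, assuming a tab-delimited file; the keyword set and class name are illustrative (in the NiFi flow the keywords would live in the DistributedMapCache, not a hard-coded set):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SensitiveRowFilter {
    // Illustrative keyword set; in the NiFi flow these entries would be
    // loaded into the DistributedMapCache instead.
    private static final Set<String> SENSITIVE =
            new HashSet<>(Arrays.asList("ssn", "password", "credit_card"));

    // Returns true if any field of the delimited row matches a sensitive keyword,
    // mirroring the lookup done per flow file after SplitText.
    static boolean isSensitive(String row, String delimiter) {
        for (String field : row.split(delimiter)) {
            if (SENSITIVE.contains(field.trim().toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A matching row would be routed to HDFS; a clean row passes through.
        System.out.println(isSensitive("john\tdoe\tpassword", "\t")); // prints "true"
        System.out.println(isSensitive("jane\troe\temail", "\t"));    // prints "false"
    }
}
```

Each line of the file would be run through a check like this, with the matching rows routed to the sensitive-record path.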
Created 02-26-2017 06:18 PM
For Pig and Hive implementations, I'd suggest you create a UDF. If this is new territory for you, here are some quick blog posts on creating (simple) UDFs for Pig and Hive: https://martin.atlassian.net/wiki/x/C4BRAQ and https://martin.atlassian.net/wiki/x/GoBRAQ. Good luck.
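As a rough sketch of what such a UDF's core logic might look like: the method below takes a row and a keyword list and returns the first sensitive keyword found, or null if the row is clean. This is a plain-Java illustration only; in an actual Hive UDF this logic would sit inside an evaluate() method on a class extending org.apache.hadoop.hive.ql.exec.UDF (with hive-exec on the classpath), and the class and method names here are hypothetical:

```java
import java.util.Locale;

public class FindSensitiveKeyword {
    // Returns the first keyword that matches any field of the delimited row,
    // or null when no field matches. A Hive/Pig UDF wrapping this could be
    // used in a query to flag or filter rows containing sensitive values.
    static String firstMatch(String row, String delimiter, String[] keywords) {
        for (String field : row.split(delimiter)) {
            String f = field.trim().toLowerCase(Locale.ROOT);
            for (String kw : keywords) {
                if (f.equals(kw.toLowerCase(Locale.ROOT))) {
                    return kw;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] keywords = {"ssn", "password"};
        System.out.println(firstMatch("alice,ssn,42", ",", keywords)); // prints "ssn"
        System.out.println(firstMatch("bob,name,7", ",", keywords));   // prints "null"
    }
}
```

Once wrapped as a UDF and registered (CREATE FUNCTION in Hive, REGISTER in Pig), it could be applied row by row in a query to tag the rows that contain sensitive keywords.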