How to find duplicate rows (duplication checked on the basis of two attributes) in CSV files using Apache NiFi?
- Labels: Apache NiFi
Created ‎07-17-2018 10:58 AM
I have two CSV files. Samples are below:
file1.csv:
Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T
file2.csv:
Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
I want to find the unique records across these two sample files on the basis of PAN and TIN using Apache NiFi.
The output should look like this:
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
I am new to NiFi and don't know which processors I can use to solve this problem. Please let me know the complete flow.
Created on ‎07-17-2018 05:17 PM - edited ‎08-18-2019 01:20 AM
Here is a simple flow that will compare lines of a CSV file and delete any that are duplicates:
Template of above attached:
detect-duplicate-lines-in-csv.xml
If you want to compare only the PAN and TIN values of each line rather than the entire line, it gets a bit more complicated.
You would then need to extract the PAN and TIN values from the content and use the HashAttribute processor instead of HashContent.
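To make the intended result concrete, here is a minimal sketch in plain Python (not NiFi configuration) of the same idea: build a key from the PAN and TIN values of each line and keep only the first line seen for each key. The function name `unique_rows` and the `FILE1`/`FILE2` variables are made up for illustration; the data is the sample from the question. The comparison is assumed to be case-sensitive.

```python
import csv
import io

# Sample data copied from the question.
FILE1 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T
"""

FILE2 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
"""


def unique_rows(csv_texts, key_fields=("PAN", "TIN")):
    """Return (header, rows), keeping the first row seen for each key.

    The key is the tuple of values in key_fields (hypothetical helper,
    case-sensitive comparison assumed).
    """
    seen = set()
    header = None
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        for row in reader:
            key = tuple(row[field] for field in key_fields)
            if key not in seen:
                seen.add(key)
                rows.append([row[name] for name in header])
    return header, rows


header, rows = unique_rows([FILE1, FILE2])
for row in rows:
    print(",".join(row))
```

With the sample files, this prints the five lines listed as the desired output in the question. In the NiFi flow, the equivalent key would come from the extracted PAN and TIN attributes rather than from a Python tuple.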
Hope this helps get you going.
Thank you,
Matt
If you found that this answer addressed your original question, please take a moment to log in and click "Accept" below the answer.
Created on ‎07-17-2018 05:36 PM - edited ‎08-18-2019 01:20 AM
Here is the flow that could be used when looking only at the PAN and TIN values in each line:
Created ‎07-17-2018 05:42 PM
For either of these examples, you will need to create a "demarcator" file on disk that contains a single newline, and then point to that file in the associated configuration of the MergeContent processors. This ensures that the merged file has one FlowFile's content per line.
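Any tool can create that file. As one possible sketch, the Python below writes a file containing exactly one newline byte. The helper name `write_demarcator` and the file name `newline-demarcator.txt` are made up for illustration, and the temporary directory is only for the demo; in practice you would write to a path that the NiFi host can read.

```python
import tempfile
from pathlib import Path


def write_demarcator(path):
    """Write a file whose only content is a single newline byte."""
    target = Path(path)
    # write_bytes avoids any platform-specific newline translation.
    target.write_bytes(b"\n")
    return target


# Demo only: write into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    demarcator = write_demarcator(Path(tmp) / "newline-demarcator.txt")
    content = demarcator.read_bytes()
    print(len(content))
```

Writing bytes rather than text keeps the file at exactly one `\n` character on every operating system.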
Created ‎07-18-2018 08:16 AM
Thanks @Matt Clarke. This solution worked very well for me. Thanks a lot.
Created ‎07-18-2018 01:37 PM
Please start a new forum question. I am probably not the best resource for SQL statements, and starting a new question will get you a faster response.
Thank you,
Matt
