Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

How to find duplicate rows in CSV files using Apache NiFi (duplicates to be checked on the basis of 2 attributes)?

New Member

I have two CSV files. Samples are below:

file1.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T

file2.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I want to find the unique records between these two sample files on the basis of PAN and TIN using Apache NiFi.

So the output should look like this:

raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I am new to NiFi and don't know which processors I can use to solve this problem. Please let me know the complete flow.


1 ACCEPTED SOLUTION

Master Mentor
@Rinky Arora

-

Here is a simple flow that will compare lines of a CSV file and delete any that are duplicates:

80576-screen-shot-2018-07-17-at-10702-pm.png

Template of above attached:
detect-duplicate-lines-in-csv.xml
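
Since the screenshot and template may no longer be available, here is a rough sketch of the whole-line comparison idea in plain Python rather than NiFi. It is not the attached flow itself: it only illustrates hashing each line and keeping the first occurrence. The function name `drop_duplicate_lines` and the choice of SHA-256 are assumptions made for the illustration.

```python
import hashlib

def drop_duplicate_lines(lines):
    """Keep the first occurrence of each line, comparing by a hash of the whole line."""
    seen = set()       # hashes of lines already emitted
    unique = []
    for line in lines:
        text = line.strip()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

# Data rows from the question's two sample files (headers left out).
file1_rows = [
    "raj,Awppp1234R,Erica,EWUIP1876T",
    "avinav,EOKLP8970Y,Optus,efgtu8976t",
    "brijesh,Qoplo1987U,InfoGaint,rhfuo1348r",
    "raj,Awppp1234R,Erica,EWUIP1876T",
]
file2_rows = [
    "raj,Awppp1234R,Erica,EWUIP1876T",
    "sanjay,RTRGH1679E,INFY,WJKOI1894G",
    "himanshu,POLKJ1673T,data69,TVBHU186B",
]

result = drop_duplicate_lines(file1_rows + file2_rows)
for row in result:
    print(row)
```

With the sample data this leaves five rows, because the three identical "raj" lines collapse into one.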

If you want to compare only the PAN and TIN values of each line rather than the entire line, it gets a bit more complicated.
You would then need to extract the PAN and TIN values from the content and use the HashAttribute processor instead of HashContent.
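
To make the difference concrete, here is a minimal Python sketch of deduplicating on PAN and TIN only. It is an illustration of the logic, not a NiFi configuration. The function name `unique_by_pan_tin` is hypothetical, and the sketch assumes values are compared exactly as written (case-sensitive) and that the first row seen for a given PAN/TIN pair is the one to keep.

```python
import csv
import io

FILE1 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T
"""

FILE2 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
"""

def unique_by_pan_tin(*csv_texts):
    """Return rows whose (PAN, TIN) pair has not been seen before."""
    seen = set()           # (PAN, TIN) pairs already emitted
    unique_rows = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["PAN"], row["TIN"])   # only these two columns decide uniqueness
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
    return unique_rows

rows = unique_by_pan_tin(FILE1, FILE2)
for row in rows:
    print(",".join(row[col] for col in ("Name", "PAN", "Organization", "TIN")))
```

Unlike a whole-line comparison, this would also drop a row whose Name or Organization differs but whose PAN and TIN match an earlier row.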

-

Hope this helps get you going.
-

Thank you,

Matt
-
If you found this Answer addressed your original question, please take a moment to login and click "Accept" below the answer.


5 REPLIES

Master Mentor

Here is the flow that could be used based on just looking at the PAN and TIN values in each line:

80578-screen-shot-2018-07-17-at-12703-pm.png

detect-duplicate-attr-in-csv.xml

Master Mentor

For either of these examples you will need to create a "demarcator" file on disk that contains a new line, and then point at that file in the associated config of the MergeContent processors so that the merged file has one FlowFile content per line.
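
One way to create such a file is sketched below. The file name `demarcator.txt` is a hypothetical placeholder; in practice the file would need to live at a path the NiFi service can read, and that path is what you would then reference in the MergeContent configuration.

```python
from pathlib import Path

# Hypothetical location; choose a path that is readable by NiFi.
demarcator_path = Path("demarcator.txt")

# Write exactly one newline character. Writing bytes avoids any
# platform-specific newline translation.
demarcator_path.write_bytes(b"\n")

print(f"Wrote {demarcator_path.stat().st_size} byte(s) to {demarcator_path}")
```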

New Member

Thanks @Matt Clarke. This solution worked very well for me. Thanks a lot.

Master Mentor

@Rinki

Please start a new forum question. I am probably not the best resource for SQL statements. Starting a new question will get you a faster response.

-

Thank you,

Matt