How to find duplicate rows (duplicates checked on the basis of 2 attributes) in CSV files using Apache NiFi?

Explorer

I have two CSV files. Samples are below:

file1.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T


file2.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I want to find the unique records across these 2 sample files on the basis of PAN and TIN using Apache NiFi.

So the output should look like this:

raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I am new to NiFi and don't know which processors I can use to solve this problem. Please let me know the complete flow to solve it.


1 ACCEPTED SOLUTION

Super Mentor
@Rinky Arora

Here is a simple flow that will compare lines of a CSV file and delete any that are duplicates:

80576-screen-shot-2018-07-17-at-10702-pm.png

Template of above attached:
detect-duplicate-lines-in-csv.xml
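
For readers who want to see the idea outside NiFi, here is a minimal Python sketch of whole-line deduplication by content hash. It is not the attached template and does not use any NiFi API; the function name `drop_duplicate_lines` is made up for illustration, and the rows are the sample data from the question.

```python
import hashlib


def drop_duplicate_lines(lines):
    """Keep the first occurrence of each line, comparing by a hash of the full line."""
    seen = set()
    unique = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        # Hash the entire line, so two lines match only if every field matches.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Data rows from file1.csv and file2.csv in the question (headers omitted).
file1_rows = [
    "raj,Awppp1234R,Erica,EWUIP1876T",
    "avinav,EOKLP8970Y,Optus,efgtu8976t",
    "brijesh,Qoplo1987U,InfoGaint,rhfuo1348r",
    "raj,Awppp1234R,Erica,EWUIP1876T",
]
file2_rows = [
    "raj,Awppp1234R,Erica,EWUIP1876T",
    "sanjay,RTRGH1679E,INFY,WJKOI1894G",
    "himanshu,POLKJ1673T,data69,TVBHU186B",
]

unique_rows = drop_duplicate_lines(file1_rows + file2_rows)
for row in unique_rows:
    print(row)
```

With the sample data this leaves five rows, since the repeated "raj" line is identical in every field.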

If you want to compare only the PAN and TIN values of each line rather than the entire line, it gets a bit more complicated.
You would then need to extract the PAN and TIN values from the content into FlowFile attributes and use the HashAttribute processor instead of HashContent.
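
As a rough illustration of comparing on PAN and TIN only, here is a Python sketch that keys each row on those two columns. It only mirrors the logic and is not a NiFi configuration; `unique_by_pan_tin` is a hypothetical helper, and the comparison is an exact, case-sensitive match, which is an assumption.

```python
import csv
import io

FILE1 = """\
Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T
"""

FILE2 = """\
Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
"""

COLUMNS = ("Name", "PAN", "Organization", "TIN")


def unique_by_pan_tin(*csv_texts):
    """Return the first row seen for each (PAN, TIN) pair across all inputs."""
    seen = set()
    unique = []
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["PAN"], row["TIN"])  # only these two fields decide duplication
            if key not in seen:
                seen.add(key)
                unique.append(row)
    return unique


for row in unique_by_pan_tin(FILE1, FILE2):
    print(",".join(row[col] for col in COLUMNS))
```

Unlike a whole-line comparison, a row with a different Name or Organization but the same PAN and TIN would be treated as a duplicate here.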

Hope this helps get you going.

Thank you,

Matt

If you found this answer addressed your original question, please take a moment to log in and click "Accept" below the answer.

5 REPLIES

Super Mentor

Here is the flow that could be used based on just looking at the PAN and TIN values in each line:

80578-screen-shot-2018-07-17-at-12703-pm.png

detect-duplicate-attr-in-csv.xml

Super Mentor

For either of these examples, you will need to create a "demarcator" file on disk that contains a new line, then point at that file in the associated configuration of the MergeContent processors so that the merged file has one FlowFile's content per line.
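
One way to create such a file is sketched below in Python. The file name and location are placeholders chosen for illustration; in practice the file should live at a path the NiFi process can read. The join at the end only imitates what a newline demarcator does to merged content and is not NiFi code.

```python
import tempfile
from pathlib import Path

# Hypothetical location, for illustration only.
demarcator_path = Path(tempfile.gettempdir()) / "demarcator.txt"

# Write exactly one newline byte; write_bytes avoids platform newline translation.
demarcator_path.write_bytes(b"\n")

# Imitate the effect: placing the demarcator between pieces of content
# puts each piece on its own line in the merged result.
flowfile_contents = [
    b"raj,Awppp1234R,Erica,EWUIP1876T",
    b"avinav,EOKLP8970Y,Optus,efgtu8976t",
]
merged = demarcator_path.read_bytes().join(flowfile_contents)
print(merged.decode("utf-8"))
```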

Explorer

Thanks @Matt Clarke. This solution worked very well for me. Thanks a lot.

Super Mentor

@Rinki

Please start a new forum question. I am probably not the best resource for SQL statements. Starting a new question will get you a faster response.

Thank you,

Matt