Support Questions

rinkya32 · ‎07-17-2018

I have two CSV files. Samples are below:

file1.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T

file2.csv:

Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I want to find the unique records across these two sample files, comparing on PAN and TIN, using Apache NiFi.

The output should look like this:

raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B

I am new to NiFi and don't know which processors I can use to solve this problem. Please let me know the complete flow.

MattWho · ‎07-17-2018

@Rinky Arora

Here is a simple flow that will compare lines of a CSV file and delete any that are duplicates:

Template of above attached:
detect-duplicate-lines-in-csv.xml
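
For readers who cannot open the template, the idea behind whole-line duplicate detection can be sketched outside NiFi. This is plain Python, not NiFi configuration, and it is only an illustration of the logic: the function name `drop_duplicate_lines` is made up for this example, and the choice of SHA-256 is an assumption, since the thread does not say which hash the template uses.

```python
import hashlib


def drop_duplicate_lines(lines):
    """Keep the first occurrence of each line, comparing lines by content hash."""
    seen = set()      # hashes of lines already emitted
    unique = []       # lines kept, in original order
    for line in lines:
        # Hash the entire line, so two rows match only if every field matches.
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(line)
    return unique


# Data rows of file1.csv from the question (header omitted).
rows = [
    "raj,Awppp1234R,Erica,EWUIP1876T",
    "avinav,EOKLP8970Y,Optus,efgtu8976t",
    "brijesh,Qoplo1987U,InfoGaint,rhfuo1348r",
    "raj,Awppp1234R,Erica,EWUIP1876T",
]
deduped = drop_duplicate_lines(rows)
print("\n".join(deduped))
```

Because the whole line is hashed, two rows that share PAN and TIN but differ in Name or Organization would both be kept.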

If you want to compare only the PAN and TIN values of each line rather than the entire line, it gets a bit more complicated.
You would then need to extract the PAN and TIN values from the content into FlowFile attributes and use the HashAttribute processor instead of HashContent.

Hope this helps get you going.

Thank you,

Matt

If you found this answer addressed your original question, please take a moment to log in and click "Accept" below the answer.

MattWho · ‎07-17-2018

Here is the flow that could be used based on just looking at the PAN and TIN values in each line:

detect-duplicate-attr-in-csv.xml
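
As a rough illustration of what comparing on PAN and TIN means for the sample data, here is a small Python sketch. It is not the NiFi flow from the template; it only shows the comparison logic. The function name `unique_by_pan_tin` is hypothetical, and the sketch assumes an exact, case-sensitive match on both columns.

```python
import csv
import io


def unique_by_pan_tin(csv_texts):
    """Return the first row seen for each (PAN, TIN) pair across all inputs."""
    seen = set()
    unique_rows = []
    for text in csv_texts:
        # DictReader uses the header line to name the columns.
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["PAN"], row["TIN"])  # only these two columns are compared
            if key not in seen:
                seen.add(key)
                unique_rows.append(row)
    return unique_rows


FILE1 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
avinav,EOKLP8970Y,Optus,efgtu8976t
brijesh,Qoplo1987U,InfoGaint,rhfuo1348r
raj,Awppp1234R,Erica,EWUIP1876T
"""

FILE2 = """Name,PAN,Organization,TIN
raj,Awppp1234R,Erica,EWUIP1876T
sanjay,RTRGH1679E,INFY,WJKOI1894G
himanshu,POLKJ1673T,data69,TVBHU186B
"""

result = unique_by_pan_tin([FILE1, FILE2])
for row in result:
    print(",".join(row[col] for col in ("Name", "PAN", "Organization", "TIN")))
```

On the sample files this keeps one row each for raj, avinav, brijesh, sanjay and himanshu, which matches the output requested in the question.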

MattWho · ‎07-17-2018

For either of these examples, you will need to create a "demarcator" file on disk that contains a newline, and then point at that file in the associated configuration of the MergeContent processors, so that the merged file has one FlowFile's content per line.
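
One possible way to create such a file is a few lines of Python. The file name `newline-demarcator.txt` is only a placeholder chosen for this sketch; use whatever path the NiFi service user can read.

```python
from pathlib import Path

# Hypothetical file name; any location readable by NiFi would do.
demarcator_path = Path("newline-demarcator.txt")

# Write exactly one newline byte. Writing bytes avoids any platform-specific
# newline translation (for example "\r\n" on Windows).
demarcator_path.write_bytes(b"\n")

print(demarcator_path.resolve())
```

The printed absolute path is what you would then reference in the MergeContent configuration.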

rinkya32 · ‎07-18-2018

Thanks @Matt Clarke. This solution worked very well for me. Thanks a lot.

MattWho · ‎07-18-2018

@Rinki

Please start a new forum question. I am probably not the best resource for SQL statements. Starting a new question will get you a faster response.

Thank you,

Matt

Cloudera Community

How to find out duplicate rows (duplicacy to be checked on basis of 2 attributes ) in csv files using apache nifi ?
