
Find and remove duplicate entries - NiFi

Explorer

Hi,

Is there a way to find and remove duplicate entries across two FlowFiles?

I have one FlowFile, generated with an SQL processor, that contains the entries from my database. The other one contains both new and "old" entries. So in order to write only the new entries to the database, I have to detect and remove the entries that already exist in the other FlowFile.

I already tried the HashContent processor, but it hashes the content of the whole file. I would need a processor that hashes line by line (and then compares all the hashes with each other).
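
To illustrate the idea outside NiFi, here is a minimal Python sketch of the line-by-line hashing I have in mind (the file names are made up):

    import hashlib

    # Hash every line of the file that holds the existing database entries.
    with open("existing_entries.txt", "rb") as f:
        seen = {hashlib.sha256(line).hexdigest() for line in f}

    # Keep only the lines of the incoming file whose hash was not seen before.
    with open("incoming_entries.txt", "rb") as f:
        new_lines = [line for line in f
                     if hashlib.sha256(line).hexdigest() not in seen]

    # These are the entries that still need to be written to the database.
    with open("new_entries.txt", "wb") as f:
        f.writelines(new_lines)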

Thanks for your help!

1 ACCEPTED SOLUTION

Master Mentor

@code 

Have you considered using GenerateTableFetch, QueryDatabaseTable, or QueryDatabaseTableRecord? GenerateTableFetch generates SQL that you then feed to ExecuteSQL, while the QueryDatabaseTable processors run the query themselves; all of them track a maximum-value column so that each execution of your flow fetches only rows it has not seen before, instead of getting old and new entries every time. Avoiding ingesting duplicate entries is better than trying to find duplicate entries across multiple FlowFiles.
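
As a rough sketch of what that incremental fetching does conceptually (this is not what the processors literally execute; the table and column names here are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO my_table (payload) VALUES (?)",
                     [("a",), ("b",), ("c",)])

    # State kept between runs, like the processor's Maximum-value Column.
    last_max_id = 0

    def fetch_new_rows():
        global last_max_id
        rows = conn.execute(
            "SELECT id, payload FROM my_table WHERE id > ? ORDER BY id",
            (last_max_id,)).fetchall()
        if rows:
            last_max_id = rows[-1][0]  # remember the largest id seen so far
        return rows

    print(fetch_new_rows())  # first run: all three rows
    print(fetch_new_rows())  # second run: nothing new -> []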

You can detect duplicates within a single FlowFile using DeduplicateRecord; however, this requires that all records are merged into a single FlowFile.
You can use DetectDuplicate; however, this requires that each FlowFile contains a single entry to compare (see the sketch below).
Using these methods adds a lot of additional processing to your dataflow, or holds records longer than you want in your flow, so this is probably not the best solution.
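
Conceptually, DetectDuplicate-style filtering comes down to a shared "seen keys" cache. A minimal sketch in plain Python (NiFi uses a distributed map cache service for this; here it is just an in-memory set):

    # Stand-in for the distributed cache: keys we have already seen.
    seen_keys = set()

    def is_duplicate(key):
        """Return True if key was seen before; otherwise record it and return False."""
        if key in seen_keys:
            return True
        seen_keys.add(key)
        return False

    entries = ["row-1", "row-2", "row-1", "row-3"]
    print([e for e in entries if not is_duplicate(e)])  # ['row-1', 'row-2', 'row-3']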

If this response assisted with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt
