
Find and remove duplicate entries - NIFI

New Contributor

Hi,

is there a way to find and remove duplicate entries in two flowfiles?

I have one flowfile generated with SQL Processor with entries from my database. The other one contains new and "old" entries. So in order to only write the new entries in the database, I have to detect and remove the entries that already exist in the other flowfile.

I already tried the HashContentProcessor but it hashes the content of the whole file. I would need a processor that hashes line for line (and then compares all hashes with each other).

Thanks for your help!

1 ACCEPTED SOLUTION

Master Guru

@code 

Have you considered using GenerateTableFetch (which generates SQL that you then feed to ExecuteSQL), QueryDatabaseTable, or QueryDatabaseTableRecord to avoid getting both old and new entries with each execution of your existing flow? Avoiding ingesting duplicate entries is better than trying to find duplicate entries across multiple FlowFiles.
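For context, the incremental-fetch idea behind those processors can be sketched in a few lines of plain Python. This is only an illustration of the strategy (tracking a maximum-value column between runs), not NiFi's actual implementation; the table and column names here are hypothetical:

```python
# Sketch of the max-value-column strategy used by processors like
# QueryDatabaseTable: remember the highest id seen so far, and fetch
# only rows above it on the next run. (Hypothetical table/column names;
# NiFi keeps this state internally per processor.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO entries (id, name) VALUES (?, ?)",
                 [(1, "a"), (2, "b")])

last_max_id = 0  # state kept between executions of the flow

def fetch_new_entries(conn):
    """Fetch only rows added since the last run, then advance the state."""
    global last_max_id
    rows = conn.execute(
        "SELECT id, name FROM entries WHERE id > ? ORDER BY id",
        (last_max_id,),
    ).fetchall()
    if rows:
        last_max_id = rows[-1][0]
    return rows

first = fetch_new_entries(conn)   # initial run: all existing rows
conn.execute("INSERT INTO entries (id, name) VALUES (3, 'c')")
second = fetch_new_entries(conn)  # next run: only the new row
```

Because each run only sees rows added since the previous run, duplicates never enter the flow in the first place.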

You can detect duplicates within a single FlowFile using DeduplicateRecord; however, this requires that all records are merged into a single FlowFile.
You can use DetectDuplicate; however, this requires that each FlowFile contains one entry to compare.
These methods add a lot of additional processing to your dataflow, or hold records in your flow longer than you want, so they are probably not the best/most ideal solution.
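If you do end up comparing the contents of two FlowFiles directly, the per-line hashing described in the original question is conceptually simple. Here is a minimal, hedged Python sketch of that idea (plain Python, not a NiFi processor; the sample CSV lines are made up):

```python
# Sketch of line-level deduplication: hash each line of the "existing"
# content, then keep only lines from the "incoming" content whose hash
# has not been seen. In NiFi, DeduplicateRecord does the analogous work
# on records after they are merged into one FlowFile.
import hashlib

def line_hash(line: str) -> str:
    """Stable hash of a single line."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()

def new_lines_only(existing: str, incoming: str) -> list[str]:
    """Return incoming lines that do not already appear in existing."""
    seen = {line_hash(line) for line in existing.splitlines()}
    return [line for line in incoming.splitlines()
            if line_hash(line) not in seen]

fresh = new_lines_only("1,alice\n2,bob", "1,alice\n2,bob\n3,carol")
```

This only works if "duplicate" means byte-for-byte identical lines; any formatting difference (whitespace, field order) produces a different hash, which is one more reason the incremental-fetch approach above is preferable.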

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post.

Thank you,

Matt


