
Find and remove duplicate entries - NiFi

Explorer

Hi,

Is there a way to find and remove duplicate entries across two FlowFiles?

I have one FlowFile, generated with an SQL processor, that contains the entries from my database. The other one contains both new and "old" entries. So in order to write only the new entries to the database, I have to detect and remove the entries that already exist in the other FlowFile.

I already tried the HashContent processor, but it hashes the content of the whole file. I would need a processor that hashes line by line (and then compares all the hashes with each other).
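
To illustrate the idea outside NiFi, here is a minimal Python sketch of the line-by-line hashing I have in mind (the file names are made up):

    import hashlib

    # Hash every line of the file that holds the existing database entries.
    with open("existing_entries.txt", "rb") as f:
        seen = {hashlib.sha256(line).hexdigest() for line in f}

    # Keep only the lines of the incoming file whose hash was not seen before.
    with open("incoming_entries.txt", "rb") as f:
        new_lines = [line for line in f
                     if hashlib.sha256(line).hexdigest() not in seen]

    # These are the entries that still need to be written to the database.
    with open("new_entries.txt", "wb") as f:
        f.writelines(new_lines)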

Thanks for your help!

1 ACCEPTED SOLUTION

Master Mentor

@code 

Have you considered using GenerateTableFetch, QueryDatabaseTable, or QueryDatabaseTableRecord? GenerateTableFetch generates SQL that you then feed to ExecuteSQL, while the QueryDatabaseTable processors run the query themselves; all of them track a maximum-value column so that each execution of your flow fetches only rows it has not seen before, instead of getting old and new entries every time. Avoiding ingesting duplicate entries is better than trying to find duplicate entries across multiple FlowFiles.
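
As a rough sketch of what that incremental fetching does conceptually (this is not what the processors literally execute; the table and column names here are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO my_table (payload) VALUES (?)",
                     [("a",), ("b",), ("c",)])

    # State kept between runs, like the processor's Maximum-value Column.
    last_max_id = 0

    def fetch_new_rows():
        global last_max_id
        rows = conn.execute(
            "SELECT id, payload FROM my_table WHERE id > ? ORDER BY id",
            (last_max_id,)).fetchall()
        if rows:
            last_max_id = rows[-1][0]  # remember the largest id seen so far
        return rows

    print(fetch_new_rows())  # first run: all three rows
    print(fetch_new_rows())  # second run: nothing new -> []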

You can detect duplicates within a single FlowFile using DeduplicateRecord; however, this requires that all records are merged into a single FlowFile.
You can use DetectDuplicate; however, this requires that each FlowFile contains a single entry to compare (see the sketch below).
Using these methods adds a lot of additional processing to your dataflow, or holds records longer than you want in your flow, so this is probably not the best solution.
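
Conceptually, DetectDuplicate-style filtering comes down to a shared "seen keys" cache. A minimal sketch in plain Python (NiFi uses a distributed map cache service for this; here it is just an in-memory set):

    # Stand-in for the distributed cache: keys we have already seen.
    seen_keys = set()

    def is_duplicate(key):
        """Return True if key was seen before; otherwise record it and return False."""
        if key in seen_keys:
            return True
        seen_keys.add(key)
        return False

    entries = ["row-1", "row-2", "row-1", "row-3"]
    print([e for e in entries if not is_duplicate(e)])  # ['row-1', 'row-2', 'row-3']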

If this response assisted with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt
