NiFi - DetectDuplicate Processor - How to move all the duplicates into the duplicate flow instead of sending one record to the non-duplicate flow?

Explorer

Hi,

I am using DetectDuplicate in my NiFi flow to identify duplicates by a combination of two keys.

The scenario I am trying to implement is:

Input:

Key1 Key2 A B C

Key1 Key2 A C B

Key1 Key2 A D B

With the DetectDuplicate processor, two of the above records are routed to the duplicate flow and one record is routed to the non-duplicate flow. Is there a way to route all three records to the duplicate flow?

Master Mentor

@SamCloud

DetectDuplicate is designed to let the first FlowFile through and then route every later FlowFile with the same Cache Entry Identifier to duplicate. When a FlowFile reaches the processor, it checks the Distributed Map Cache for the Cache Entry Identifier. If the identifier does not exist, it is added to the cache and the FlowFile is routed to non-duplicate. If it is found, the FlowFile is routed to duplicate.
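
The check described above can be sketched in plain Python. This is only a simulation of the routing logic, not NiFi code: the `detect_duplicate` function, the in-memory `set` standing in for the Distributed Map Cache, and the choice of the first two fields as the cache entry identifier are all assumptions made for illustration.

```python
def detect_duplicate(flowfiles, cache=None):
    """Simulate first-seen routing; returns (non_duplicate, duplicate)."""
    # A plain set stands in for the Distributed Map Cache (assumption).
    cache = set() if cache is None else cache
    non_duplicate, duplicate = [], []
    for ff in flowfiles:
        # Hypothetical cache entry identifier: the first two fields.
        key = (ff[0], ff[1])
        if key in cache:
            # Identifier already cached: route to duplicate.
            duplicate.append(ff)
        else:
            # First time seen: add to the cache, route to non-duplicate.
            cache.add(key)
            non_duplicate.append(ff)
    return non_duplicate, duplicate


records = [
    ("Key1", "Key2", "A", "B", "C"),
    ("Key1", "Key2", "A", "C", "B"),
    ("Key1", "Key2", "A", "D", "B"),
]
non_dup, dup = detect_duplicate(records)
print(len(non_dup), len(dup))
```

With the three records from the question, the first one populates the cache and goes to non-duplicate, and the other two go to duplicate.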

Based on the description you provided, it is working as intended/designed.

Hope this helps,

Matt

Explorer

Hi Matt,

It's working as expected. Are there any settings or configs that would route even the first (non-duplicate) record to the duplicate flow once a duplicate of it is found? In other words, my requirement is to route all three matching records to the duplicate flow. Is that possible?

Thanks,

Sam

Master Mentor

@SamCloud

That is not really how the processor is designed to work. In order for a FlowFile to get routed to the duplicate path, the Distributed Map Cache must already have a matching entry in it.

All I can think of for you to try is setting up a second DistributedMapCacheServer and using DetectDuplicate a second time. The FlowFiles routed down the duplicate connection the first time are used to populate the second cache server. Then all the FlowFiles originally routed to the non-duplicate path are checked against that second DistributedMapCacheServer. The catch is that you have a bit of a race condition, since you need to make sure the FlowFiles on the duplicate path are processed before those on the non-duplicate path. For that you might want to use the Wait and Notify processors: Wait on the non-duplicate path, and Notify after the DetectDuplicate on the duplicate path. You can see where this gets very complicated. You may need to develop something custom here.
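
To make the two-cache idea above concrete, here is a rough Python simulation. It is not NiFi code and not a tested flow design: `route_all_duplicates`, the two in-memory sets standing in for the two cache servers, and the `held` list standing in for FlowFiles parked at a Wait processor are all hypothetical. The ordering requirement is modelled simply by running the second check only after the whole first pass has finished.

```python
def route_all_duplicates(flowfiles):
    """Simulate the two-cache approach; returns (unique, duplicate)."""
    first_cache = set()   # stands in for the first cache server
    second_cache = set()  # stands in for the second cache server
    held = []             # first-seen FlowFiles, held back (the Wait step)
    duplicate = []

    # First pass: ordinary first-seen detection.
    for ff in flowfiles:
        key = ff[:2]  # hypothetical identifier: the first two fields
        if key in first_cache:
            # The duplicate path populates the second cache.
            second_cache.add(key)
            duplicate.append(ff)
        else:
            first_cache.add(key)
            held.append(ff)

    # Second pass: runs only once the duplicate path is done (the Notify
    # step), so the second cache is complete before it is consulted.
    unique = []
    for ff in held:
        if ff[:2] in second_cache:
            duplicate.append(ff)
        else:
            unique.append(ff)
    return unique, duplicate


records = [
    ("Key1", "Key2", "A", "B", "C"),
    ("Key1", "Key2", "A", "C", "B"),
    ("Key3", "Key4", "X", "Y", "Z"),
    ("Key1", "Key2", "A", "D", "B"),
]
unique, duplicate = route_all_duplicates(records)
print(len(unique), len(duplicate))
```

In this simulation all three `Key1`/`Key2` records end up in `duplicate`, and only the `Key3`/`Key4` record remains in `unique`. In a real flow the second pass cannot simply wait for the input to end, which is where the race condition mentioned above comes from.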

Matt