Created 12-11-2017 02:50 AM
Hi, I am using the NiFi DetectDuplicate processor to filter out duplicates per second based on two attributes simultaneously. My flow file content is CSV. My DetectDuplicate processor configuration looks like this: detectduplicate.png
When I list the non-duplicate queue, I still see different flow files with exactly the same content. Help is appreciated. Thanks.
Created on 12-11-2017 03:27 AM - edited 08-17-2019 07:09 PM
In your DetectDuplicate processor, change the Age Off Duration property to no value (right now it is set to 1 sec).
The processor should then work as expected.
Age Off Duration is the time interval after which cached FlowFiles are aged off. The processor caches an entry for each flow file and uses that cache to detect duplicates, so when you set the property to 1 sec, each cached entry is kept for only one second.
Suppose you have two flow files with the same attributes. The processor can detect the duplicate only if both are processed by DetectDuplicate within one second of each other. If one flow file is processed at second 29 and the other at second 31, the processor won't detect that the second flow file is a duplicate, because Age Off Duration is configured to 1 sec.
Configs:-
Once you change the property, duplicate flow files will be routed to the duplicate relationship instead of the non-duplicate relationship.
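To make the 29 sec / 31 sec example concrete, here is a minimal Python sketch of the behaviour described above. It is not NiFi code: the `DuplicateDetector` class, its `route` method, and the `now` parameter are hypothetical names used only for illustration, and the exact expiry rule (an entry expires once it is older than the age-off value) is an assumption. The key is built from the two attributes mentioned in this thread, joined by a double colon.

```python
from typing import Dict, Optional


class DuplicateDetector:
    """Toy model of a duplicate cache with an optional age-off (not NiFi code)."""

    def __init__(self, age_off_seconds: Optional[float] = None) -> None:
        # None models "no value": cached entries never expire.
        self.age_off_seconds = age_off_seconds
        self._cached_at: Dict[str, float] = {}

    def route(self, device_no: str, device_value: str, now: float) -> str:
        # Composite key from both attributes, separated by a double colon.
        key = f"{device_no}::{device_value}"
        cached_at = self._cached_at.get(key)
        expired = (
            cached_at is not None
            and self.age_off_seconds is not None
            and now - cached_at > self.age_off_seconds
        )
        if cached_at is None or expired:
            # First sighting, or the old entry aged off: cache it again.
            self._cached_at[key] = now
            return "non-duplicate"
        return "duplicate"


if __name__ == "__main__":
    windowed = DuplicateDetector(age_off_seconds=1.0)
    print(windowed.route("dev1", "42", now=29.0))
    print(windowed.route("dev1", "42", now=31.0))

    unbounded = DuplicateDetector()
    print(unbounded.route("dev1", "42", now=29.0))
    print(unbounded.route("dev1", "42", now=31.0))
```

With the 1 sec age-off, the second flow file arrives after the cached entry has expired, so both are treated as non-duplicates. With no age-off, the second one is routed as a duplicate.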
Created 12-11-2017 07:20 AM
Thanks @Shu. (1) I want to drop a flow file as a duplicate within a 1-second window if both attribute values (i.e., device_no and device_value) were already seen in the past 1 sec. If I remove the Age Off Duration (1 sec), how will that work? (2) For the Cache Entry Identifier, I am trying to detect duplicates based on two attributes (i.e., device_no and device_value) separated by a double colon. Is this the correct way of doing this?
Created 12-11-2017 04:52 PM
Just want to make sure: you are extracting the contents of the CSV with the ExtractText processor, keeping the device_id and device_value attributes associated with the flow file, and then using the DetectDuplicate processor with Age Off Duration set to 1 sec.
Question 1. I want to delete duplicates within a 1-second window if both attribute values (i.e., device_no and device_value) already exist in the past 1 sec.
Question 2. What if I delete the Age Off Duration (1 sec)?
Created 12-12-2017 03:10 AM
Thanks @Shu
Created 03-11-2020 09:37 PM
Can "Age Off Duration" be set to 7 days to detect any duplicate files arriving within the last 7 days? What would the performance impact of that be?
Created 03-11-2020 09:50 PM
As this thread was marked 'Solved' in Dec 2017 you would have a better chance of receiving a useful response by starting a new thread. This will also provide the opportunity to provide details specific to your environment that could aid others in providing a more accurate answer to your question.
Created 04-10-2020 04:02 PM
Sure.
Fortunately, in this case I was able to find the answer myself. I will remember to open a new thread next time.