Detect duplicate based on 2 attributes per second: NIFI
Labels: Apache NiFi
Created ‎12-11-2017 02:50 AM
Hi, I am using the NiFi DetectDuplicate processor to filter out duplicates per second based on 2 attributes simultaneously. My flowfile content is CSV. My DetectDuplicate processor configuration looks like this: detectduplicate.png
When I list the non-duplicate queue, I still see duplicates: different flowfiles with exactly the same content. Help is appreciated. Thanks.
Created on ‎12-11-2017 03:27 AM - edited ‎08-17-2019 07:09 PM
In your DetectDuplicate processor, change the property
Age Off Duration to No value // right now you have it set to 1 sec.
Then the processor should work as expected.
Age Off Duration is the time interval after which cached flowfiles are aged off. The processor caches flowfiles and detects duplicates against that cache, so with the property set to 1 sec a cached entry is only compared against new flowfiles for 1 second.
Let's say you have 2 flowfiles with the same attributes. The processor can detect the duplicate only if both flowfiles pass through DetectDuplicate less than a second apart. If one flowfile is processed at second 29 and the other at second 31, the processor won't detect that the second flowfile is a duplicate, because age off is configured to 1 sec.
Once you change the property, duplicate flowfiles will be routed to the duplicate relationship instead of the non-duplicate relationship.
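To make the 29 sec / 31 sec example concrete, here is a minimal Python sketch of the age-off behaviour described above. It is only a toy model, not NiFi code: the `AgeOffCache` class, its `is_duplicate` method and the `"same-attrs"` key are hypothetical names for illustration, and `None` stands in for "No value".

```python
from typing import Dict, Optional


class AgeOffCache:
    """Toy stand-in for the cache that duplicate detection consults."""

    def __init__(self, age_off_seconds: Optional[float]) -> None:
        # None models "No value": cached entries never age off.
        self.age_off_seconds = age_off_seconds
        self._first_seen: Dict[str, float] = {}

    def is_duplicate(self, key: str, now: float) -> bool:
        seen_at = self._first_seen.get(key)
        if seen_at is not None:
            expired = (
                self.age_off_seconds is not None
                and now - seen_at >= self.age_off_seconds
            )
            if not expired:
                return True  # entry is still cached, so this is a duplicate
        # First sighting, or the old entry aged off: (re)cache and pass through.
        self._first_seen[key] = now
        return False


# Age off = 1 sec: the flowfile seen at second 31 is NOT flagged.
windowed = AgeOffCache(age_off_seconds=1.0)
windowed_results = [
    windowed.is_duplicate("same-attrs", now=29.0),
    windowed.is_duplicate("same-attrs", now=31.0),
]

# No age off: the flowfile seen at second 31 IS flagged.
no_age_off = AgeOffCache(age_off_seconds=None)
no_age_off_results = [
    no_age_off.is_duplicate("same-attrs", now=29.0),
    no_age_off.is_duplicate("same-attrs", now=31.0),
]

print(windowed_results)    # [False, False]
print(no_age_off_results)  # [False, True]
```

With the 1 sec setting both flowfiles end up in non-duplicate, which matches what the question describes.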
Created ‎12-11-2017 07:20 AM
Thanks @Shu. (1) I want to drop a duplicate within a 1-second window, i.e. when both attribute values (device_no and device_value) have already appeared in the past 1 sec. If I remove the Age Off Duration (1 sec), how will that work? (2) For the Cache Entry Identifier, I am trying to detect duplicates based on 2 attributes (device_no and device_value), separated by a double colon. Is this the correct way of doing this?
Created ‎12-11-2017 04:52 PM
Just to confirm: you are extracting the contents of the CSV with the ExtractText processor, keeping the device_no and device_value attributes on the flowfile, and then using the DetectDuplicate processor with Age Off Duration set to 1 sec.
Question 1. I want to delete a duplicate within a 1-second window if both attribute values (device_no and device_value) already exist in the past 1 sec.
- I don't know the details of your 1-second window requirement, but if that is what you need, then your configuration is correct. Age Off Duration is the time interval after which cached flowfiles are aged off. The processor caches flowfiles and detects duplicates against that cache, so when you set the property to 1 sec:
- Let's say you have 2 flowfiles with the same attributes, i.e. device_no=1 and device_value=10. The processor can detect that the second flowfile is a duplicate only if both flowfiles pass through DetectDuplicate less than a second apart.
- If the first flowfile is processed at 2017-12-11 04:00:29 and the second at 2017-12-11 04:00:31, the processor won't detect the second flowfile as a duplicate, even though it has the same attribute values (device_no=1 and device_value=10) as the first. The time between processing the first and second flowfile is 2 sec (2017-12-11 04:00:29 to 2017-12-11 04:00:31), and with age off configured to 1 sec the flowfile is cached for only 1 sec. After that, the cached flowfile is no longer compared with new flowfiles.
Question 2. What if I delete the Age Off Duration (1 sec)?
- When you remove the Age Off Duration, all cached flowfile attributes are compared with the new flowfiles processed by DetectDuplicate until the DistributedMapCacheServer reaches its Maximum Cache Entries. At that point cached entries are evicted according to the Eviction Strategy to make room for new entries, and there is no time window.
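As a rough illustration of both points (a key built from the two attributes, and a size-bounded cache with no time window), here is a small Python sketch. It is not NiFi code: `cache_key` and `BoundedSeenKeys` are hypothetical names, the double-colon key mirrors what a Cache Entry Identifier such as `${device_no}::${device_value}` would evaluate to (the exact expression in the screenshot is an assumption), and first-in-first-out is just one eviction strategy chosen for the example.

```python
from collections import OrderedDict
from typing import Dict


def cache_key(attributes: Dict[str, str]) -> str:
    """Join the two attributes with a double colon to form one cache key."""
    return f"{attributes['device_no']}::{attributes['device_value']}"


class BoundedSeenKeys:
    """Toy cache with no age off: entries leave only when the cache is full."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._keys: "OrderedDict[str, None]" = OrderedDict()

    def is_duplicate(self, key: str) -> bool:
        if key in self._keys:
            return True
        if len(self._keys) >= self.max_entries:
            # Evict the oldest entry (first in, first out) to make room.
            self._keys.popitem(last=False)
        self._keys[key] = None
        return False


seen = BoundedSeenKeys(max_entries=2)
a = {"device_no": "1", "device_value": "10"}
b = {"device_no": "1", "device_value": "11"}
c = {"device_no": "2", "device_value": "10"}

results = [seen.is_duplicate(cache_key(f)) for f in (a, a, b, c, a)]
print(cache_key(a))  # 1::10
print(results)       # [False, True, False, False, False]
```

The last `a` is not flagged because its entry was evicted when `c` arrived. Without an age off, how long a duplicate stays detectable depends on the cache size rather than on elapsed time.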
Created ‎12-12-2017 03:10 AM
Thanks @Shu
Created ‎03-11-2020 09:37 PM
Can "Age Off Duration" be set to 7 days to detect any duplicate files that arrived in the last 7 days? What would the performance impact of that be?
Created ‎03-11-2020 09:50 PM
As this thread was marked 'Solved' in Dec 2017, you would have a better chance of receiving a useful response by starting a new thread. That would also let you provide details specific to your environment, which could help others give a more accurate answer to your question.
Created ‎04-10-2020 04:02 PM
Sure.
Fortunately, in this case I was able to find the answer myself. I will remember to open a new thread next time.
