Detect duplicate based on 2 attributes per second: NIFI

Hi, I am using the NiFi DetectDuplicate processor to filter out duplicates per second based on 2 attributes simultaneously. My flowfile content is in CSV format. My DetectDuplicate processor configuration looks like this: detectduplicate.png

I am still getting flowfiles with exactly the same content when I list the non-duplicate queue. Help is appreciated. Thanks.

1 ACCEPTED SOLUTION

Master Guru

@swati tiwari

In your DetectDuplicate processor, change the Age Off Duration property to no value (right now it is set to 1 sec).

Then the processor should work as expected.

Age Off Duration is the time interval after which cached FlowFiles are aged off. The processor caches an entry for each flowfile and detects duplicates against that cache, so when you set the property to 1 sec, each cached entry is only used for comparison for 1 second.

Suppose you have 2 flowfiles with the same attributes. The processor can detect the duplicate only if both flowfiles pass through DetectDuplicate within 1 sec of each other. If one flowfile is processed at second 29 and the other at second 31, the processor won't detect that the second flowfile is a duplicate, because Age Off Duration is configured to 1 sec.
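The 29 sec / 31 sec case can be sketched in a few lines of Python. This is only a toy model of the caching behaviour described above, not NiFi code: the class `AgeOffCache`, its method `is_duplicate`, and the key `"same-attributes"` are hypothetical names, and treating an entry as expired once its age reaches the age-off value is an assumption about the boundary case.

```python
class AgeOffCache:
    """Toy stand-in for the cache that duplicate detection consults (not NiFi code)."""

    def __init__(self, age_off_seconds=None):
        self.age_off_seconds = age_off_seconds  # None means entries never age off
        self.first_seen = {}  # cache key -> time (in seconds) the entry was stored

    def is_duplicate(self, key, now):
        stored = self.first_seen.get(key)
        expired = (
            stored is not None
            and self.age_off_seconds is not None
            and now - stored >= self.age_off_seconds  # boundary handling is an assumption
        )
        if stored is None or expired:
            # Unseen or aged-off key: store it and treat the flowfile as non-duplicate.
            self.first_seen[key] = now
            return False
        return True


one_sec = AgeOffCache(age_off_seconds=1)
print(one_sec.is_duplicate("same-attributes", now=29))  # False: first flowfile
print(one_sec.is_duplicate("same-attributes", now=31))  # False: entry aged off after 1 sec

no_age_off = AgeOffCache(age_off_seconds=None)
print(no_age_off.is_duplicate("same-attributes", now=29))  # False: first flowfile
print(no_age_off.is_duplicate("same-attributes", now=31))  # True: entry is still cached
```

With the 1 sec age-off, the second flowfile is treated as new; with no age-off, it is flagged as a duplicate.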

Configs:

43916-detect.png

Once you change the property, duplicate flowfiles will be routed to the duplicate relationship instead of the non-duplicate relationship.

7 REPLIES

Thanks @Shu.

(1) I want to drop a flowfile as a duplicate within a 1-second window, i.e. if both attribute values (device_no and device_value) already occurred in the past 1 sec. If I delete the Age Off Duration (1 sec), how will that work?

(2) In Cache Entry Identifier I am trying to detect duplicates based on the 2 attributes (device_no and device_value), separated by a double colon. Is this the correct way of doing this?

Master Guru

@swati tiwari

Just to confirm: you are extracting the contents of the CSV with the ExtractText processor, keeping the device_no and device_value attributes associated with the flowfile, and then using the DetectDuplicate processor with Age Off Duration set to 1 sec.

Question 1. I want to delete duplicates in a 1-second window frame if both attribute values (i.e., device_no and device_value) already exist in the past 1 sec.

  • I'm not sure what you mean by a 1-second window frame, but if that is what you need, then your configuration is correct. Age Off Duration is the time interval after which cached FlowFiles are aged off. The processor caches the flowfiles and detects duplicates against the cache, so when you set the property to 1 sec:
  • Suppose you have 2 flowfiles with the same attributes, i.e. device_no=1 and device_value=10. The processor can detect that the second flowfile is a duplicate only if both flowfiles pass through DetectDuplicate within 1 sec of each other.
  • If the first flowfile is processed at 2017-12-11 04:00:29 and the second at 2017-12-11 04:00:31, the processor won't detect the second flowfile as a duplicate, even though it has the same attribute values (device_no=1 and device_value=10) as the first. The time between the two is 2 sec (2017-12-11 04:00:29 to 2017-12-11 04:00:31) and Age Off Duration is configured to 1 sec, so each entry is cached for 1 sec; once an entry is older than that, it is no longer compared with new flowfiles.
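To illustrate how the two attributes combine into a single cache key, here is a minimal Python sketch. It assumes the Cache Entry Identifier joins the two values with a double colon, something like `${device_no}::${device_value}`. The screenshot is not reproduced here, so that exact expression is an assumption, and the helper `cache_entry_id` is a hypothetical name. The time window is deliberately left out so that only the key comparison is shown.

```python
def cache_entry_id(attributes, separator="::"):
    """Join the two attribute values into one cache key (illustrative helper)."""
    return separator.join([attributes["device_no"], attributes["device_value"]])


flowfiles = [
    {"device_no": "1", "device_value": "10"},
    {"device_no": "1", "device_value": "10"},  # same pair as the first flowfile
    {"device_no": "1", "device_value": "11"},  # same device, different value
]

seen = set()
results = []
for attrs in flowfiles:
    key = cache_entry_id(attrs)
    results.append((key, key in seen))  # True means the key was already cached
    seen.add(key)

for key, duplicate in results:
    print(key, "duplicate" if duplicate else "non-duplicate")
```

Only the second flowfile matches an existing key. The third differs in device_value, so it produces a different key and is not a duplicate.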

Question 2. What if I delete the Age Off Duration (1 sec)?

  • When you remove the Age Off Duration, all cached flowfile attributes are compared with every new flowfile processed by DetectDuplicate until the DistributedMapCacheServer reaches Maximum Cache Entries. At that point, cached entries are evicted according to the Eviction Strategy to make room for new entries. There is no time window.
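A rough Python sketch of a size-bounded cache with no age-off may help here. It is not NiFi code: `BoundedCache` is a hypothetical class, and first-in-first-out eviction is used purely for simplicity. The strategy the cache server actually applies depends on how its Eviction Strategy property is configured.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy cache with a size limit and first-in-first-out eviction (not NiFi code)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # insertion order doubles as eviction order

    def is_duplicate(self, key):
        if key in self.entries:
            return True
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)  # evict the oldest entry to make room
        self.entries[key] = True
        return False


cache = BoundedCache(max_entries=2)
keys = ["1::10", "2::20", "1::10", "3::30", "1::10"]
outcomes = [cache.is_duplicate(k) for k in keys]
print(outcomes)
```

The third key is flagged as a duplicate however much time has passed. The fifth is not, because adding "3::30" evicted the "1::10" entry once the cache was full.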

Thanks @Shu

Explorer

Can "Age Off Duration" be set to 7 days to detect any duplicate files arriving in the last 7 days? What would the performance impact be?

@Vj1989

As this thread was marked 'Solved' in Dec 2017, you would have a better chance of receiving a useful response by starting a new thread. This will also give you the opportunity to provide details specific to your environment that could help others give a more accurate answer to your question.

Bill Brooks, Community Moderator

Explorer

Sure.

Fortunately, in this case I was able to find the answer myself. I will remember to open a new thread next time.