Detect duplicate based on 2 attributes per second: NIFI

New Contributor

Hi, I am using the NiFi DetectDuplicate processor to filter out duplicates per second based on two attributes simultaneously. My flowfile content is CSV. My DetectDuplicate processor configuration looks like this: detectduplicate.png

When I list the non-duplicate queue, I still see duplicates: different flowfiles with exactly the same content. Any help is appreciated. Thanks.

Accepted Solutions

Re: Detect duplicate based on 2 attributes per second: NIFI

Super Guru

@swati tiwari

In your DetectDuplicate processor, change the property

Age Off Duration to no value (right now it is set to 1 sec).

The processor should then work as expected.

Age Off Duration is the time interval after which cached flowfiles are aged off. The processor caches flowfiles and checks new ones against the cache, so with the property set to 1 sec a cached entry is only compared against new flowfiles for one second.

Suppose you have two flowfiles with the same attributes. The processor can detect the second one as a duplicate only if both pass through DetectDuplicate less than a second apart. If one flowfile is processed at second 29 and the other at second 31, the processor won't detect the second as a duplicate, because Age Off Duration is configured as 1 sec.

Configs:-

43916-detect.png

Once you change the property, duplicate flowfiles will be routed to the duplicate relationship instead of the non-duplicate relationship.
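The age-off behaviour described in this answer can be illustrated with a small Python model. This is only a sketch of the idea, not NiFi's implementation: the class name `AgeOffDuplicateDetector`, its method, and the numeric timestamps are hypothetical, and it assumes an entry counts as aged off once it is at least `age_off_seconds` old.

```python
from typing import Dict, Optional


class AgeOffDuplicateDetector:
    """Toy model of a duplicate check backed by a cache with an optional age-off."""

    def __init__(self, age_off_seconds: Optional[float] = None) -> None:
        # None models an empty Age Off Duration: entries never age off.
        self.age_off_seconds = age_off_seconds
        self._cached_at: Dict[str, float] = {}

    def is_duplicate(self, key: str, now: float) -> bool:
        cached_at = self._cached_at.get(key)
        if cached_at is not None:
            aged_off = (
                self.age_off_seconds is not None
                and now - cached_at >= self.age_off_seconds
            )
            if not aged_off:
                return True
        # Unknown key, or the cached entry aged off: cache it again
        # and report the flowfile as a non-duplicate.
        self._cached_at[key] = now
        return False


# Age-off of 1 sec: the flowfile at second 31 is NOT flagged.
windowed = AgeOffDuplicateDetector(age_off_seconds=1)
print(windowed.is_duplicate("1::10", now=29.0))  # False
print(windowed.is_duplicate("1::10", now=31.0))  # False

# No age-off: the flowfile at second 31 IS flagged.
no_age_off = AgeOffDuplicateDetector()
print(no_age_off.is_duplicate("1::10", now=29.0))  # False
print(no_age_off.is_duplicate("1::10", now=31.0))  # True
```

With the 1 sec age-off, only flowfiles arriving less than a second after the cached entry are reported as duplicates, which matches the 29 sec / 31 sec case above.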


Re: Detect duplicate based on 2 attributes per second: NIFI

New Contributor

Thanks @Shu.

(1) I want to drop a flowfile as a duplicate within a 1-second window, that is, if both attribute values (device_no and device_value) already appeared in the past 1 sec. If I remove the Age Off Duration (1 sec), how will that work?

(2) For the Cache Entry Identifier, I am trying to detect duplicates based on the two attributes (device_no and device_value), separated by a double colon. Is this the correct way of doing this?

Re: Detect duplicate based on 2 attributes per second: NIFI

Super Guru

@swati tiwari

Just to confirm: you are extracting the contents of the CSV with the ExtractText processor, keeping the device_no and device_value attributes on the flowfile, and then using the DetectDuplicate processor with Age Off Duration set to 1 sec.
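As a rough illustration of the two-attribute key discussed in this thread, the sketch below joins device_no and device_value with a double colon for each CSV row. The CSV layout (a header row with exactly these column names) and the helper names `cache_key` and `keys_from_csv` are assumptions made for the example; the actual file format is not shown in the thread.

```python
import csv
import io
from typing import Dict, List, Sequence


def cache_key(
    row: Dict[str, str],
    fields: Sequence[str] = ("device_no", "device_value"),
    sep: str = "::",
) -> str:
    """Join the chosen fields into one composite key, e.g. '1::10'."""
    return sep.join(row[name] for name in fields)


def keys_from_csv(text: str) -> List[str]:
    """Build one composite key per CSV row (assumes a header row)."""
    reader = csv.DictReader(io.StringIO(text))
    return [cache_key(row) for row in reader]


# Hypothetical sample data; rows 1 and 2 share both attribute values.
sample = "device_no,device_value\n1,10\n1,10\n2,10\n"
print(keys_from_csv(sample))  # ['1::10', '1::10', '2::10']
```

Two rows produce the same key only when both values match, which is the condition for treating them as duplicates here. In NiFi, a Cache Entry Identifier such as `${device_no}::${device_value}` would express the same idea in Expression Language, assuming the flowfile carries attributes with those names.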

Question 1: I want to delete duplicates in a 1-second window if both attribute values (device_no and device_value) already exist in the past 1 sec.

  • If a 1-second window is what you need, then your configuration is correct. Age Off Duration is the time interval after which cached flowfiles are aged off. The processor caches flowfiles and checks new ones against the cache, so when you set the property to 1 sec:
  • Suppose you have two flowfiles with the same attributes, i.e. device_no=1 and device_value=10. The processor can detect the second flowfile as a duplicate only if both are processed by DetectDuplicate less than a second apart.
  • If the first flowfile is processed at 2017-12-11 04:00:29 and the second at 2017-12-11 04:00:31, the processor won't detect the second flowfile as a duplicate, even though it has the same attribute values (device_no=1 and device_value=10) as the first. The gap between the two is 2 sec (04:00:29 to 04:00:31) and Age Off Duration is 1 sec, so the entry is cached for only 1 sec; once it is older than that, it is no longer compared with new flowfiles.

Question 2: What if I delete the Age Off Duration (1 sec)?

  • Without an Age Off Duration, all cached flowfile attributes are compared with every new flowfile processed by DetectDuplicate until the DistributedMapCacheServer reaches its Maximum Cache Entries. After that, cached entries are evicted according to the Eviction Strategy to make room for new ones, and there is no time window.
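The effect of a full cache can be sketched in Python as well. This is a toy model, not the cache server's code: the class `BoundedKeyCache` is hypothetical, and least-recently-used eviction is assumed here purely for illustration (the strategy actually used depends on the Eviction Strategy setting).

```python
from collections import OrderedDict


class BoundedKeyCache:
    """Toy cache with a size limit and least-recently-used eviction."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, bool]" = OrderedDict()

    def seen_before(self, key: str) -> bool:
        if key in self._entries:
            # Mark the key as recently used so it is evicted last.
            self._entries.move_to_end(key)
            return True
        self._entries[key] = True
        if len(self._entries) > self.max_entries:
            # Cache is full: drop the least recently used key.
            self._entries.popitem(last=False)
        return False


cache = BoundedKeyCache(max_entries=2)
print(cache.seen_before("1::10"))  # False (first time)
print(cache.seen_before("2::20"))  # False (first time)
print(cache.seen_before("1::10"))  # True  (still cached)
print(cache.seen_before("3::30"))  # False (evicts "2::20")
print(cache.seen_before("2::20"))  # False (was evicted, so not detected)
```

The last call shows the trade-off: with no time window, a duplicate is missed only if its entry was already evicted to make room for newer keys.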

Re: Detect duplicate based on 2 attributes per second: NIFI

New Contributor

Thanks @Shu

Re: Detect duplicate based on 2 attributes per second: NIFI

Explorer

Can "Age Off Duration" be set to 7 days to detect any duplicate files that arrived within the last 7 days? What would the performance impact be?

Re: Detect duplicate based on 2 attributes per second: NIFI

Community Manager

@Vj1989

As this thread was marked 'Solved' in Dec 2017, you would have a better chance of receiving a useful response by starting a new thread. That would also let you provide details specific to your environment, which could help others give a more accurate answer to your question.

Bill Brooks, Community Manager

Re: Detect duplicate based on 2 attributes per second: NIFI

Explorer

Sure.

Fortunately, in this case I was able to find the answer myself. I will remember to open a new thread next time.
