Created 09-07-2017 03:12 PM
Hello,
Is it possible to compare the attributes of two different FlowFiles and pass only one of them if the comparison matches?
Thank you,
Jon
Created 09-12-2017 12:45 PM
Are you trying to see if all attributes from both FlowFiles match exactly, or is there a specific attribute from each FlowFile you want to compare?
My initial thought would be to use the DetectDuplicate processor.
You could write the unique attribute to the DistributedMapCache service.
Then compare new FlowFiles against that stored value and delete any duplicates.
That way only the first FlowFile would get passed on.
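To make the idea concrete, here is a minimal conceptual sketch in Python. It is not NiFi code: FlowFiles are modelled as plain dicts of attributes, the `seen` set stands in for the DistributedMapCache, and the function name `pass_first_only` and the attribute name `order.id` are made up for illustration.

```python
def pass_first_only(flowfiles, key_attribute):
    """Yield only the first FlowFile seen for each value of key_attribute.

    flowfiles: iterable of dicts mapping attribute names to values
               (a hypothetical stand-in for real FlowFiles).
    """
    seen = set()  # hypothetical stand-in for the DistributedMapCache
    for attributes in flowfiles:
        key = attributes.get(key_attribute)
        if key in seen:
            continue  # value already stored: treat as a duplicate and drop it
        seen.add(key)  # first sighting: remember the value
        yield attributes  # and pass the FlowFile on


incoming = [{"order.id": "A"}, {"order.id": "B"}, {"order.id": "A"}]
passed = list(pass_first_only(incoming, "order.id"))
```

Only the first FlowFile carrying each `order.id` value ends up in `passed`; the second `"A"` is dropped.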
Thanks,
Matt
Created 09-12-2017 12:28 PM
Have you checked out NiFi's RouteOnAttribute processor? It can compare the attributes of incoming flowfiles and handle accordingly based on the routing strategy you select.
Created 09-12-2017 12:49 PM
Yes, I've tried to use RouteOnAttribute, but the thing is that I want to compare the attributes of two different FlowFiles... and as far as I understand, RouteOnAttribute doesn't allow this kind of comparison... tell me if I'm wrong!
Created 09-12-2017 12:53 PM
Ah, I overlooked the "only pass one" goal in the original question. As @Matt Clarke mentioned, looks like DetectDuplicate might help with that part.
Created 09-12-2017 12:48 PM
Hi @Matt Clarke,
I just want to compare some attributes from both flowfiles... I'll try with that processor and I'll be back!
Created 09-13-2017 09:38 AM
Hi @Matt Clarke,
That processor did the trick. It's exactly what I was looking for. Thank you so much.
Best,
Jon
Created 09-13-2017 12:48 PM
Glad this worked for you.
As far as your new question:
The value written to the DistributedMapCache remains in the cache for a configured amount of time or until a configured maximum number of entries is reached. So you can compare many FlowFiles against this stored value: any FlowFile that matches a stored value is considered a duplicate. It is not a one-time match of a single duplicate.
It would be very expensive to build a NiFi processor that reads large batches of queued FlowFiles from an inbound queue to compare FlowFile attributes. FlowFile attributes live in heap memory, so the more FlowFiles you pull in for a comparison, the more likely you are to run into an Out Of Memory error. And if you limit the size of the comparisons, how do you know a given batch contains the actual FlowFiles you want to compare?
This is why DetectDuplicate makes use of an external service and compares FlowFiles against a stored value one FlowFile at a time.
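The behaviour described above can be modelled roughly as follows. This is only a conceptual Python sketch, not how NiFi implements the cache: the class name `BoundedSeenCache`, its parameters, and the oldest-entry-first eviction are assumptions chosen for illustration.

```python
import time
from collections import OrderedDict


class BoundedSeenCache:
    """Hypothetical model of a cache that ages entries off and caps their number."""

    def __init__(self, max_entries, age_off_seconds, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> time the key was stored
        self._max_entries = max_entries
        self._age_off = age_off_seconds
        self._clock = clock

    def is_duplicate(self, key):
        """Check one key at a time; store it if it is new or has aged off."""
        now = self._clock()
        stored_at = self._entries.get(key)
        if stored_at is not None and now - stored_at < self._age_off:
            return True  # still cached: every later match is a duplicate
        # First sighting, or the old entry has expired: (re)store the key.
        self._entries.pop(key, None)
        self._entries[key] = now
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # assumed policy: evict oldest
        return False
```

Each call inspects a single key, so memory use is bounded by `max_entries` rather than by the number of FlowFiles waiting in a queue.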
Thanks,
Matt
Created 09-15-2017 12:03 PM
Hi @Matt Clarke,
Thank you. So, how about cleaning this cache eventually? Is it possible to clean it whenever a duplicate is found? I'm trying with the Eviction Strategy property but I'm not getting anywhere so far... I would like to clean the cache whenever a duplicate is found.
Thanks!
Created 09-15-2017 01:42 PM
There are no dedicated processors for removing cached entries from the distributed map cache.
You can try using the "Age Off Duration" property in the DetectDuplicate processor, or use a scripting processor in NiFi to execute a script that clears the cache.
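As a rough illustration of the "clear the entry as soon as its duplicate is found" behaviour asked about, here is a conceptual Python sketch. It does not use any NiFi or cache-client API: FlowFiles are plain dicts of attributes, the `pending` set stands in for the cache, and the names `route_and_clear` and `id` are hypothetical.

```python
def route_and_clear(flowfiles, key_attribute):
    """Split FlowFiles into (non_duplicates, duplicates).

    A cached key is removed as soon as one duplicate of it is found,
    so the next FlowFile with that key counts as new again.
    """
    pending = set()  # hypothetical stand-in for the cache
    non_duplicates, duplicates = [], []
    for attributes in flowfiles:
        key = attributes.get(key_attribute)
        if key in pending:
            duplicates.append(attributes)
            pending.discard(key)  # clear the entry once its duplicate is found
        else:
            pending.add(key)
            non_duplicates.append(attributes)
    return non_duplicates, duplicates


first, second = route_and_clear([{"id": "A"}, {"id": "A"}, {"id": "A"}], "id")
```

With three FlowFiles sharing the same `id`, the second is flagged as a duplicate and clears the entry, so the third is treated as a first sighting again. A script run from a scripting processor would need to do the equivalent removal against the real cache.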
The following Jira covers this missing processor and also provides a sample template