Compare attributes of different flowfiles

Explorer

Hello,

Is it possible to compare the attributes of two different flowfiles and only pass one on if the comparison matches?

Thank you,

Jon

1 ACCEPTED SOLUTION

Super Mentor

@Jon Rodriguez Breton

Are you trying to see if all attributes from both FlowFiles match exactly, or is there a specific attribute from each FlowFile you want to compare?

My initial thought would be to use the DetectDuplicate processor.

You could write the unique attribute to the DistributedMapCache service.
Then compare new FlowFiles against that stored value and delete any duplicates.
That way only the first FlowFile would get passed on.
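
The idea above can be sketched in a few lines of Python. This is only an illustration of the check-then-store logic, not NiFi code: a plain dict stands in for the DistributedMapCache, and the function name `detect_duplicate`, the attribute name `order.id`, and the file names are all made up for the example.

```python
from typing import Dict


def detect_duplicate(attributes: Dict[str, str],
                     cache: Dict[str, str],
                     key_attribute: str) -> str:
    """Return 'non-duplicate' the first time a key value is seen, else 'duplicate'."""
    key = attributes[key_attribute]
    if key in cache:
        # A FlowFile with the same attribute value was already stored.
        return "duplicate"
    # First FlowFile with this value: remember it and let it through.
    cache[key] = attributes.get("filename", "")
    return "non-duplicate"


# Hypothetical FlowFiles, represented only by their attributes.
cache: Dict[str, str] = {}
first = {"filename": "a.csv", "order.id": "42"}
second = {"filename": "b.csv", "order.id": "42"}
third = {"filename": "c.csv", "order.id": "7"}

results = [detect_duplicate(f, cache, "order.id") for f in (first, second, third)]
print(results)  # ['non-duplicate', 'duplicate', 'non-duplicate']
```

In NiFi itself, the value to compare is typically chosen through the DetectDuplicate processor's Cache Entry Identifier property using Expression Language, for example `${order.id}` for the hypothetical attribute above.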

Thanks,

Matt


10 REPLIES

Contributor

@Jon Rodriguez Breton

Have you checked out NiFi's RouteOnAttribute processor? It can compare the attributes of incoming flowfiles and handle accordingly based on the routing strategy you select.

Explorer

Yes, I've tried RouteOnAttribute, but I want to compare the attributes of two different flowfiles, and as far as I understand RouteOnAttribute doesn't allow this kind of comparison. Tell me if I'm wrong!

Contributor

Ah, I overlooked the "only pass one" goal in the original question. As @Matt Clarke mentioned, looks like DetectDuplicate might help with that part.

Explorer

Hi @Matt Clarke,

I just want to compare some attributes from both flowfiles... I'll try with that processor and I'll be back!

Explorer

Hi @Matt Clarke,

That processor did the trick. It's exactly what I was looking for. Thank you so much.

Best,

Jon

Super Mentor

@Jon Rodriguez Breton

Glad this worked for you.

As for your new question:

The value written to the DistributedMapCache stays in the cache for a configured amount of time, or until the configured maximum number of entries is reached. So you can compare many FlowFiles against this stored value: any FlowFile that matches a stored value is considered a duplicate. It is not a one-time match of a single duplicate.
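
To make the retention behaviour concrete, here is a rough Python sketch of a cache that drops entries either by age or when a size limit is hit. It is a stand-in written for illustration, not the DistributedMapCache implementation: the class name `ExpiringCache`, the parameters `max_entries` and `age_off_seconds`, and the first-in-first-out eviction are all assumptions, and the time is passed in explicitly so the example is deterministic.

```python
from collections import OrderedDict


class ExpiringCache:
    """Toy cache: entries leave by age or when the entry limit is reached."""

    def __init__(self, max_entries: int, age_off_seconds: float) -> None:
        self.max_entries = max_entries
        self.age_off_seconds = age_off_seconds
        # key -> time the key was stored; insertion order is oldest first.
        self._entries: "OrderedDict[str, float]" = OrderedDict()

    def _expire(self, now: float) -> None:
        # Drop entries from the old end until one is still young enough.
        while self._entries:
            _, stored_at = next(iter(self._entries.items()))
            if now - stored_at < self.age_off_seconds:
                break
            self._entries.popitem(last=False)

    def seen_before(self, key: str, now: float) -> bool:
        """Return True if key is cached; otherwise store it and return False."""
        self._expire(now)
        if key in self._entries:
            return True
        if len(self._entries) >= self.max_entries:
            # Assumed first-in-first-out eviction of the oldest entry.
            self._entries.popitem(last=False)
        self._entries[key] = now
        return False


store = ExpiringCache(max_entries=2, age_off_seconds=60)
checks = [
    store.seen_before("42", now=0),   # first time: stored
    store.seen_before("42", now=10),  # matches the stored value
    store.seen_before("42", now=20),  # still matches, not a one-time match
    store.seen_before("42", now=61),  # entry aged off, treated as new again
]
print(checks)  # [False, True, True, False]
```

Every FlowFile that arrives while the entry is still cached is flagged, which is the "many files against one stored value" behaviour described above.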

It would be very expensive to build a NiFi processor that reads large batches of queued FlowFiles from an inbound queue in order to compare FlowFile attributes. FlowFile attributes live in heap memory, so the more FlowFiles you pull in for a comparison, the more likely you are to hit an Out Of Memory error. And if you limit the size of the comparison, how do you know a given batch contains the FlowFiles you actually want to compare?

This is why the DetectDuplicate processor uses an external service and compares FlowFiles against a stored value one FlowFile at a time.

Thanks,

Matt

Explorer

Hi @Matt Clarke,

Thank you. So, how about cleaning this cache eventually? I would like to clean the cache whenever a duplicate is found. Is that possible? I'm trying the Eviction Strategy property, but I'm not getting anywhere so far.

Thanks!

Super Mentor

@Jon Rodriguez Breton

There is no dedicated processor for removing cached entries from the DistributedMapCache.

You can try the "Age Off Duration" property of the DetectDuplicate processor, or use a scripting processor in NiFi to run a script that clears the cache.

The following Jira covers this missing processor and also provides a sample template:

https://issues.apache.org/jira/browse/NIFI-4173
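
As a rough picture of what "clean the cache whenever a duplicate is found" would mean, the logic a script might follow can be sketched in Python. This is not the template from the Jira and it does not call any NiFi API: a plain set stands in for the cache, and the function name `check_and_clear` and the key value are made up for the example.

```python
from typing import Set


def check_and_clear(key: str, seen: Set[str]) -> str:
    """Flag a duplicate once, then forget the key so the next match starts over."""
    if key in seen:
        # Duplicate found: remove the cached entry right away.
        seen.discard(key)
        return "duplicate"
    seen.add(key)
    return "non-duplicate"


seen: Set[str] = set()
outcomes = [check_and_clear("42", seen) for _ in range(3)]
print(outcomes)  # ['non-duplicate', 'duplicate', 'non-duplicate']
```

Note the consequence of this design: after the entry is cleared, a third FlowFile with the same value is treated as new again, so it only fits flows where matches are expected in pairs.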