Compare attributes of different flowfiles

Explorer

Hello,

Is it possible to compare the attributes of two different flowfiles and pass one on only if the comparison matches?

Thank you,

Jon

1 ACCEPTED SOLUTION

Re: Compare attributes of different flowfiles

Master Guru

@Jon Rodriguez Breton

Are you trying to see if all attributes from both FlowFiles match exactly, or is there a specific attribute from each FlowFile you want to compare?

My initial thought would be to use the DetectDuplicate processor.

You could write the unique attribute to the DistributedMapCache service.
Then compare new FlowFiles against that stored value and delete any duplicates.
That way only the first FlowFile would get passed on.
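
As a rough illustration of the idea, here is a minimal sketch in plain Python. It is not NiFi code: the `FlowFile` class, the `detect_duplicate` helper, and the `order.id` attribute are made up for the example, and a plain `set` stands in for the DistributedMapCache. The returned labels mirror the processor's duplicate and non-duplicate relationships.

```python
from dataclasses import dataclass


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi FlowFile: just a dict of attributes."""
    attributes: dict


def detect_duplicate(flowfile, cache, key_attribute):
    """Label a FlowFile by whether its attribute value is already cached."""
    key = flowfile.attributes[key_attribute]
    if key in cache:
        return "duplicate"      # value seen before: route away or drop
    cache.add(key)              # first sighting: remember the value
    return "non-duplicate"


cache = set()  # stands in for the external cache service
flowfiles = [
    FlowFile({"order.id": "A1"}),
    FlowFile({"order.id": "B2"}),
    FlowFile({"order.id": "A1"}),  # same value as the first FlowFile
]
routes = [detect_duplicate(ff, cache, "order.id") for ff in flowfiles]
print(routes)  # ['non-duplicate', 'non-duplicate', 'duplicate']
```

Each FlowFile is checked on its own against the stored values, so the two FlowFiles never have to be held side by side.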

Thanks,

Matt

Re: Compare attributes of different flowfiles

Cloudera Employee

@Jon Rodriguez Breton

Have you checked out NiFi's RouteOnAttribute processor? It can compare the attributes of incoming flowfiles and route them according to the routing strategy you select.

Re: Compare attributes of different flowfiles

Explorer

Yes, I've tried to use RouteOnAttribute, but the thing is that I want to compare the attributes of two different flowfiles... and as far as I understand, RouteOnAttribute doesn't allow this kind of comparison... tell me if I'm wrong!

Re: Compare attributes of different flowfiles

Cloudera Employee

Ah, I overlooked the "only pass one" goal in the original question. As @Matt Clarke mentioned, looks like DetectDuplicate might help with that part.

Re: Compare attributes of different flowfiles

Explorer

Hi @Matt Clarke,

I just want to compare some attributes from both flowfiles... I'll try with that processor and I'll be back!

Re: Compare attributes of different flowfiles

Explorer

Hi @Matt Clarke,

That processor did the trick. It's exactly what I was looking for. Thank you so much.

Best,

Jon

Re: Compare attributes of different flowfiles

Master Guru

@Jon Rodriguez Breton

Glad this worked for you.

As far as your new question:

The value written to the DistributedMapCache remains in the cache for a configured amount of time or until a configured maximum number of entries is reached. So you can compare many FlowFiles against this stored value, and any FlowFile that matches a stored value is considered a duplicate. It is not a one-time match of a single duplicate.

It would be very expensive to build a NiFi processor that reads large batches of queued FlowFiles from an inbound queue to compare their attributes. FlowFile attributes live in heap memory, so the more FlowFiles you pull in for a comparison, the more likely you are to hit an Out Of Memory error. And if you limit the size of the batches, how do you know a given batch contains the actual FlowFiles you want to compare?

This is why the DetectDuplicate processor makes use of an external service and compares FlowFiles against a stored value one FlowFile at a time.
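
To make the cache behaviour concrete, here is a small sketch in plain Python. It is a simplified model, not the DistributedMapCache implementation: the `BoundedCache` class and its parameters are hypothetical, time is passed in explicitly so the example is repeatable, and it assumes the oldest entry is dropped first when the cache is full (the real eviction order depends on how the cache server is configured).

```python
from collections import OrderedDict


class BoundedCache:
    """Hypothetical cache that forgets entries by age or when it is full."""

    def __init__(self, max_entries, age_off_seconds):
        self.max_entries = max_entries
        self.age_off_seconds = age_off_seconds
        self._entries = OrderedDict()  # key -> time stored, oldest first

    def _expire(self, now):
        # Drop entries that have been stored for the full age-off window.
        while self._entries:
            key, stored_at = next(iter(self._entries.items()))
            if now - stored_at < self.age_off_seconds:
                break
            self._entries.popitem(last=False)

    def seen_before(self, key, now):
        """Return True if key is cached; otherwise store it and return False."""
        self._expire(now)
        if key in self._entries:
            return True  # the entry stays, so later FlowFiles match too
        self._entries[key] = now
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # assumption: evict the oldest
        return False


cache = BoundedCache(max_entries=2, age_off_seconds=60)
results = [
    cache.seen_before("A1", now=0),   # first sighting
    cache.seen_before("A1", now=10),  # duplicate
    cache.seen_before("A1", now=30),  # still a duplicate, not a one-time match
    cache.seen_before("A1", now=61),  # aged off, so treated as new again
    cache.seen_before("B2", now=62),  # new value
    cache.seen_before("C3", now=63),  # new value, pushes out the oldest entry
    cache.seen_before("A1", now=64),  # evicted above, so treated as new
]
print(results)  # [False, True, True, False, False, False, False]
```

Only one FlowFile's key is examined per call, which is why memory use does not grow with the number of queued FlowFiles.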

Thanks,

Matt

Re: Compare attributes of different flowfiles

Explorer

Hi @Matt Clarke,

Thank you. So, how about cleaning this cache eventually? Is it possible to clean it whenever a duplicate is found? I'm trying with the Eviction Strategy property but I'm not getting anywhere so far... I would like to clear the cached entry whenever a duplicate is found.

Thanks!

Re: Compare attributes of different flowfiles

Master Guru
@Jon Rodriguez Breton

There are no dedicated processors for removing cached entries from the distributed map cache.

You can try using the "Age Off Duration" property in the DetectDuplicate processor, or use a scripting processor in NiFi to execute a script that clears the cache.

The following Jira covers this missing processor and also provides a sample template:

https://issues.apache.org/jira/browse/NIFI-4173
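
For reference, the "clear the entry as soon as a duplicate is found" behaviour being asked about could be modelled like this. This is only a plain Python sketch of the desired logic, not a NiFi script and not the template from the Jira: the `match_once` helper is hypothetical and a `set` stands in for the cache.

```python
def match_once(key, cache):
    """Pair up values: the second sighting is a duplicate and clears the entry."""
    if key in cache:
        cache.discard(key)   # duplicate found: remove the cached value
        return "duplicate"
    cache.add(key)           # first sighting: remember the value
    return "non-duplicate"


cache = set()  # stands in for the external cache
routes = [match_once(key, cache) for key in ["A1", "A1", "A1"]]
print(routes)  # ['non-duplicate', 'duplicate', 'non-duplicate']
```

With this logic the third FlowFile is treated as new again, because the entry was removed when the second one matched.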
