NiFi - Data Provenance is not being created on all processors

Explorer

Hi,

When I try to access data provenance, I see that within the same flow one processor may have no provenance data, while the processor that comes right after it does have provenance data for the same call.

Do some processors just not write data provenance? Or is there some limitation?

Thanks,

Greg


Provenance events do not always map one-to-one with processors: some processors produce multiple events, and others produce none. Is there a specific processor you are noticing this on?

Explorer

Thanks for the answer @Bryan Bende. Are you saying that there are processors that just don't produce data provenance at all? Why don't all processors produce provenance events? Can you give an example?

Is it possible that there is no space left for more provenance data in the NiFi cluster, so that some processors create provenance and others do not, more or less at random?

For example, should the FetchHBaseRow processor always produce a provenance event?

Some processors may not do anything to the flow file, meaning they don't modify the attributes and don't modify the content. An example would be a routing processor like RouteOnAttribute. I believe routing a flow file does not produce events, because otherwise there are tricky cases: when a processor has a self-loop and retries an error many times, you would get all these unnecessary route events back to itself.

Source and destination processors are responsible for generating the send/receive/fetch events themselves because NiFi itself can't detect when a processor talks to an external system. So a processor like FetchHBaseRow should always generate a fetch event, assuming it calls session.getProvenanceReporter().fetch(flowFile, transitUri), which it does.
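As a rough illustration of that explicit reporting, here is a minimal sketch. To keep it runnable without the NiFi API on the classpath, it uses small stand-in types (`ReporterSketch`, `SessionSketch`, `FetchProcessorSketch`) that are hypothetical and are not NiFi classes; only the call shape `session.getProvenanceReporter().fetch(flowFile, transitUri)` comes from the post above, and the transit URI value is made up.

```java
import java.util.ArrayList;
import java.util.List;

class FetchEventDemo {
    // Hypothetical stand-in for NiFi's provenance reporter (not the real interface).
    interface ReporterSketch {
        void fetch(String flowFileId, String transitUri);
    }

    // Hypothetical stand-in for a process session; it only records reported events.
    static class SessionSketch {
        final List<String> events = new ArrayList<>();

        ReporterSketch getProvenanceReporter() {
            // Store each explicitly reported event as a simple string.
            return (flowFileId, transitUri) ->
                    events.add("FETCH " + flowFileId + " " + transitUri);
        }
    }

    static class FetchProcessorSketch {
        // Mirrors the shape of a processor's trigger method: talk to the external
        // system, then report it, because the framework cannot see that call.
        static void onTrigger(SessionSketch session, String flowFileId) {
            String transitUri = "hbase://example-table/row-1"; // made-up URI for illustration
            // ... the external read would happen here ...
            session.getProvenanceReporter().fetch(flowFileId, transitUri);
        }
    }

    public static void main(String[] args) {
        SessionSketch session = new SessionSketch();
        FetchProcessorSketch.onTrigger(session, "flowfile-1");
        System.out.println(session.events);
    }
}
```

If the processor skipped the `fetch(...)` call, the event list in this sketch would stay empty, which is the situation where a processor appears to have no provenance.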

Other processors that sit in the middle of the flow and modify flow files don't have to worry about generating events, because the framework can detect when a flow file is written to or when its attributes are modified, so those events are generated automatically.
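The automatic case can be pictured with a simplified model like the one below. This is only a sketch of the idea, not how NiFi implements it: `FlowFileSketch` and `SessionSketch` are hypothetical stand-ins, and emitting events at `commit()` is an assumption of this model. The event names `CONTENT_MODIFIED` and `ATTRIBUTES_MODIFIED` match NiFi's provenance event types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AutoEventDemo {
    // Hypothetical stand-in for a flow file: attributes plus content.
    static class FlowFileSketch {
        final Map<String, String> attributes = new HashMap<>();
        String content = "";
    }

    // Hypothetical stand-in for a session. Processor code only calls
    // putAttribute/write; the session itself notices the changes.
    static class SessionSketch {
        final List<String> events = new ArrayList<>();
        private boolean attributesChanged = false;
        private boolean contentChanged = false;

        void putAttribute(FlowFileSketch ff, String key, String value) {
            ff.attributes.put(key, value);
            attributesChanged = true;
        }

        void write(FlowFileSketch ff, String newContent) {
            ff.content = newContent;
            contentChanged = true;
        }

        // On commit, emit events for whatever was modified; nothing otherwise.
        void commit() {
            if (contentChanged) events.add("CONTENT_MODIFIED");
            if (attributesChanged) events.add("ATTRIBUTES_MODIFIED");
            contentChanged = false;
            attributesChanged = false;
        }
    }

    public static void main(String[] args) {
        SessionSketch session = new SessionSketch();
        FlowFileSketch ff = new FlowFileSketch();
        session.putAttribute(ff, "source", "demo"); // processor never reports an event itself
        session.commit();
        session.commit(); // nothing changed, so no further event is added
        System.out.println(session.events);
    }
}
```

In this model a processor that touches neither attributes nor content produces no events at all, which matches the behaviour described earlier in the thread.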

There will typically be some delay before the events are searchable and viewable. The provenance repo is made up of a write-ahead log and Lucene indexes. I believe the Lucene indexes are updated at certain checkpoints, roughly every 1-2 minutes, and those indexes are what back the displays in the UI.

New Contributor

I had a case where PutSFTP was not producing events in case of failure.
