NiFi: Best practice to back up data provenance

Hi,

We are planning to use NiFi's Data Provenance over the long term for audits.

What is the best way to configure NiFi for this?

If I "simply" back up the Data Provenance disk content, will I be able to use it later, by re-injecting it into a working NiFi?

Should I use ReportingTaskProcessor? But again, how do you query backed-up data later? By keeping it in a dedicated NiFi used only for backup?

I also did not understand how FlowFile content is managed. Is it supposed to be stored on the Data Provenance disk (which would greatly increase its size...)? Or does the "replay button" in the Data Provenance UI work only if the content is still fresh and present in the "Content repository"?

Or is Data Provenance just not meant to be used for long-term purposes?

Sorry if I am mixing up multiple concepts.

Thanks.

1 ACCEPTED SOLUTION

@Thomas Lebrun

Provenance events are timestamped. While an entire provenance repository can be moved from one NiFi to another without issue, backing up some or all of it and later trying to merge it into an existing provenance repository is not possible.

-

Even taking an entire backed-up provenance repository and placing it in a clean NiFi later would have its challenges. You would need to make sure the provenance retention settings of whichever NiFi receives the backed-up repository extend beyond the age of the oldest event in it; otherwise NiFi would simply purge the expired events on startup.
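
As a rough illustration of that retention check, here is a minimal Python sketch. It assumes the time-based limit is the `nifi.provenance.repository.max.storage.time` property in nifi.properties and that its value looks like "24 hours" or "30 days". The helper names `parse_period` and `retention_covers` are hypothetical, the parser handles only a few unit spellings, and the separate size limit (`nifi.provenance.repository.max.storage.size`) is ignored.

```python
import re
from datetime import datetime, timedelta, timezone

# Seconds per unit for the handful of spellings this sketch understands.
_UNIT_SECONDS = {
    "sec": 1, "second": 1,
    "min": 60, "minute": 60,
    "hr": 3600, "hour": 3600,
    "day": 86400,
    "week": 604800,
}


def parse_period(text: str) -> timedelta:
    """Parse a period such as '24 hours' or '30 days' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+?)s?\s*", text)
    if match is None or match.group(2).lower() not in _UNIT_SECONDS:
        raise ValueError(f"unsupported period: {text!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2).lower()])


def retention_covers(oldest_event: datetime, retention: str, now: datetime) -> bool:
    """Return True if the oldest event is younger than the retention period."""
    return now - oldest_event < parse_period(retention)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    oldest = now - timedelta(days=30)  # hypothetical age of the oldest backed-up event
    print(retention_covers(oldest, "24 hours", now))
    print(retention_covers(oldest, "365 days", now))
```

If the check fails for the target NiFi, the retention period would need to be raised before starting it with the restored repository.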

-

A better option might be to build a dataflow on each of your NiFi instances/clusters that uses the SiteToSiteProvenanceReportingTask to send provenance events to another NiFi, where a dataflow is built to write those events out to the long-term storage or auditing endpoint of your choice. The provenance events output by this reporting task are just JSON.
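
Because the events arrive as plain JSON, whatever sits downstream of the receiving NiFi can archive them with ordinary tooling. The following Python sketch groups a batch of events into JSON Lines partitions by UTC date, one common layout for audit storage. It assumes each batch is a JSON array and that each event carries a `timestampMillis` field; that field name, the other sample fields, and the function `partition_events` are assumptions for illustration, so check them against the actual output of your reporting task.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_events(payload: str) -> dict[str, list[str]]:
    """Group a JSON array of provenance events into JSON Lines keyed by UTC date."""
    partitions = defaultdict(list)
    for event in json.loads(payload):
        # Assumed field: event time in milliseconds since the Unix epoch.
        ts = datetime.fromtimestamp(event["timestampMillis"] / 1000, tz=timezone.utc)
        partitions[ts.strftime("%Y-%m-%d")].append(json.dumps(event, sort_keys=True))
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical sample batch; real events contain many more fields.
    sample = json.dumps([
        {"eventType": "RECEIVE", "timestampMillis": 1704067200000, "componentName": "GetFile"},
        {"eventType": "SEND", "timestampMillis": 1704153600000, "componentName": "PutSFTP"},
    ])
    for day, lines in sorted(partition_events(sample).items()):
        print(day, len(lines))
```

Each partition could then be appended to a per-day file or object in the chosen long-term store, which keeps later audit queries independent of any NiFi instance.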

-

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-site-to-site-reporting-nar/1....

-

Thank you,

Matt

2 REPLIES

FlowFile content is not stored in the provenance repository. The ability to view or replay content only works if the content still exists in the content repository. The content repository can be configured to retain archived content, but keep in mind that the content of active FlowFiles still in dataflows always takes priority over archived content. If active data pushes disk usage past the configured thresholds, all archived content will be purged.
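
To see how a given instance is configured for content archiving, a small Python sketch like the one below can pull the relevant entries out of nifi.properties. It assumes the archive is controlled by the three `nifi.content.repository.archive.*` properties listed in the code; the function name `archive_settings` and the sample values are hypothetical.

```python
# Properties assumed to govern content archiving in nifi.properties.
ARCHIVE_KEYS = (
    "nifi.content.repository.archive.enabled",
    "nifi.content.repository.archive.max.retention.period",
    "nifi.content.repository.archive.max.usage.percentage",
)


def archive_settings(properties_text: str) -> dict[str, str]:
    """Extract the content archive settings from the text of nifi.properties."""
    settings = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a key=value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() in ARCHIVE_KEYS:
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    # Illustrative values only; read your own conf/nifi.properties instead.
    sample = """
    # Content Repository
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%
    """
    for key, value in archive_settings(sample).items():
        print(f"{key} = {value}")
```

Whatever values are set, replay from the Data Provenance UI remains best-effort, since archived content gives way to active FlowFile content.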

Thanks,

Matt
