Questions on NiFi Provenance data - storing and usage

Expert Contributor

Hi guys,

Would appreciate your input on the following NiFi Provenance questions:

  1. What are the other uses of Provenance data (if any) besides metadata, lineage, etc.?
  2. What is the best practice for which NiFi provenance data to keep: store all of it (with possible future use cases in mind), or extract and store only what our current use case needs (currently metadata and lineage)? I ask because we are building the first iteration of our Data Lake and want to follow best practices for the NiFi provenance data we store.
  3. Provenance API versus SiteToSiteProvenanceReportingTask: as I understand it, these are two ways of getting Provenance data out for storage and further processing. Is one preferable over the other for extracting metadata and lineage?

Thank you.

1 ACCEPTED SOLUTION

Master Mentor

@Raj B

1. The main intent of NiFi Provenance is data governance: the ability to look back at the life of a FlowFile. It can tell you where a FlowFile originated, which parent FlowFile it came from, how many parent FlowFiles were used to create it, what changes were made to it, where it was sent, when it was terminated from NiFi, and so on. NiFi Provenance also provides a means to view or replay, at any point in your dataflow, FlowFiles that are no longer anywhere in your dataflow (provided the FlowFile's content still exists in the content repository's archive).

Examples:

- Some downstream system expected to receive file "ABC" over the weekend from NiFi. You can use NiFi's data provenance to see exactly when file "ABC" was received by NiFi and exactly what NiFi did to file "ABC" as it traversed your dataflows.

- A FlowFile "XYZ" was expected to route through your dataflow to some destination "G". Upon searching Provenance, it was discovered that "XYZ" was routed down the wrong path. You could correct your dataflow routing issue and then use data provenance to replay "XYZ" from the point just before the correction.
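A search like the ones in these examples can also be submitted programmatically. Below is a minimal sketch that only builds the JSON body for a provenance query by filename; it does not contact NiFi. The payload shape (`provenance` / `request` / `searchTerms`, with each term wrapped in `value` and `inverse`) is my assumption based on recent NiFi REST API versions, and older releases may expect plain string values, so check the REST API documentation for your version. The function name `build_provenance_query` is hypothetical.

```python
import json


def build_provenance_query(filename, max_results=100):
    """Build a request body for a provenance search by filename.

    Assumption: the body is wrapped as {"provenance": {"request": {...}}}
    and "Filename" is a valid searchable field. Verify both against the
    REST API docs for your NiFi version before relying on this.
    """
    return {
        "provenance": {
            "request": {
                "maxResults": max_results,
                "searchTerms": {
                    # Assumed term format: an object with "value" and "inverse".
                    "Filename": {"value": filename, "inverse": False},
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be submitted to search for file "ABC".
    print(json.dumps(build_provenance_query("ABC"), indent=2))
```

The resulting body would be submitted to the instance's provenance endpoint; the same search can be done interactively from the Data Provenance view in the NiFi UI.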

2. NiFi's Provenance repository retains all Provenance events generated by your dataflow until either the configured retention time or the maximum disk usage is reached. When either limit is reached, the oldest provenance events are deleted first. There is no way to selectively decide which provenance events are retained in the repository. Using the
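The retention behaviour described in point 2 can be pictured with a small model. This is only an illustrative sketch, not NiFi's actual implementation: the `Event` class and `purge` function are hypothetical, and the two limits stand in for the `nifi.provenance.repository.max.storage.time` and `nifi.provenance.repository.max.storage.size` settings in nifi.properties. The real repository ages off data in larger units than single events.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    timestamp: float  # seconds since some reference point
    size_bytes: int


def purge(events, now, max_age_seconds, max_total_bytes):
    """Return the events that survive both retention limits.

    Events older than max_age_seconds are dropped first; then the oldest
    remaining events are dropped until the total size fits the cap.
    No other selection criterion is applied.
    """
    kept = sorted(
        (e for e in events if now - e.timestamp <= max_age_seconds),
        key=lambda e: e.timestamp,
    )
    total = sum(e.size_bytes for e in kept)
    start = 0
    while start < len(kept) and total > max_total_bytes:
        total -= kept[start].size_bytes  # evict the oldest survivor
        start += 1
    return kept[start:]
```

For example, with three 100-byte events at times 0, 50 and 90, a current time of 100, a 60-second age limit and a 150-byte cap, only the newest event survives: the first is too old, and the second is evicted to get under the size cap.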

3. The Provenance API provides a means of running queries directly against the Provenance data stored locally on a particular NiFi instance. The SiteToSiteProvenanceReportingTask provides a way of sending provenance events to another system, for example for longer-term storage. Since provenance events do not contain any FlowFile content, only provenance events stored locally within a NiFi instance can be used to view or replay content.
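Since exported events carry metadata only, they are still enough to reconstruct lineage on the receiving side. Here is a minimal sketch, assuming the events arrive as JSON objects with `parentIds` and `childIds` lists of FlowFile UUIDs (my understanding of the SiteToSiteProvenanceReportingTask output; confirm the field names against your NiFi version). The function names `build_lineage` and `descendants` are hypothetical.

```python
from collections import defaultdict


def build_lineage(events):
    """Map each parent FlowFile UUID to the set of its direct children.

    Assumes each event is a dict that may carry "parentIds" and "childIds"
    lists; events without them (e.g. a plain SEND) add no edges.
    """
    children = defaultdict(set)
    for event in events:
        for parent in event.get("parentIds") or []:
            for child in event.get("childIds") or []:
                children[parent].add(child)
    return dict(children)


def descendants(lineage, root):
    """Return every FlowFile UUID derived, directly or not, from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

Given a fork of "a" into "b" and "c" followed by a join of "b" and "c" into "d", `descendants(build_lineage(events), "a")` yields the set containing "b", "c" and "d". Viewing or replaying the content itself would still have to happen on the NiFi instance that holds it.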

Thanks,

Matt


2 REPLIES

Expert Contributor

Thanks @Matt Clarke