Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Apache Nifi Data Provenance: Limitations and Road Ahead?

avatar
Expert Contributor

Hi

ONE

Firstly I would want to know, where can I find what features will be added in the next NiFi version relating to data provenance.

TWO

What are the current limitations/bottlenecks when it comes to provenance?

THREE

What about provenance beyond NiFi? Meaning after making some transformations in the data I send it to Hadoop for a Map/Reduce job. Later the processed data is ingested back in. As far as I know, at the moment this will be treated as a new ingestion (A RECEIVE event)from NiFis perspective. Is there a way to know that the data I am ingesting is the same as the one that was sent out?

FOUR

Are there any other limitations or problems for which solution may be required. Also how secure is the provenance information? I do see the files inside the provenance_repository folder. Can these files be simply modified to compromise the provenance information.

FIVE

also is it possible to add additional information to the existing provenance information? is it extendable?

I intend to research data provenance at my university. This could include

  1. performance improvement
  2. security
  3. possible solution for a limitation etc
  4. a desired feature

Any help is highly appreciated. You can also point me to any link which can help me better understand the working of the current system.

Regards

Arsalan

1 ACCEPTED SOLUTION

avatar

Hello @Arsalan Siddiqi. These are some excellent questions and thoughts regarding provenance. Let me try to answer them in order.

ONE: The Apache NiFi community can definitely help you with question on specific timing of releases and what will be included. I do know though that there is work underway for around Apache NiFi's provenance repository so that it can index even more event data on a per second basis than it does today. Exactly when this will end up in a release is subject to the normal community process of when the contribution is released and reviewed and merged. That said, there is a lot of interest in having higher provenance indexing rates so I'd expect it to be in an upcoming release.

TWO: The current limitation we generally see is related to what I mention above in ONE. That is we see provenance indexing rate being a bottleneck on overall processing of data because we do cause backpressure to ensure that they backlog of provenance indexing doesn't just grow unbounded while more and more event data is processed. We are first going to make indexing faster. There are other techniques we could try later such as indexing less data which would make indexing far faster at the expense of slower queries. But such a tradeoff might make sense.

THREE: Integration with a system such as Apache Atlas has been shown to be a very compelling combination here. The provenance that NiFi generates plays nicely with the type that Atlas ingests. If we get more and more provenance enabled systems reporting to Apache Atlas then it can be the central place to view such data and get a view of what other systems are doing and thus give that nice system of systems view that people really need. To truly prove lineage across systems there would likely need to be some cryptographically verifiable techniques employed.

FOUR: The provenance data at present is prone to manipulation. In Apache NiFi we have flagged future work to adopt privacy by design features such as those which would help detect manipulated data and we're also looking at solutions to have distributed copies of the data to help with loss of availability as well.

FIVE: It is designed for extension in parts. You can for example create your own implementation of a provenance repository. You can create your own reporting tasks which can harvest data from the provenance repository and send it to other systems as desired. At the moment we don't have it open for creating additional event types. We're intentionally trying to keep the vocabulary small and succinct.

There are so many things left that we can do with this data in Apache NiFi and beyond to take full advantage of what it offers for the flow manager, the systems architect, the security professional, etc.. There is also some great inter and intra systems timing data that can be gleaned from this. Systems like to brag about how fast they are....provenance is the truth teller.

Hope that helps a bit.

Joe

View solution in original post

2 REPLIES 2

avatar
Explorer

Hi Arsalan,

Are you familiar with Atlas lineage ?

avatar

Hello @Arsalan Siddiqi. These are some excellent questions and thoughts regarding provenance. Let me try to answer them in order.

ONE: The Apache NiFi community can definitely help you with question on specific timing of releases and what will be included. I do know though that there is work underway for around Apache NiFi's provenance repository so that it can index even more event data on a per second basis than it does today. Exactly when this will end up in a release is subject to the normal community process of when the contribution is released and reviewed and merged. That said, there is a lot of interest in having higher provenance indexing rates so I'd expect it to be in an upcoming release.

TWO: The current limitation we generally see is related to what I mention above in ONE. That is we see provenance indexing rate being a bottleneck on overall processing of data because we do cause backpressure to ensure that they backlog of provenance indexing doesn't just grow unbounded while more and more event data is processed. We are first going to make indexing faster. There are other techniques we could try later such as indexing less data which would make indexing far faster at the expense of slower queries. But such a tradeoff might make sense.

THREE: Integration with a system such as Apache Atlas has been shown to be a very compelling combination here. The provenance that NiFi generates plays nicely with the type that Atlas ingests. If we get more and more provenance enabled systems reporting to Apache Atlas then it can be the central place to view such data and get a view of what other systems are doing and thus give that nice system of systems view that people really need. To truly prove lineage across systems there would likely need to be some cryptographically verifiable techniques employed.

FOUR: The provenance data at present is prone to manipulation. In Apache NiFi we have flagged future work to adopt privacy by design features such as those which would help detect manipulated data and we're also looking at solutions to have distributed copies of the data to help with loss of availability as well.

FIVE: It is designed for extension in parts. You can for example create your own implementation of a provenance repository. You can create your own reporting tasks which can harvest data from the provenance repository and send it to other systems as desired. At the moment we don't have it open for creating additional event types. We're intentionally trying to keep the vocabulary small and succinct.

There are so many things left that we can do with this data in Apache NiFi and beyond to take full advantage of what it offers for the flow manager, the systems architect, the security professional, etc.. There is also some great inter and intra systems timing data that can be gleaned from this. Systems like to brag about how fast they are....provenance is the truth teller.

Hope that helps a bit.

Joe