
How to add additional provenance information from outside NiFi

Expert Contributor



I want to execute a Spark job from NiFi. At the moment, NiFi does not support capturing provenance information for work done outside NiFi. Is it possible to add information to the existing provenance data? For example, if I have a workflow that ends at a Spark job, can I add further provenance information to what NiFi has already recorded?

I know that Atlas provides some support for cross-component provenance and lineage, but at the moment it does not support Spark. I want to record additional information, such as the job start time and end time, while the Spark job executes. Later, I want to send the results of the Spark job back to NiFi.

Would it be possible to add to the existing NiFi provenance information, so that when the data is ingested back in, I know what happened in Spark?


Super Guru

I think you'd need a custom processor, something like an ExecuteSpark processor, that collects some of the provenance information as metadata and sets it as attributes on the resulting flow file(s). There would be no individual provenance event for Spark per se, but the processor could generate a Receive event, and the lineage would also include the flow file itself, which would carry the Spark provenance metadata as attribute(s).
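
To illustrate the metadata-as-attributes idea, here is a minimal sketch in plain Python, not NiFi or Spark code. It wraps a job, records the start time, end time, duration and status, and returns them as a flat map of string keys to string values, which is the shape flow file attributes take. The function name `run_with_provenance_attributes`, the `spark.job.*` attribute names, and the `job` callable standing in for the actual Spark submission are all hypothetical. A custom processor would still have to copy the map onto the result flow file and emit its own provenance event.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Tuple


def run_with_provenance_attributes(
    job: Callable[[], Any], job_name: str
) -> Tuple[Any, Dict[str, str]]:
    """Run `job` and return (result, attributes).

    `job` is a stand-in for whatever submits the Spark job (hypothetical).
    The attribute names below are illustrative, not a NiFi convention.
    """
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    status = "SUCCEEDED"
    result = None
    try:
        result = job()
    except Exception as exc:  # record the failure instead of losing the timing data
        status = "FAILED:" + type(exc).__name__
    ended = datetime.now(timezone.utc)
    duration_ms = int((time.monotonic() - t0) * 1000)

    # Flow file attributes are string key/value pairs, so stringify everything.
    attributes = {
        "spark.job.name": job_name,
        "spark.job.start.time": started.isoformat(),
        "spark.job.end.time": ended.isoformat(),
        "spark.job.duration.ms": str(duration_ms),
        "spark.job.status": status,
    }
    return result, attributes


if __name__ == "__main__":
    # Usage: the lambda stands in for a real Spark submission.
    result, attrs = run_with_provenance_attributes(lambda: sum(range(10)), "demo-job")
    for key, value in sorted(attrs.items()):
        print(key, "=", value)
```

Because the timing data travels as attributes on the flow file rather than as a separate provenance event, it shows up wherever the flow file's attributes are shown in the lineage, which matches the limitation described above.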
