I'm looking at designing a data integration/transformation layer for a large company. The layer would take data from a data lake (most probably S3) and transform it through a series of steps, using tools like Hive and Spark, before serving it into an analytics layer, most likely Redshift + Jupyter.
It is a regulated industry, so any solution needs to be enterprise-grade, and the data provenance features of NiFi are very appealing. What concerns me is whether this use case is a good fit for NiFi in terms of the content repository. Say, for example, we read several S3 files in Parquet format and want to offload their processing to something like Spark: what would get stored in the NiFi content repository? I don't think it would be wise to pull all of the content from S3 into the NiFi content repository, so I am wondering whether it is possible to define a data flow that doesn't bring the raw data onto the NiFi node at all.
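For context, the kind of pattern I have in mind is one where the flow only carries references (S3 URIs) and delegates the heavy lifting to an external Spark cluster, e.g. via a Livy-style REST submission from NiFi (say, with an InvokeHTTP processor). A minimal sketch of the request body such a flow might build; the bucket names, job file, and Livy host are all hypothetical:

```python
import json


def build_spark_batch_payload(input_paths, output_path):
    """Build a Livy /batches request body that passes S3 URIs to a
    Spark job as arguments, so only the references (not the Parquet
    content) ever pass through the orchestrating flow."""
    return {
        "file": "s3://my-bucket/jobs/transform.py",  # hypothetical job script
        "args": [",".join(input_paths), output_path],
        "conf": {"spark.executor.memory": "4g"},
    }


payload = build_spark_batch_payload(
    ["s3://lake/raw/part-0001.parquet", "s3://lake/raw/part-0002.parquet"],
    "s3://lake/curated/",
)
body = json.dumps(payload)
# An InvokeHTTP processor (or requests.post) would POST `body` to
# something like http://livy-host:8998/batches; the flowfile itself
# stays tiny, since it holds only paths, not data.
```

If this is a reasonable pattern, the content repository would only ever hold these small JSON payloads plus S3 listing metadata, which is what I'm hoping for.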