
Best practice around provenance and ingesting large files


Hi, what are the recommended approaches for handling the following scenario? NiFi is ingesting a large number of files (say, pulled from a remote system into the flow), and we care about each file as a whole only, so the FlowFile content is the file itself, with no further splits or row-by-row processing. File sizes can vary from a few MB to several GB, which is not a problem in itself, but what happens when millions of files are ingested this way? Say, they end up in HDFS at the end of the dataflow.

Given that file content is recorded in the content repository to enable data provenance, disk space may become an issue. Is there any way to control purge/expiration at a more fine-grained level than the instance-wide journal settings?
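
For context, by "instance-wide settings" I mean the retention knobs in nifi.properties, which apply to the whole repository rather than to any particular flow. A sketch of the relevant entries (the values here are illustrative, roughly the shipped defaults):

    # content claims no longer referenced by active FlowFiles are archived, then purged
    nifi.content.repository.archive.enabled=true
    # archived content is purged after this period, or sooner if disk usage passes the threshold
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%
    # provenance events are aged off by time or total size, whichever limit is hit first
    nifi.provenance.repository.max.storage.time=24 hours
    nifi.provenance.repository.max.storage.size=1 GB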

1 ACCEPTED SOLUTION

2 REPLIES



Thanks Joe. As I understand it, in this scenario we could leave the provenance and FlowFile repositories on local disks (regular application server sizing), but for the content repository we could mount a big SAN/NAS/you-name-it volume and configure HDF to point to that.
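
Concretely, something along these lines in nifi.properties (the mount points are hypothetical; additional content directories can be added with arbitrary property name suffixes to spread content across volumes):

    # FlowFile and provenance repositories stay on local disks
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.provenance.repository.directory.default=./provenance_repository
    # content repository pointed at the big mounted volume(s)
    nifi.content.repository.directory.default=/mnt/bigsan/content_repository
    nifi.content.repository.directory.content2=/mnt/bigsan2/content_repository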

Are expiration policies configurable per repository in that case?