Best practice around provenance and ingesting large files

Hi, what are the recommended approaches for handling the following scenario? NiFi is ingesting lots of files (say, pulled from a remote system into the flow), and we care only about each file as a whole, so the flowfile content is the file, with no further splits or row-by-row processing. File sizes can vary from a few MB to GBs, which is not a problem in itself, but what happens when millions of files are ingested this way? Say they end up in HDFS at the end of the dataflow.

Given that file content will be recorded in the content repository to enable data provenance, disk space may become an issue. Is there any way to control purge/expiration at a finer-grained level than the instance-wide journal settings?

1 ACCEPTED SOLUTION

Andrew - great question. There are three main repositories in NiFi to consider, and you want to provision the amount of space each can use on the system according to your goals. These are the 'flowfile repo', the 'content repo', and the 'provenance repo'. It is a best practice to put them on OS-recognized partitions, and ideally those partitions map to physically distinct resources.
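
As a rough way to check the partition layout described above, here is a minimal Python sketch. It assumes you pass in the three repository directories yourself (the helper names are hypothetical, not part of NiFi), and it only compares the device ID the OS reports for each path. A distinct device ID suggests a separate partition or mount, but it does not prove the partitions sit on physically separate disks.

```python
import os


def group_by_device(paths):
    """Group directory paths by the device ID (st_dev) the OS reports for them."""
    groups = {}
    for path in paths:
        device_id = os.stat(path).st_dev
        groups.setdefault(device_id, []).append(path)
    return groups


def repos_on_distinct_devices(paths):
    """Return True only if every path reports a different device ID."""
    return len(group_by_device(paths)) == len(paths)
```

You would call `repos_on_distinct_devices([...])` with the flowfile, content, and provenance repository directories configured for your instance; a `False` result means at least two of them share a partition.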

This is also helpful because the amount of provenance data you can keep, the amount of content you can keep, and the time span each covers can vary and fluctuate independently of each other. We remove items from the content repository using a fairly simple oldest-first model, as needed to meet the configuration goals.
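
To make the oldest-first idea concrete, here is a toy Python model. It is only an illustration under simplifying assumptions, not NiFi's actual implementation: the `ArchivedClaim` type, the `purge_oldest_first` function, and the single byte quota are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ArchivedClaim:
    name: str
    size_bytes: int
    archived_at: float  # timestamp in seconds; smaller means older


def purge_oldest_first(claims, max_bytes):
    """Drop the oldest claims until the total size fits within max_bytes.

    Returns a (kept, purged) pair of lists, both ordered oldest to newest.
    """
    queue = deque(sorted(claims, key=lambda c: c.archived_at))
    total = sum(c.size_bytes for c in queue)
    purged = []
    while queue and total > max_bytes:
        oldest = queue.popleft()
        total -= oldest.size_bytes
        purged.append(oldest)
    return list(queue), purged
```

Note that in this model a single large object can force out older small objects before it is purged itself, because only age decides the order and size plays no part in it.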

So would you like the ability to keep large objects from taking up too much of the content repository/archive quota, so that you can ideally keep some of the smaller objects longer?

2 REPLIES

Thanks Joe. As I understand it, in this scenario we could leave the provenance and flowfile repos on local disks (regular application server sizing), but for content we could mount a big fat SAN/NAS/you-name-it and configure HDF to point to it.

Are expiration policies configurable per-repository in that case?