I am maintaining a log of processed serial numbers in a file on disk. When I run the flow again, I need to match the serial number attribute on each NiFi FlowFile against the list of serial numbers stored in that file.
1) I fetch the file containing the list of serial numbers from disk using the FetchFile processor.
2) I use the RouteOnContent processor to search the file for the serial number attribute and route onward any serial number that is not present in the file.
3) I auto-terminate the matched ones, since they have already been processed.
This procedure requires me to read the file containing the list of serial numbers for every FlowFile, which in turn causes memory to bloat. Is there a way to read the disk file only once and match all FlowFile attributes against the same "static" instance of the disk file?
You may want to look into using the DetectDuplicate processor and a cache server to check for duplicate serial numbers.
One option is a flow that extracts the serial number from the incoming data and places it in an attribute on the FlowFile. The FlowFile is then passed to a DetectDuplicate processor, which checks that attribute against a distributed cache server. If the serial number is already in the cache, the FlowFile is dropped as a duplicate; if it is a new serial number, it is added to the cache and the FlowFile is routed further along the dataflow. If you still want to maintain a file on disk of all the serial numbers, you can take the serial number from the non-duplicate FlowFiles and append it to the file on disk just as you do now.
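To make the duplicate-check logic concrete, here is a minimal sketch in plain Python. It is not NiFi code and does not use any NiFi API; it only illustrates the idea of loading the known serial numbers once, checking each incoming serial against that in-memory set, and appending new ones to the log. The function names (load_seen, route, process) and the one-serial-per-line file layout are assumptions made for illustration.

```python
from pathlib import Path


def load_seen(path):
    """Read the serial-number log once into a set (assumes one serial per line)."""
    p = Path(path)
    if not p.exists():
        return set()
    return {line.strip() for line in p.read_text().splitlines() if line.strip()}


def route(serial, seen):
    """Return 'duplicate' if the serial is already known; otherwise
    record it in the set and return 'non-duplicate'."""
    if serial in seen:
        return "duplicate"
    seen.add(serial)
    return "non-duplicate"


def process(serials, log_path):
    """Route each serial and append the new ones to the log file on disk."""
    seen = load_seen(log_path)  # the file is read exactly once per run
    passed = []
    with open(log_path, "a") as log:
        for serial in serials:
            if route(serial, seen) == "non-duplicate":
                log.write(serial + "\n")
                passed.append(serial)
    return passed
```

In the NiFi flow, the set plays the role of the cache server and route plays the role of DetectDuplicate: the lookup cost no longer depends on re-reading the file for every FlowFile.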