NiFi MergeContent behavior when Correlation Attribute Name, min # of entries and Max bin age are set

Expert Contributor

Hi,

I'd like to know how NiFi's MergeContent processor behaves when all three of these properties are set: 1) Correlation Attribute Name, 2) Minimum Number of Entries, and 3) Max Bin Age.

In my use case, the Correlation Attribute Name is the flowfile date (without the timestamp), since I want to merge files that are from the same day. I set the minimum number of entries because the flowfiles arrive at varying intervals throughout the day and I want a similar number of flowfiles in each merged file. At midnight, when the day rolls over, the last bin for that day may never fill because of these two criteria (same-day files and minimum number of entries).

Does Max Bin Age act as an override when the other two conditions are not satisfied, and tell MergeContent to create a merged file with whatever files are in the bin? I think so, but I wanted to confirm.

Thank you.

1 ACCEPTED SOLUTION

Super Mentor
@Raj B

You can think of "Max Bin Age" as the trump card. Regardless of whether any of the other minimum criteria have been met, the bin will be merged once it reaches this max age. So your assumption is completely correct.
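To make the override concrete, here is a minimal sketch of the decision described above. It is a simplified model for illustration only, not NiFi's actual implementation: the function name and parameters are hypothetical, and it looks only at the entry count and bin age, ignoring the size-based and maximum thresholds that MergeContent also supports.

```python
def bin_ready(entry_count, bin_age_seconds, min_entries, max_bin_age_seconds):
    """Simplified, hypothetical model of when a bin gets merged.

    Considers only Minimum Number of Entries and Max Bin Age.
    """
    # Normal case: the bin has collected enough entries.
    if entry_count >= min_entries:
        return True
    # Override: the bin is too old, so merge whatever it holds.
    if bin_age_seconds >= max_bin_age_seconds:
        return True
    return False


# A bin holding 37 of the required 100 entries is not merged early...
print(bin_ready(37, bin_age_seconds=60, min_entries=100, max_bin_age_seconds=3600))
# ...but it is merged once it reaches the max bin age.
print(bin_ready(37, bin_age_seconds=3600, min_entries=100, max_bin_age_seconds=3600))
```

The first call prints `False` and the second prints `True`, which is the "trump card" behaviour: age alone is enough to release an under-filled bin.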

That aside, you need to take heap usage into consideration with this dataflow design. FlowFile attributes (metadata) live in heap memory for performance reasons. So as you bin these FlowFiles throughout the day, your JVM heap usage is going to keep growing. How many FlowFiles per day are you talking about here?

If you are talking about more than 10,000 FlowFiles, you may need to adjust your dataflow somewhat. For example, use two MergeContent processors back to back. The first merges with, let's say, a max bin age of 5 minutes. The second then merges those bundles into one large 24-hour bundle. So one new FlowFile is created every 5 minutes, and those 288 merged FlowFiles (24 hours x 12 per hour) are then merged into a larger FlowFile by the second MergeContent. Doing it this way greatly reduces heap usage. Of course, depending on volumes, you may need to merge even more often than every 5 minutes to achieve optimal heap usage.
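The arithmetic behind the two-stage suggestion can be sketched as follows. This is a rough estimate only: the daily volume of 100,000 FlowFiles is a made-up example, arrivals are assumed to be spread evenly across the day, and the function name is hypothetical.

```python
import math

MINUTES_PER_DAY = 24 * 60


def two_stage_estimate(flowfiles_per_day, first_stage_minutes=5):
    """Rough bin sizes for a two-stage merge, assuming evenly spread arrivals."""
    # Number of bundles the first MergeContent emits per day.
    bundles_per_day = MINUTES_PER_DAY // first_stage_minutes
    # Approximate number of FlowFiles sitting in one first-stage bin.
    per_first_stage_bin = math.ceil(flowfiles_per_day / bundles_per_day)
    return bundles_per_day, per_first_stage_bin


bundles, per_bin = two_stage_estimate(100_000)  # hypothetical daily volume
print(bundles, per_bin)  # 288 348
```

Under these assumptions the first stage holds a few hundred FlowFiles at a time and the second holds at most 288 bundles, rather than a single bin tracking all 100,000 FlowFiles for the whole day.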

Just some food for thought....

Matt


2 REPLIES 2

Expert Contributor

Thanks @Matt for confirming.

Sorry for not explaining it more clearly. My use case is to merge flowfiles that are from the same day, but I have both the maximum and minimum number of entries set to 100, since I want to merge every 100 incoming flowfiles into a new merged file; these files are small, less than 5 KB each. So I'm not trying to merge all of the day's flowfiles into a single file.
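For illustration, the setup described here (correlate by date, minimum and maximum entries both 100, with Max Bin Age releasing the leftover bin) can be modelled roughly like this. It is a toy simulation, not NiFi code: the function name and the sample dates are hypothetical, and the final flush simply stands in for bins that age out.

```python
from collections import defaultdict


def merge_by_day(flowfile_dates, entries_per_bundle=100):
    """Toy model: one bin per date, merged whenever it reaches 100 entries."""
    bins = defaultdict(int)  # date -> number of flowfiles currently in the bin
    bundles = []             # (date, entry count) for each merged file
    for day in flowfile_dates:
        bins[day] += 1
        if bins[day] == entries_per_bundle:
            # Min and max entries are both met, so the bin is merged.
            bundles.append((day, entries_per_bundle))
            bins[day] = 0
    # Stand-in for Max Bin Age: leftover partial bins are merged as they are.
    for day, count in bins.items():
        if count:
            bundles.append((day, count))
    return bundles


# Hypothetical input: 250 flowfiles dated one day, then 100 dated the next.
incoming = ["2017-03-01"] * 250 + ["2017-03-02"] * 100
print(merge_by_day(incoming))
```

In this model the first day produces two full bundles of 100 plus a final bundle of 50, which is the partial bin that only Max Bin Age would release after midnight.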