
MergeRecord generates multiple files

Contributor

Hello,

I'm trying to merge over 20,000 Parquet files into one file with the MergeRecord processor,
but it generates 7 files.
This does not seem to be caused by the bin size, bin age, or number-of-records limits.

Looking into the generated files, it appears that similar log records end up in the same file, so I assume the MergeRecord processor inspects the source record content and determines which target file each record should be merged into.

Is this expected behavior?
I have not set a value in the Correlation Attribute Name field.

Thanks,

[Screenshot attached: スクリーンショット 2025-01-16 173229.png]


4 REPLIES

Master Mentor

@tono425 

When the MergeRecord processor executes, it allocates FlowFiles from the inbound connection to bins. At the end of that execution, it determines whether any of the bins are eligible to be merged. Since you have Minimum Number of Records set to 1, a bin will merge even if it only has 1 record in it. Understand that the merge processor will not wait for the maximum settings to be reached.

Try setting your "Minimum Number of Records" to 20000 and your "Max Bin Age" to some value like 5 minutes. (When using the Bin-Packing Algorithm, the Max Bin Age controls how long MergeRecord will wait for a bin to reach the configured minimums before forcing the merge with fewer records.)
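To illustrate how these two settings interact, here is a minimal Python sketch (not NiFi source code; the class, function names, and thresholds are made up for illustration) of the bin-eligibility rule described above: a bin merges as soon as it reaches the minimum record count, or earlier only when the Max Bin Age expires.

```python
import time

# Illustration only: a simplified model of MergeRecord's bin-eligibility check.
# Names and thresholds are hypothetical, not NiFi internals.

class Bin:
    def __init__(self):
        self.record_count = 0
        self.created_at = time.monotonic()

    def add(self, records_in_flowfile):
        self.record_count += records_in_flowfile

    def age_seconds(self):
        return time.monotonic() - self.created_at


def is_eligible_to_merge(b, min_records, max_bin_age_seconds):
    # A bin merges as soon as it reaches the configured minimum...
    if b.record_count >= min_records:
        return True
    # ...or when Max Bin Age expires, even with fewer records than the minimum.
    if max_bin_age_seconds is not None and b.age_seconds() >= max_bin_age_seconds:
        return True
    return False


# With "Minimum Number of Records" = 1, the bin is eligible as soon as a single
# FlowFile arrives, so many small merges occur on a constantly running flow.
b = Bin()
b.add(350)  # one queued FlowFile containing 350 records
print(is_eligible_to_merge(b, min_records=1, max_bin_age_seconds=300))      # True

# With min_records = 20000, the same bin keeps filling across executions and
# only merges early once it has sat longer than the 5-minute Max Bin Age.
print(is_eligible_to_merge(b, min_records=20000, max_bin_age_seconds=300))  # False
```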

Also be mindful of the number of FlowFiles it takes to make up the 20000 records you are trying to merge, since a single FlowFile can contain anywhere from 1 to many records.

Also keep in mind that if you are running a NiFi cluster, each node can only merge FlowFiles located on that same node. The merge processor will not merge FlowFiles across nodes.

Please help our community thrive. If any of the suggestions or solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

Contributor

@MattWho 
Thank you for your advice.
I already tried increasing "Minimum Number of Records" and "Max Bin Age", but it didn't resolve the problem.
As a test, I placed a second MergeRecord processor after the first one, and the second processor generated the same number of outputs as inputs. So I'm under the impression that MergeRecord checks the input records and determines whether they can be merged.

When I use the MergeContent processor instead of MergeRecord, it generates 1 output file as I expected,
so I am wondering where this difference comes from.

Thanks,

Master Mentor (Accepted Solution)

@tono425 

I assumed all your records were of the same schema. With MergeRecord, a bin will consist of potentially many 'like FlowFiles'. For two FlowFiles to be considered 'like FlowFiles', they must have the same schema (as identified by the Record Reader). If a FlowFile is not like the other FlowFiles already allocated to a bin, it will be allocated to a different bin. I would still recommend against setting the minimum record count to 1, since a typical dataflow is a constant stream of new FlowFiles and the MergeRecord processor only sees the FlowFiles queued at the exact moment of execution. With a constantly running dataflow, this can result in merged FlowFiles containing fewer records than expected.
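As a rough illustration of the 'like FlowFiles' rule, the sketch below groups incoming FlowFiles by their record schema before binning; the number of merged outputs then tracks the number of distinct schemas, not the total FlowFile count. The file names, schema strings, and structure here are invented for the example, not taken from your data.

```python
from collections import defaultdict

# Illustration only: bin FlowFiles by record schema, mirroring the
# "like FlowFiles" rule described above.

flowfiles = [
    {"name": "log_000001.parquet", "schema": "ts:long, level:string, msg:string"},
    {"name": "log_000002.parquet", "schema": "ts:long, level:string, msg:string"},
    {"name": "log_000003.parquet", "schema": "ts:long, host:string, bytes:long"},
    # ... in the real flow, 20,000+ FlowFiles with a handful of distinct schemas
]

bins = defaultdict(list)
for ff in flowfiles:
    # FlowFiles with the same schema (as identified by the Record Reader)
    # land in the same bin; a different schema starts a different bin.
    bins[ff["schema"]].append(ff["name"])

# Each bin merges into its own output FlowFile, so 7 distinct schemas in the
# source data would explain ending up with 7 merged files.
print(f"{len(flowfiles)} inputs -> {len(bins)} merged outputs")
for schema, names in bins.items():
    print(schema, "->", len(names), "FlowFiles")
```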

Please help our community thrive. If any of the suggestions or solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

Contributor

@MattWho 
Thank you for the explanation.
Now I understand that MergeRecord uses the schema information to determine which output file each FlowFile is merged into.
I'll consider increasing "Minimum number of records" as you recommended.

Thanks,