
MergeRecord generates multiple files

Contributor

Hello,

I'm trying to merge over 20,000 Parquet files into one file with the MergeRecord processor,
but it generates 7 files.
This does not seem to be caused by the bin size, bin age, or number-of-records limits.

Looking into the generated files, it appears that similar log records end up in the same file, so I assume the MergeRecord processor inspects the source record content and determines which target file each record should be merged into.

Is this expected behavior?
I have not set a value in the Correlation Attribute Name field.

Thanks,

[Screenshot attached: スクリーンショット 2025-01-16 173229.png]


4 REPLIES

Master Mentor

@tono425 

When the MergeRecord processor executes, it allocates FlowFiles from the inbound connection to bins. At the end of that execution, it determines whether any of the bins are eligible to be merged. Since you have Minimum Number of Records set to 1, a bin will merge even if it only has 1 record in it. Understand that the merge processor will not wait for the maximum settings to be reached.

Try setting your "Minimum Number of Records" to 20000 and your "Max Bin Age" to some value like 5 minutes. (When using the Bin-Packing Algorithm, the Max Bin Age controls how long MergeRecord will wait for a bin to reach the configured minimums before forcing the merge with fewer records.)
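To illustrate how these two settings interact, here is a minimal Python sketch (not NiFi source code; the class, function names, and thresholds are made up for illustration) of the bin-eligibility rule described above: a bin merges as soon as it reaches the minimum record count, or earlier only when the Max Bin Age expires.

```python
import time

# Illustration only: a simplified model of MergeRecord's bin-eligibility check.
# Names and thresholds are hypothetical, not NiFi internals.

class Bin:
    def __init__(self):
        self.record_count = 0
        self.created_at = time.monotonic()

    def add(self, records_in_flowfile):
        self.record_count += records_in_flowfile

    def age_seconds(self):
        return time.monotonic() - self.created_at


def is_eligible_to_merge(b, min_records, max_bin_age_seconds):
    # A bin merges as soon as it reaches the configured minimum...
    if b.record_count >= min_records:
        return True
    # ...or when Max Bin Age expires, even with fewer records than the minimum.
    if max_bin_age_seconds is not None and b.age_seconds() >= max_bin_age_seconds:
        return True
    return False


# With "Minimum Number of Records" = 1, the bin is eligible as soon as a single
# FlowFile arrives, so many small merges occur on a constantly running flow.
b = Bin()
b.add(350)  # one queued FlowFile containing 350 records
print(is_eligible_to_merge(b, min_records=1, max_bin_age_seconds=300))      # True

# With min_records = 20000, the same bin keeps filling across executions and
# only merges early once it has sat longer than the 5-minute Max Bin Age.
print(is_eligible_to_merge(b, min_records=20000, max_bin_age_seconds=300))  # False
```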

Also be mindful of the number of FlowFiles it takes to make up the 20000 records you are trying to merge, since a single FlowFile can contain anywhere from 1 to many records.

Also keep in mind that if you are running a NiFi cluster, each node can only merge FlowFiles located on that same node. The merge processor will not merge FlowFiles across nodes.

Please help our community thrive. If any of the suggestions or solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

Contributor

@MattWho 
Thank you for your advice.
I already tried increasing "Minimum Number of Records" and "Max Bin Age", but it didn't resolve the problem.
As a test, I placed a second MergeRecord processor after the first one, and the second processor generated the same number of outputs as inputs. So I'm under the impression that MergeRecord checks the input records and determines whether they can be merged.

When I use the MergeContent processor instead of MergeRecord, it generates 1 output file as I expected,
so I am wondering where this difference comes from.

Thanks,

Master Mentor (Accepted Solution)

@tono425 

I assumed all your records were of the same schema. With MergeRecord, a bin will consist of potentially many 'like FlowFiles'. For two FlowFiles to be considered 'like FlowFiles', they must have the same schema (as identified by the Record Reader). If a FlowFile is not like the other FlowFiles already allocated to a bin, it will be allocated to a different bin. I would still recommend against setting the minimum record count to 1, since a typical dataflow is a constant stream of new FlowFiles and the MergeRecord processor only sees the FlowFiles queued at the exact moment of execution. With a constantly running dataflow, this can result in merged FlowFiles containing fewer records than expected.
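As a rough illustration of the 'like FlowFiles' rule, the sketch below groups incoming FlowFiles by their record schema before binning; the number of merged outputs then tracks the number of distinct schemas, not the total FlowFile count. The file names, schema strings, and structure here are invented for the example, not taken from your data.

```python
from collections import defaultdict

# Illustration only: bin FlowFiles by record schema, mirroring the
# "like FlowFiles" rule described above.

flowfiles = [
    {"name": "log_000001.parquet", "schema": "ts:long, level:string, msg:string"},
    {"name": "log_000002.parquet", "schema": "ts:long, level:string, msg:string"},
    {"name": "log_000003.parquet", "schema": "ts:long, host:string, bytes:long"},
    # ... in the real flow, 20,000+ FlowFiles with a handful of distinct schemas
]

bins = defaultdict(list)
for ff in flowfiles:
    # FlowFiles with the same schema (as identified by the Record Reader)
    # land in the same bin; a different schema starts a different bin.
    bins[ff["schema"]].append(ff["name"])

# Each bin merges into its own output FlowFile, so 7 distinct schemas in the
# source data would explain ending up with 7 merged files.
print(f"{len(flowfiles)} inputs -> {len(bins)} merged outputs")
for schema, names in bins.items():
    print(schema, "->", len(names), "FlowFiles")
```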

Please help our community thrive. If any of the suggestions or solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

Contributor

@MattWho 
Thank you for the explanation.
Now I understand that MergeRecord uses the schema information to determine which output file each FlowFile is merged into.
I'll consider increasing "Minimum number of records" as you recommended.

Thanks,