Created 12-26-2017 05:57 PM
I need to extract data from a relational database and load it into an S3 bucket. I have a 5-node cluster and use a "GenerateTableFetch" (primary node) --> "ExecuteSQL" (all nodes) combination to read the data in parallel. I also need to merge the extracted data into a single file before loading it into S3, but my "MergeContent" processor produces multiple files in S3. Is there a way to get this done? The full flow looks like this:
"GenerateTableFetch" --> "ExecuteSQL" --> "MergeContent" --> "ConvertAvroToJSON" --> "UpdateAttribute" --> "CompressContent" --> "PutS3Object"
Created 12-27-2017 02:38 PM
Thank you - after some tweaking and tuning of the parameters you mentioned I was able to achieve desired results.
Alex
Created 12-27-2017 01:01 AM
Can you please share more details about the configuration of your MergeContent processor?
Refer to the community links below on how to configure the MergeContent processor.
Created on 12-27-2017 03:10 AM - edited 08-18-2019 02:46 AM
You need to change Minimum Group Size to suit your requirement (for example 1 B, 1 KB, 1 MB, 1 GB).
Example:
In my configuration I set Minimum Group Size to 10 MB (the minimum size for the bundle).
Suppose each of your flowfiles is 1 MB. The processor will wait until the group size reaches 10 MB and then bundle the flowfiles into one (that is, 10 flowfiles are merged into 1 flowfile by the MergeContent processor).
If the flowfiles do not meet the minimum group size, they wait in the queue before the MergeContent processor until the minimum group size is reached.
How do you force the flowfiles to merge?
By specifying the Max Bin Age property:
No matter how many Flowfiles have been assigned to a given bin, that bin will be merged once the bin has existed for this amount of time.
Suppose I set Max Bin Age to 10 min, the queue before the MergeContent processor holds only 5 flowfiles totalling 5 MB, and Minimum Group Size is 10 MB.
On its own, the queue would never meet the minimum group size, so the flowfiles would stay queued forever. To overcome this we set Max Bin Age to 10 min: once the bin has existed for 10 minutes, the processor merges the flowfiles even though they have not met the minimum group size.
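The interaction between Minimum Group Size and Max Bin Age described above can be sketched in a few lines of Python. This is only a simplified model of the decision, not NiFi's actual implementation; the `Bin` class, the `should_merge` function and the numeric clock are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024


@dataclass
class Bin:
    """Hypothetical stand-in for a MergeContent bin."""
    created_at: float                           # time the bin was created, in seconds
    sizes: list = field(default_factory=list)   # size in bytes of each queued flowfile

    def total_size(self) -> int:
        return sum(self.sizes)


def should_merge(bin_, now, min_group_size, max_bin_age=None):
    """Return True when the bin would be merged in this simplified model."""
    # Normal case: the bin has reached the minimum group size.
    if bin_.total_size() >= min_group_size:
        return True
    # Forced case: the bin is older than Max Bin Age, whatever its size.
    if max_bin_age is not None and now - bin_.created_at >= max_bin_age:
        return True
    return False


# The example from the post: 5 flowfiles of 1 MB each, Minimum Group Size 10 MB.
small_bin = Bin(created_at=0.0, sizes=[1 * MB] * 5)

# After 1 minute the bin is neither big enough nor old enough.
print(should_merge(small_bin, now=60.0, min_group_size=10 * MB, max_bin_age=600.0))   # False
# After 10 minutes Max Bin Age forces the merge.
print(should_merge(small_bin, now=600.0, min_group_size=10 * MB, max_bin_age=600.0))  # True
# Without Max Bin Age the bin would wait indefinitely.
print(should_merge(small_bin, now=1e9, min_group_size=10 * MB))                       # False
```

In this model a bin of ten 1 MB flowfiles merges immediately, while the 5 MB bin is released only by the age check, which matches the behaviour described in the post.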
For all the other properties of the MergeContent processor, please refer to the links I mentioned in the answer above.
Let me know if you have any questions.
Created 12-27-2017 01:38 AM
Created 12-27-2017 01:47 AM
1. Change the amount and delay of the merge.
2. You can add an EnforceOrder processor (on the primary node only).
3. Set the FirstInFirstOutPrioritizer on all connections.
Created on 12-06-2024 04:40 AM - edited 12-06-2024 05:34 AM
I was having a similar issue: I had 7000+ flowfiles and needed to merge them into one. Since I didn't know the exact number of flowfiles in advance, I couldn't specify that number in the Minimum Number of Entries property of the MergeContent processor to get exactly 1 flowfile after the merge.
After trying a lot of things, including counting the flowfiles and inducing a delay before sending them off to merge, I finally found the solution in the MergeContent documentation below:
And there you go! Setting the 'Merge Strategy' property to 'Defragment' and the 'Attribute Strategy' property to 'Keep Only Common Attributes' fixed the problem for me.
All my flowfiles have common attributes, and they are now merged into a single flowfile.
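For anyone wondering why Defragment does not need the count up front: as far as I understand the MergeContent documentation, this strategy relies on the `fragment.identifier`, `fragment.index` and `fragment.count` attributes already carried by each flowfile, and a group is released only once all of its fragments have arrived. Below is a rough Python sketch of that idea. It is not NiFi code; the `defragment` function and the dictionary layout of a flowfile are assumptions made for illustration.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Merge flowfiles per fragment.identifier once every fragment has arrived.

    Each flowfile is modelled (hypothetically) as a dict with an
    "attributes" dict and a bytes "content".
    """
    bins = defaultdict(dict)   # identifier -> {fragment index: content}
    merged = {}                # identifier -> merged content

    for ff in flowfiles:
        attrs = ff["attributes"]
        ident = attrs["fragment.identifier"]
        bins[ident][int(attrs["fragment.index"])] = ff["content"]

        # The bin is complete when it holds fragment.count fragments.
        if len(bins[ident]) == int(attrs["fragment.count"]):
            parts = bins.pop(ident)
            # Reassemble in fragment.index order, not arrival order.
            merged[ident] = b"".join(parts[i] for i in sorted(parts))

    return merged


def fragment(ident, index, count, content):
    """Helper that builds one hypothetical flowfile."""
    return {
        "attributes": {
            "fragment.identifier": ident,
            "fragment.index": str(index),
            "fragment.count": str(count),
        },
        "content": content,
    }


# Three fragments of one batch arriving out of order.
incoming = [
    fragment("batch-1", 2, 3, b"c"),
    fragment("batch-1", 0, 3, b"a"),
    fragment("batch-1", 1, 3, b"b"),
]
print(defragment(incoming))  # {'batch-1': b'abc'}
```

If one fragment is missing, nothing is emitted for that identifier in this sketch, which is why the flowfiles must already carry consistent fragment attributes for the approach to work.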
The original question is from 2017, but I hope it helps out somebody else looking for an answer.
You can check the original answer here https://stackoverflow.com/questions/56356164/nifi-merging-all-of-necessary-flowfile-in-one-shot-with...