I need to extract data from a relational database and load it into an S3 bucket. I have a 5-node cluster and use a "GenerateTableFetch" (primary node) --> "ExecuteSQL" (all nodes) combination to read the data in parallel. I also need to merge the extracted data into a single file before loading it into S3, but my "MergeContent" processor produces multiple files in S3. Is there a way to get this done? The full flow looks like this:
"GenerateTableFetch" --> "ExecuteSQL" --> "MergeContent" --> "ConvertAvroToJSON" --> "UpdateAttribute" --> "CompressContent" --> "PutS3Object"
Can you please share more details about how your MergeContent processor is configured?
Refer to the community links on how to configure the MergeContent processor.
You need to set Minimum Group Size according to your requirement (for example 1 B, 1 KB, 1 MB, 1 GB).
In my configuration I set Minimum Group Size to 10 MB. This is the minimum size of the bundle.
Suppose each of your flowfiles is 1 MB. The processor waits until the group size reaches 10 MB and then bundles the flowfiles into one (that is, 10 flowfiles are merged into 1 flowfile by the MergeContent processor).
If the flowfiles do not meet the Minimum Group Size requirement, they wait in the queue before the MergeContent processor until the minimum group size is reached.
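To make the 10 MB example concrete, here is a minimal Python sketch of the idea. It is only a toy model of the behaviour described above, not NiFi's actual bin-packing code, and the function name `bundle_by_min_size` is made up for illustration.

```python
MB = 1024 * 1024  # bytes in one megabyte (assumption: binary megabytes)


def bundle_by_min_size(flowfile_sizes, min_group_size):
    """Toy model: collect flowfile sizes in arrival order and emit a
    bundle as soon as its total size reaches min_group_size.

    Returns (bundles, waiting), where `waiting` holds the flowfiles
    that have not yet reached the minimum and would stay queued.
    """
    bundles = []
    current = []
    current_size = 0
    for size in flowfile_sizes:
        current.append(size)
        current_size += size
        if current_size >= min_group_size:
            bundles.append(current)
            current = []
            current_size = 0
    return bundles, current


# Ten 1 MB flowfiles with a 10 MB minimum: one merged bundle, nothing waiting.
bundles, waiting = bundle_by_min_size([1 * MB] * 10, 10 * MB)
print(len(bundles), len(waiting))

# Five 1 MB flowfiles never reach 10 MB: no bundle, all five keep waiting.
bundles, waiting = bundle_by_min_size([1 * MB] * 5, 10 * MB)
print(len(bundles), len(waiting))
```

The second call shows why a minimum size alone can leave flowfiles sitting in the queue.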
How to force merge flowfiles?
By specifying the Max Bin Age property.
No matter how many Flowfiles have been assigned to a given bin, that bin will be merged once the bin has existed for this amount of time.
Suppose Minimum Group Size is 10 MB, Max Bin Age is set to 10 min, and only 5 flowfiles totalling 5 MB are queued before the MergeContent processor.
On size alone, the queue would never meet the Minimum Group Size requirement, so the flowfiles would stay queued forever. To overcome this, we set Max Bin Age to 10 min: once the bin has existed for 10 min, the processor merges the flowfiles even though they have not met the minimum group size.
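The interaction between the two properties can be sketched as a small decision function. This is a simplified model under the assumptions in this answer (size threshold or age timeout), not the processor's real logic, and `should_merge` is a hypothetical name.

```python
MB = 1024 * 1024  # bytes in one megabyte (assumption: binary megabytes)


def should_merge(bin_size_bytes, bin_age_seconds,
                 min_group_size_bytes, max_bin_age_seconds=None):
    """Toy model of when a bin gets merged.

    A bin is merged when it has reached the minimum group size, or,
    if a max bin age is configured, when the bin is at least that old.
    """
    if bin_size_bytes >= min_group_size_bytes:
        return True
    if max_bin_age_seconds is not None and bin_age_seconds >= max_bin_age_seconds:
        return True
    return False


# 5 MB bin, 10 MB minimum, 10 min (600 s) max bin age.
print(should_merge(5 * MB, 60, 10 * MB, 600))     # after 1 minute
print(should_merge(5 * MB, 600, 10 * MB, 600))    # after 10 minutes
print(should_merge(5 * MB, 10**6, 10 * MB))       # no max bin age configured
```

With no max bin age configured, the undersized bin in this model is never merged, which matches the "queued forever" case above.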
For all the other properties of the MergeContent processor, please refer to the links I mentioned above.
Let me know if you have any questions.
1. Change the number of entries and the wait time used for the merge.
2. You can add an EnforceOrder processor (running on the primary node only).
3. Set FirstInFirstOutPrioritizer on all connections.