Created on 05-16-2023 03:01 AM - edited 05-16-2023 03:17 AM
I have two flows:
1. ConsumeKafkaRecord --> MergeRecord --> PutHDFS
2. ConsumeKafkaRecord --> MergeContent --> PutHDFS
With flow 1, the files that land in HDFS are readable without problems by Spark, databases, and Python libraries, but the files never grow larger than about 200 MB and are all different sizes, even though 500 MB is configured; the bin is never filled.
With flow 2, using the same size (MB) and line-count parameters, the files I get are exactly 500 MB, but these files cannot be opened by Spark, by any database, or by Python libraries.
Question: why?
I also want large files, always 500 MB, that can be read without problems as in flow 1.
MergeContent settings (screenshot)
MergeRecord settings (screenshot)
Created 05-30-2023 02:22 PM
@VLban
MergeContent and MergeRecord handle the merging of FlowFiles' content differently. Since your FlowFiles already contain JSON-formatted records, MergeContent is not the correct processor to use.
MergeContent does not care about the data/content format of the inbound FlowFiles (except for Avro). With Binary Concatenation, one FlowFile's content bytes are simply written starting at the end of the previous FlowFile's content. So in the case of JSON, the resulting merged FlowFile's content is no longer valid JSON.
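You can reproduce the difference outside of NiFi. Here is a minimal Python sketch (the two record strings are made-up examples, and this is not NiFi code) showing why a byte-level concatenation of two JSON documents no longer parses, while a record-aware merge like the one MergeRecord performs produces valid JSON:

import json

# Two FlowFiles whose contents are each a valid JSON record.
flowfile_a = '{"id": 1, "name": "a"}'
flowfile_b = '{"id": 2, "name": "b"}'

# MergeContent with Binary Concatenation: bytes appended back to back.
concatenated = flowfile_a + flowfile_b
try:
    json.loads(concatenated)
except json.JSONDecodeError as e:
    print("invalid JSON:", e)   # Spark, databases, etc. fail the same way

# A record-aware merge (conceptually what MergeRecord's reader/writer
# controller services do) combines the records into one valid document.
merged = json.dumps([json.loads(flowfile_a), json.loads(flowfile_b)])
print(json.loads(merged))       # parses cleanly as a list of records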
Both processors bin FlowFiles each time the processor executes based on its run schedule. At the end of each bin cycle, the bins are evaluated to see if both configured minimums are satisfied; if so, the bin is merged. Setting a max does not mean that the bin will wait to be merged until the max has been met. So if you always want files of at least 500 MB, you would be better off setting your minimum to 500 MB and your maximum to a value a bit larger than that. Doing so may result in bins that have, say, 480 MB binned where the next FlowFile can't be added because it would exceed the configured max (that FlowFile is placed in a new bin). The Max Bin Age property, when set, forces a bin to merge once the bin has existed for the configured max bin age; this avoids FlowFiles getting stuck in these merge-based processors.
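To make the min/max/age interaction concrete, here is a simplified Python sketch of that bin decision logic. This is illustrative only, not NiFi's actual implementation, and the 500 MB / 550 MB / 5 minute values are just example settings in the spirit of the advice above:

from dataclasses import dataclass, field
import time

MB = 1024 * 1024

@dataclass
class Bin:
    min_size: int = 500 * MB        # merge as soon as this is reached
    max_size: int = 550 * MB        # a bit larger than the minimum
    max_age_secs: float = 300.0     # safety valve so bins never get stuck
    created: float = field(default_factory=time.time)
    size: int = 0

    def offer(self, flowfile_size: int) -> bool:
        """Add a FlowFile unless it would push the bin past max_size."""
        if self.size + flowfile_size > self.max_size:
            return False            # caller starts a new bin for this FlowFile
        self.size += flowfile_size
        return True

    def ready_to_merge(self) -> bool:
        # A bin merges once the minimum is satisfied -- it does NOT
        # wait for max_size -- or once it has existed for max_age_secs.
        return (self.size >= self.min_size
                or time.time() - self.created >= self.max_age_secs)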
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.
Thank you,
Matt