Error with NiFi accumulating records into one file

Contributor

I have two flows:
1. ConsumeKafkaRecord -> MergeRecord -> PutHDFS
2. ConsumeKafkaRecord -> MergeContent -> PutHDFS

With flow 1, the files that land in HDFS are readable without problems by Spark, databases, and Python libraries, but they never exceed 200 MB and vary in size, even though 500 MB is set as the maximum; the bins never fill to that size.

With flow 2, using the same size and record-count parameters, the files come out at exactly 500 MB, but they cannot be opened by Spark, by any database, or by any Python library.

Why is that?
I want the output files to always be large (500 MB) and readable without problems, as with flow 1.
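
For context, a minimal read check along these lines (the HDFS path is hypothetical) succeeds on the flow 1 output and fails on the flow 2 output:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-check").getOrCreate()

    # Files written by flow 1 (MergeRecord + ParquetRecordSetWriter) read fine:
    df = spark.read.parquet("hdfs:///data/kafka/merged/")  # hypothetical path
    print(df.count())

    # Pointing the same reader at flow 2's output (MergeContent with
    # Binary Concatenation) raises an error: the concatenated bytes are
    # not a valid file in any single format.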

MergeContent settings

Merge Strategy: Bin-Packing Algorithm
Merge Format: Binary Concatenation
Attribute Strategy: Keep Only Common Attributes
Correlation Attribute Name: No value set
Minimum Number of Entries: 10000
Maximum Number of Entries: 1000000
Minimum Group Size: 100 MB
Maximum Group Size: 500 MB
Max Bin Age: No value set
Maximum number of Bins: 10
Delimiter Strategy: Text
Header: No value set
Footer: No value set
Demarcator: \n

MergeRecord settings

Record Reader: JsonTreeReader
Record Writer: ParquetRecordSetWriter
Merge Strategy: Bin-Packing Algorithm
Correlation Attribute Name: No value set
Attribute Strategy: Keep Only Common Attributes
Minimum Number of Records: 10000
Maximum Number of Records: 1000000
Minimum Bin Size: 100 MB
Maximum Bin Size: 500 MB
Max Bin Age: No value set
Maximum Number of Bins: 10
1 ACCEPTED SOLUTION

Master Mentor

@VLban 

MergeContent and MergeRecord handle the merging of FlowFiles' content differently. Since your FlowFiles already contain JSON-formatted records, MergeContent is not the correct processor to use.

MergeContent does not care about the data/content format of the inbound FlowFiles (except for Avro). With Binary Concatenation, one FlowFile's content bytes are simply written starting at the end of the previous FlowFile's content. So in the case of JSON, the resulting merged FlowFile's content is no longer valid JSON.
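
As a quick illustration (a minimal sketch, not NiFi code): concatenating two valid JSON documents, even with a newline demarcator between them, yields content that a standard JSON parser rejects.

    import json

    # Two FlowFiles whose content is each a valid JSON document
    # (e.g., an array of records, as a record-based consumer might emit).
    flowfile_1 = '[{"id": 1}, {"id": 2}]'
    flowfile_2 = '[{"id": 3}, {"id": 4}]'

    # Binary Concatenation with a "\n" demarcator just appends bytes.
    merged = flowfile_1 + "\n" + flowfile_2

    json.loads(flowfile_1)  # fine: a single valid document
    try:
        json.loads(merged)  # fails: two top-level documents in one file
    except json.JSONDecodeError as e:
        print(f"Not valid JSON after concatenation: {e}")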

 

Both processors bin FlowFiles each time the processor executes based on its run schedule. At the end of each binning cycle, the bins are evaluated to see whether both configured minimums are satisfied; if so, the bin is merged. Setting a maximum does not mean the bin will wait to merge until the maximum has been met. So if you always want files of at least 500 MB, set your minimum to 500 MB and set your maximum a bit larger than that. Doing so may result in a bin that has, say, 480 MB binned while the next FlowFile can't be added because it would exceed the configured maximum (that FlowFile is placed in a new bin). The Max Bin Age property, when set, forces a bin to merge once it has existed for the configured age, which avoids FlowFiles getting stuck in these merge-based processors.
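
As a rough sketch of that decision logic (illustrative only, not NiFi's actual implementation; all names and thresholds here are assumptions):

    import time

    MIN_RECORDS = 10_000                 # configured minimum record count
    MIN_BYTES = 500 * 1024**2            # suggested minimum: 500 MB
    MAX_BYTES = 550 * 1024**2            # max set a bit above the min
    MAX_BIN_AGE_SECONDS = 300            # force-merge safety net

    def should_merge(bin_records, bin_bytes, bin_created_at):
        # A bin merges as soon as BOTH minimums are satisfied -- the
        # maximums only cap what a bin may hold; they never trigger a merge.
        if bin_records >= MIN_RECORDS and bin_bytes >= MIN_BYTES:
            return True
        # Max Bin Age forces a merge so FlowFiles never get stuck.
        if time.time() - bin_created_at >= MAX_BIN_AGE_SECONDS:
            return True
        return False

    def fits_in_bin(bin_bytes, flowfile_bytes):
        # A FlowFile that would push the bin past the max goes to a new bin.
        return bin_bytes + flowfile_bytes <= MAX_BYTES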

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.

Thank you,

Matt
