Created on 05-16-2023 03:01 AM - edited 05-16-2023 03:17 AM
I have two flows:
1. ConsumeKafkaRecord --> MergeRecord --> PutHDFS
2. ConsumeKafkaRecord --> MergeContent --> PutHDFS
With flow 1, the files that land in HDFS are readable without problems by Spark, databases, and Python libraries, but the files never grow larger than about 200 MB and are all different sizes, even though 500 MB is configured; the bin is never filled.
With flow 2, using the same size (MB) and line-count parameters, the files I get are exactly 500 MB, but these files cannot be opened by Spark, by any database, or by Python libraries.
Question: why?
I also want large files, always 500 MB, that can be read without problems as in flow 1.
MergeContent settings (screenshot)
MergeRecord settings (screenshot)
Created 05-30-2023 02:22 PM
@VLban
MergeContent and MergeRecord handle the merging of FlowFiles' content differently. Since your FlowFiles already contain JSON-formatted records, MergeContent is not the correct processor to use.
MergeContent does not care about the data/content format of the inbound FlowFiles (except for Avro). With Binary Concatenation, one FlowFile's content bytes are simply written starting at the end of the previous FlowFile's content. So in the case of JSON, the resulting merged FlowFile's content is no longer valid JSON.
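You can reproduce the difference outside of NiFi. Here is a minimal Python sketch (the two record strings are made-up examples, and this is not NiFi code) showing why a byte-level concatenation of two JSON documents no longer parses, while a record-aware merge like the one MergeRecord performs produces valid JSON:

import json

# Two FlowFiles whose contents are each a valid JSON record.
flowfile_a = '{"id": 1, "name": "a"}'
flowfile_b = '{"id": 2, "name": "b"}'

# MergeContent with Binary Concatenation: bytes appended back to back.
concatenated = flowfile_a + flowfile_b
try:
    json.loads(concatenated)
except json.JSONDecodeError as e:
    print("invalid JSON:", e)   # Spark, databases, etc. fail the same way

# A record-aware merge (conceptually what MergeRecord's reader/writer
# controller services do) combines the records into one valid document.
merged = json.dumps([json.loads(flowfile_a), json.loads(flowfile_b)])
print(json.loads(merged))       # parses cleanly as a list of records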
Both processors bin FlowFiles each time the processor executes based on its run schedule. At the end of each bin cycle, the bins are evaluated to see if both configured minimums are satisfied; if so, the bin is merged. Setting a max does not mean that the bin will wait to be merged until the max has been met. So if you always want files of at least 500 MB, you would be better off setting your minimum to 500 MB and your maximum to a value a bit larger than that. Doing so may result in bins that have, say, 480 MB binned where the next FlowFile can't be added because it would exceed the configured max (that FlowFile is placed in a new bin). The Max Bin Age property, when set, forces a bin to merge once the bin has existed for the configured max bin age; this avoids FlowFiles getting stuck in these merge-based processors.
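To make the min/max/age interaction concrete, here is a simplified Python sketch of that bin decision logic. This is illustrative only, not NiFi's actual implementation, and the 500 MB / 550 MB / 5 minute values are just example settings in the spirit of the advice above:

from dataclasses import dataclass, field
import time

MB = 1024 * 1024

@dataclass
class Bin:
    min_size: int = 500 * MB        # merge as soon as this is reached
    max_size: int = 550 * MB        # a bit larger than the minimum
    max_age_secs: float = 300.0     # safety valve so bins never get stuck
    created: float = field(default_factory=time.time)
    size: int = 0

    def offer(self, flowfile_size: int) -> bool:
        """Add a FlowFile unless it would push the bin past max_size."""
        if self.size + flowfile_size > self.max_size:
            return False            # caller starts a new bin for this FlowFile
        self.size += flowfile_size
        return True

    def ready_to_merge(self) -> bool:
        # A bin merges once the minimum is satisfied -- it does NOT
        # wait for max_size -- or once it has existed for max_age_secs.
        return (self.size >= self.min_size
                or time.time() - self.created >= self.max_age_secs)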
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.
Thank you,
Matt