Created 11-01-2016 07:32 AM
Each incoming Avro message received from Kafka contains its own schema. When persisting, I want to group the messages into sizeable chunks (say 250 MB per file) and write them to HDFS. However, if we combine the messages along with their schemas, the resulting file becomes unparsable because the schema repeats. Is it possible to strip the schema and keep a static reference to it, writing only the data content of each Avro message?
Can the SplitContent processor be used to strip the schema part?
Created 11-01-2016 01:12 PM
The MergeContent processor has a "Merge Format" of "Avro", which merges Avro messages that have the same schema. If you are sending in Avro messages with different schemas, you will want to use the "Correlation Attribute Name" property so that only messages with the same schema are merged together.
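To illustrate the grouping idea, here is a rough Python sketch of binning by a correlation attribute until a minimum group size is reached (in MergeContent the size threshold corresponds to the "Minimum Group Size" property). This is not NiFi code: the `FlowFile` class, the `bin_by_attribute` function, and the `schema.name` attribute are hypothetical names made up for the example, and flushing leftover bins at the end is an assumption of the sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi FlowFile: attributes plus content."""
    attributes: dict
    content: bytes


def bin_by_attribute(flowfiles, correlation_attribute, min_group_size):
    """Yield (key, [flowfiles]) bundles, one bin per attribute value.

    A bin is emitted as soon as its total content size reaches
    min_group_size bytes. Bins that never fill up are emitted at the end
    (an assumption of this sketch).
    """
    bins = defaultdict(list)
    sizes = defaultdict(int)
    for ff in flowfiles:
        key = ff.attributes.get(correlation_attribute)
        bins[key].append(ff)
        sizes[key] += len(ff.content)
        if sizes[key] >= min_group_size:
            yield key, bins.pop(key)
            del sizes[key]
    for key, group in bins.items():
        yield key, group


if __name__ == "__main__":
    incoming = [
        FlowFile({"schema.name": "orders"}, b"aaaa"),
        FlowFile({"schema.name": "users"}, b"bbbb"),
        FlowFile({"schema.name": "orders"}, b"cccc"),
    ]
    for key, group in bin_by_attribute(incoming, "schema.name", 8):
        print(key, len(group))
```

Each emitted bundle contains only messages that share the same attribute value, so every bundle can be merged into one file with one schema.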
Created 11-01-2016 10:14 PM
As MergeContent just concatenates the binary content of the files, the resulting Avro file can no longer be parsed, because it would contain more than one header with the schema defined. Is there an option or workaround to strip the schema header from each message's binary content before merging? We have the schema (assume the same schema for all messages) in a static file that can be referenced at any time, and we want the final merged file to contain only the data, without the header and schema details.
Created 11-01-2016 10:52 PM
When you choose a "Merge Format" of "Avro", it does not do binary concatenation. It merges all of the Avro files into a new, valid Avro file with a single header/schema entry. What you described would be a "Merge Format" of "Binary Concatenation".
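To make the difference concrete, here is a small standard-library Python sketch of the Avro object container layout (magic bytes, metadata map holding the schema, sync marker, data blocks). It is a toy, not NiFi's implementation and not a full Avro library: it assumes the "null" codec and records with a single string field, and the function names and the `Event` schema are made up for the example.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # first four bytes of every Avro container file
SCHEMA = json.dumps({"type": "record", "name": "Event",  # hypothetical schema
                     "fields": [{"name": "msg", "type": "string"}]})

def write_long(n):
    """Avro long: zigzag, then variable-length base-128 encoding."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def read_long(buf):
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7

def write_bytes(b):
    return write_long(len(b)) + b

def read_bytes(buf):
    return buf.read(read_long(buf))

def write_container(schema_json, records):
    """Header (magic, metadata map, sync marker) followed by one data block."""
    sync = os.urandom(16)
    meta = {"avro.schema": schema_json.encode(), "avro.codec": b"null"}
    out = bytearray(MAGIC) + write_long(len(meta))
    for key, value in meta.items():
        out += write_bytes(key.encode()) + write_bytes(value)
    out += write_long(0) + sync  # a zero count ends the metadata map
    body = b"".join(write_bytes(r.encode()) for r in records)
    out += write_long(len(records)) + write_long(len(body)) + body + sync
    return bytes(out)

def read_container(data):
    """Return (schema_json, records); raise ValueError on a malformed file."""
    buf = io.BytesIO(data)
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while (count := read_long(buf)) != 0:
        if count < 0:  # negative count: a byte size follows
            read_long(buf)
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode()
            meta[key] = read_bytes(buf)
    sync = buf.read(16)
    records = []
    while buf.tell() < len(data):
        count = read_long(buf)
        read_long(buf)  # block size in bytes, unused by this toy reader
        if count < 0:
            raise ValueError("invalid block count")
        records += [read_bytes(buf).decode() for _ in range(count)]
        if buf.read(16) != sync:
            raise ValueError("sync marker mismatch")
    return meta["avro.schema"].decode(), records

def merge_containers(files):
    """Decode every file, then write all records under a single header."""
    parsed = [read_container(f) for f in files]
    schemas = {schema for schema, _ in parsed}
    if len(schemas) != 1:
        raise ValueError("all inputs must share one schema")
    return write_container(schemas.pop(), [r for _, rs in parsed for r in rs])
```

With two files `a` and `b` produced by `write_container`, `read_container(a + b)` raises `ValueError` because the reader runs into the second file's header where it expects a data block, while `read_container(merge_containers([a, b]))` returns the records of both files under one schema.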
You can see the possible values for "Merge Format" here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/i...
There are also unit tests that show it merging Avro records and then parsing the resulting Avro:
Created 11-02-2016 03:45 AM
Thank you. Avro - merge format matches our requirement!