NiFi: Avro message processing from Kafka

Expert Contributor

Each incoming Avro message received from Kafka contains its own schema. When persisting, I want to group the messages into sizeable chunks (say 250 MB per file) and write them to HDFS. However, if we combine the messages along with their schemas, the entire file becomes unparsable because the schema repeats. Is it possible to strip the schema and keep a static reference to it, writing only the data content of each Avro message?

Can the SplitContent processor be used to strip the schema part?

1 ACCEPTED SOLUTION

Master Guru

The MergeContent processor has a "Merge Format" of "Avro", which merges Avro messages that have the same schema. If you are sending in Avro messages with different schemas, you will want to use the "Correlation Attribute Name" property so that only messages with the same schema are merged together.
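The idea of correlating by schema before merging can be sketched outside NiFi as follows. This is only an illustration, not how MergeContent is implemented: `schema_key`, `bin_messages`, and the `min_bin_bytes` threshold are hypothetical names, and hashing the schema's sorted-key JSON is a simplification chosen here to stand in for a correlation attribute.

```python
import hashlib
import json
from collections import defaultdict


def schema_key(schema):
    """Stable key for a schema: SHA-256 of its JSON with sorted keys (a simplification)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def bin_messages(messages, min_bin_bytes):
    """Group (schema, payload) pairs by schema; close a bin once it reaches min_bin_bytes.

    Returns (full_bins, pending): full_bins is a list of (key, payloads) ready to
    merge, pending maps each key to the payloads still waiting for more data.
    """
    pending = defaultdict(list)
    sizes = defaultdict(int)
    full_bins = []
    for schema, payload in messages:
        key = schema_key(schema)
        pending[key].append(payload)
        sizes[key] += len(payload)
        if sizes[key] >= min_bin_bytes:
            full_bins.append((key, pending.pop(key)))
            del sizes[key]
    return full_bins, dict(pending)


# Two hypothetical schemas arriving interleaved on the same topic.
user = {"type": "record", "name": "User", "fields": [{"name": "id", "type": "long"}]}
order = {"type": "record", "name": "Order", "fields": [{"name": "total", "type": "double"}]}
messages = [(user, b"aaaa"), (order, b"bb"), (user, b"cccc"), (order, b"dd")]

full, waiting = bin_messages(messages, min_bin_bytes=8)
print(len(full), "bin(s) ready;", len(waiting), "schema(s) still waiting")
```

Each closed bin holds payloads for exactly one schema, so a later merge step never has to reconcile different headers.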

REPLIES

Expert Contributor

Since MergeContent just concatenates the binary content of the files, the resulting Avro file can no longer be parsed, because it contains more than one header with the schema defined. Is there an option or workaround to strip the schema header from each message's binary content before merging? We have the schema (assume a single, shared schema) in a static file that can be referenced at any time, and we want the final merged file to contain only the data, not the header with the schema.

Master Guru

When you choose a "Merge Format" of "Avro", it does not do binary concatenation. It merges all of the Avro files into a new, valid Avro file with a single header/schema entry. What you described is the "Merge Format" of "Binary Concatenation".

You can see the possible values for "Merge Format" here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/i...

There are also unit tests that show it merging Avro records and then parsing the resulting Avro:

https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-proce...
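To make the difference between the two merge formats concrete, here is a minimal standard-library sketch of a header-aware merge of Avro object container files. It is not NiFi's code and it assumes uncompressed (`null` codec) files that share one schema; `read_container`, `write_container`, `merge_avro`, and the sample `Event` schema are names invented for this example. In practice an Avro library would do this work.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def write_long(out, n):
    """Write an Avro long: zigzag encoding, then base-128 varint."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    """Read an Avro long (inverse of write_long)."""
    shift = result = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def write_bytes(out, data):
    write_long(out, len(data))
    out.write(data)


def read_bytes(inp):
    return inp.read(read_long(inp))


def write_container(schema, blocks, sync):
    """Write one header (schema + codec + sync marker), then the data blocks."""
    out = io.BytesIO()
    out.write(MAGIC)
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    write_long(out, len(meta))
    for key, value in meta.items():
        write_bytes(out, key.encode())
        write_bytes(out, value)
    write_long(out, 0)  # end of the metadata map
    out.write(sync)
    for count, payload in blocks:
        write_long(out, count)
        write_bytes(out, payload)
        out.write(sync)
    return out.getvalue()


def read_container(data):
    """Return (schema, [(record_count, payload), ...]) for one container file."""
    inp = io.BytesIO(data)
    if inp.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        n = read_long(inp)
        if n == 0:
            break
        if n < 0:  # negative count is followed by the map block's byte size
            n = -n
            read_long(inp)
        for _ in range(n):
            key = read_bytes(inp).decode()
            meta[key] = read_bytes(inp)
    sync = inp.read(16)
    blocks = []
    while inp.tell() < len(data):
        count = read_long(inp)
        payload = read_bytes(inp)
        if inp.read(16) != sync:
            raise ValueError("sync marker mismatch: file is corrupt")
        blocks.append((count, payload))
    return json.loads(meta["avro.schema"]), blocks


def merge_avro(files):
    """Merge same-schema container files into one file with a single header."""
    schema, merged = None, []
    for data in files:
        file_schema, blocks = read_container(data)
        if schema is None:
            schema = file_schema
        elif file_schema != schema:
            raise ValueError("schemas differ; correlate by schema before merging")
        merged.extend(blocks)
    return write_container(schema, merged, os.urandom(16))


def encode_ids(ids):
    """Encode records of the sample schema: each record is a single long."""
    out = io.BytesIO()
    for i in ids:
        write_long(out, i)
    return out.getvalue()


def decode_ids(count, payload):
    inp = io.BytesIO(payload)
    return [read_long(inp) for _ in range(count)]


SCHEMA = {"type": "record", "name": "Event", "fields": [{"name": "id", "type": "long"}]}
a = write_container(SCHEMA, [(2, encode_ids([1, 2]))], os.urandom(16))
b = write_container(SCHEMA, [(1, encode_ids([3]))], os.urandom(16))

merged = merge_avro([a, b])
_, merged_blocks = read_container(merged)
print([i for count, payload in merged_blocks for i in decode_ids(count, payload)])
```

Calling `read_container(a + b)` on the plain concatenation fails with a `ValueError`, because the second file's header bytes are read as if they were a data block. That is the failure described in the question, and it is what the header-aware merge avoids.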

Expert Contributor

Thank you. The Avro merge format matches our requirement!