
NiFi InferAvroSchema

Super Collaborator

Hi,

How does InferAvroSchema work in the flow? Does it keep inferring the schema for every single flow file, and is that a good approach? Shouldn't we use ConvertCSVToAvro instead, providing an .avsc file created by Kite?

Thanks.

Avijeet

1 ACCEPTED SOLUTION

Master Guru

Each CSV or JSON file that comes into InferAvroSchema could be different, so it will infer the schema for each flow file and put the schema wherever you specify as the schema destination, either the flow file content or a flow file attribute. You can then use that attribute as the schema in ConvertCSVToAvro by referencing ${inferred.avro.schema}.

If you are sending only one type of CSV into ConvertCSVToAvro, it would be more efficient to define the Avro schema you want yourself and not use InferAvroSchema.
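For illustration, here is a minimal hand-written .avsc (the record and field names are made up for this example, not taken from the thread) that could be given to ConvertCSVToAvro as a static schema instead of referencing ${inferred.avro.schema}:

```json
{
  "type": "record",
  "name": "UserEvent",
  "namespace": "com.example.nifi",
  "fields": [
    { "name": "id",       "type": "long" },
    { "name": "username", "type": "string" },
    { "name": "amount",   "type": ["null", "double"], "default": null },
    { "name": "event_ts", "type": "string" }
  ]
}
```

Keeping the schema fixed like this means it does not have to be re-inferred for every flow file, and every record is converted with the same, known types.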


6 REPLIES

Super Collaborator

Hi @Bryan Bende, thanks.

Won't that usually be the case, though, when a stream contains messages for one particular schema? I noticed Kafka is trying to implement something similar. Putting InferAvroSchema in a dataflow seems like a dangerous thing to do.

Master Guru

It depends on how you construct your dataflow in NiFi... You could set it up so that each logical stream has its own ConvertCSVToAvro processor, or you could have several processors feeding into the same ConvertCSVToAvro processor.

Kafka itself does not enforce anything related to a schema, but Confluent provides a schema registry with serializers and deserializers that can enforce that any message written to a topic conforms to the schema for that topic.

Master Guru

@Avijeet Dash Take a look at this template for some examples.

avroschemascenarios.xml


I have been using InferAvroSchema in dataflows for a while, and:

1. it infers the schema for each file on input,

2. it saves the schema into the ${inferred.avro.schema} attribute of that flow file,

3. it is not good for production use.

Since schema inference is only a guess, I would recommend inferring your schema once, double-checking it manually for correctness, and then using it as a static schema in the Convert...ToAvro processors (prepend RouteOnAttribute if you need different schemas). In production, this is what you want. Sometimes the data can be misleading for inference. For example, I have an input CSV with an empty column which is in fact a nullable long column. Schema inference cannot guess that it is a nullable long: for one input file, where the values are filled in as numbers, it guesses a long type, and for another, where the column is empty, it guesses a nullable string...
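To illustrate the nullable-long case (the column names here are invented for the example), a hand-written schema would pin the type down as a union, which inference cannot reliably guess from files where the column is sometimes empty:

```json
{
  "type": "record",
  "name": "Measurement",
  "fields": [
    { "name": "sensor_id", "type": "string" },
    { "name": "reading",   "type": ["null", "long"], "default": null }
  ]
}
```

With the ["null", "long"] union fixed by hand, every input file is converted consistently, whether or not the column happens to be empty in a given batch.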

Super Collaborator

Hi @Michal Klempa, I agree. Thanks.