Created 12-08-2016 09:04 AM
Hi,
How does InferAvroSchema work in the flow? Does it keep inferring the schema for every single flow file, and is that a good approach? Shouldn't we instead use ConvertCSVToAvro with an avsc file created by Kite?
Thanks.
Avijeet
Created 12-08-2016 02:25 PM
Each CSV or JSON file that comes into InferAvroSchema could be different, so it will infer the schema for each flow file and put the schema where you specify the schema destination, either the flow file content or a flow file attribute. Then you can use that attribute in ConvertCsvToAvro as the schema by referencing ${inferred.avro.schema}.
If you are sending only one type of CSV into ConvertCsvToAvro, then it would be more efficient to define the Avro schema you want and not use InferAvroSchema.
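As a rough illustration (not from the original answer), a hand-written static schema for a simple two-column CSV could look like the snippet below; the record and field names are made up and would need to match your actual columns. You would supply this avsc content in the schema property of ConvertCsvToAvro instead of referencing ${inferred.avro.schema}:

  {
    "type": "record",
    "name": "example_record",
    "fields": [
      { "name": "id",   "type": "long" },
      { "name": "name", "type": "string" }
    ]
  }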
Created 12-08-2016 02:39 PM
Hi @Bryan Bende, thanks.
But won't it usually be the case that a stream contains messages for one particular schema? I noticed Kafka is trying to implement something similar. Putting an InferAvroSchema in a dataflow seems like a dangerous thing to do.
Created 12-08-2016 03:01 PM
It depends on how you construct your dataflow in NiFi... You could set it up so that you have several logical streams that each have their own ConvertCsvToAvro processor, or you could have several processors feeding into the same ConvertCsvToAvro processor.
Kafka itself does not enforce anything related to a schema, but Confluent has a schema registry with serializers and deserializers that can enforce that any message written to a topic conforms to the schema for that topic.
Created 12-08-2016 03:17 PM
@Avijeet Dash Take a look at this template for some examples.
Created 12-09-2016 08:02 AM
I have been using InferAvroSchema in dataflows for a while and:
1. It infers the schema for each file on input
2. It saves the schema into the ${inferred.avro.schema} attribute for that flow file
3. It is not good for production use
As schema inference is only a guess, I would recommend inferring your schema once (double-check it manually for correctness) and then using it as a static schema in the Convert...ToAvro processors (prepend RouteOnAttribute if you need different schemas). In production, this is what you want. Sometimes the data can be misleading for inference. For example, I have an input CSV with an empty column which is in fact a nullable long column. Schema inference cannot guess that it is a nullable long. So for one input file, where the values are filled in as numbers, it guesses the long type, and for another, where the column is empty, it guesses nullable string...
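To illustrate that last point with a sketch (the record and field names are hypothetical, not from the original post): a hand-written schema can declare the column explicitly as a nullable long using an Avro union type, which inference cannot reliably produce from the data alone:

  {
    "type": "record",
    "name": "example_record",
    "fields": [
      { "name": "id",     "type": "string" },
      { "name": "amount", "type": ["null", "long"], "default": null }
    ]
  }

With "amount" declared as ["null", "long"], files where the column is empty and files where it holds numbers both convert against the same schema.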
Created 12-12-2016 05:04 AM
Hi @Michal Klempa, I agree. Thanks.