Prior to release 1.2, Apache NiFi lacked the notion of a pluggable "Schema Registry"; format conversions between CSV, JSON, Avro, and other formats were handled by a collection of dedicated ConvertXToY processors. Starting with 1.2, NiFi supports a new concept: an abstraction called the "record". Any FlowFile can contain one or more record objects of a given format and schema. NiFi 1.2 offers a series of controller services that implement a "RecordReader" interface, with immediate support for Avro, JSON Tree, JSON Path, Grok, and CSV as source formats. In the same release, NiFi also provides controller services that implement a "RecordWriter" interface, with immediate support for Avro, JSON, Text, and CSV.
Each Reader and Writer can interact with another controller service that references a schema registry to retrieve schemas when necessary. An Avro schema registry and an HWX (Hortonworks) schema registry are immediately available in Apache NiFi 1.2. To use the Avro schema registry, the user supplies the actual schema text when configuring the "AvroSchemaRegistry" controller service, and that schema information is written to the flow.xml file. The alternative is the HWX schema registry, which provides a centralized place to store reusable schemas, along with compatibility information between different schema versions.
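For illustration, here is a minimal sketch of the kind of schema you might register. In the AvroSchemaRegistry, each schema is added as a user-defined property whose name is the schema name (here, the hypothetical "users") and whose value is the Avro schema text:

{
  "type" : "record",
  "name" : "users",
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "name", "type" : "string" },
    { "name" : "email", "type" : [ "null", "string" ], "default" : null }
  ]
}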
To further understand the record-based operation mechanism, think of a CSV file with 100 rows and 5 fields. A record-oriented processor can read the CSV bytes row by row (leveraging a Record Reader controller service), deserialize them into 100 record objects, and hold those record objects in memory. The same processor can then leverage a Record Writer controller service to serialize the in-memory record objects into the specified output format. On top of all that, a "QueryRecord" processor allows users to define free-form SQL queries to execute against the in-memory record objects and to map the SQL results to one or more output relationships; a sketch of such a query follows.
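As a sketch, assuming the hypothetical "users" schema above: each user-defined property added to QueryRecord becomes an output relationship, and its value is a SQL query run against the incoming records, where the table is always referenced as FLOWFILE. A property named "filtered" with the following value would route only the matching records to a "filtered" relationship:

SELECT id, name, email
FROM FLOWFILE
WHERE email IS NOT NULL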
Now let’s take a closer look at the use case below, in which I configure a record-based ConsumeKafka processor.
Use the ‘Schema Name’ property:
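A minimal sketch of that configuration, assuming NiFi 1.2's record-based Kafka consumer (ConsumeKafkaRecord_0_10) and the hypothetical "users" topic and schema from above; property names are paraphrased, not copied verbatim from the UI:

ConsumeKafkaRecord_0_10
  Kafka Brokers : localhost:9092
  Topic Name(s) : users
  Record Reader : CSVReader
  Record Writer : AvroRecordSetWriter

CSVReader (controller service)
  Schema Access Strategy : Use 'Schema Name' Property
  Schema Registry        : AvroSchemaRegistry
  Schema Name            : users

With this access strategy, the reader looks up the schema by name in the configured registry. The Schema Name property defaults to ${schema.name}, which resolves the name from a FlowFile attribute; here it is set to the literal name users, since records consumed from Kafka have no such attribute yet.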