Support Questions

How to Transform an Avro to a Super-Schema Avro?

Assume a data file in Avro schema "avro schema short".

Assume an "avro schema full" that includes all of the fields of "avro schema short" plus some additional fields, each with a default value.

The short-schema Avro file can have its fields in any order; field-wise it is a subset of the super-schema, and the shared fields are not necessarily in the same order.

How would one use NiFi out-of-the-box processors to transform the first data file into "avro schema full", setting the values of the additional fields to the defaults specified in "avro schema full"? A creative solution involving one of the Execute... or Scripted... processors would also be fine.
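To make the goal concrete, here is a minimal plain-Python sketch of the desired input and output (hypothetical record values; this only illustrates the transformation, it is not a NiFi solution):

```python
# A record written with the short schema: fields a and b, in any order.
short_record = {"b": "2", "a": "1"}

# Defaults declared by the full (super) schema for its additional fields.
full_schema_defaults = {
    "c": "default value for field c",
    "d": "default value for field d",
}

# Desired result: the same record, now valid under the full schema,
# with the missing fields filled in from the schema defaults.
full_record = {**short_record, **full_schema_defaults}
print(full_record)
```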

1 ACCEPTED SOLUTION


I agree with what Matt said. I had been working on a template to achieve this before I saw his answer, so I figured I would post it anyway...

I made up the following two schemas:

{"name": "shortSchema",
 "namespace": "nifi",
 "type": "record",
 "fields": [
   { "name": "a", "type": "string" },
   { "name": "b", "type": "string" }
 ]}

{"name": "fullSchema",
 "namespace": "nifi",
 "type": "record",
 "fields": [
   { "name": "c", "type": ["null", "string"], "default": "default value for field c" },
   { "name": "d", "type": ["null", "string"], "default": "default value for field d" },
   { "name": "a", "type": "string" },
   { "name": "b", "type": "string" }
 ]}

Then made the following flow:

(Screenshots: 16116-convertrecordprocessors.png — the processor flow; 16115-convertrecordservices.png — the controller-service configuration.)

What this flow does is the following:

  1. Generates a CSV with two rows and the columns a,b,c
  2. Reads the CSV using shortSchema and writes it as Avro with shortSchema
  3. Updates the schema.name attribute to fullSchema
  4. Reads the Avro using the embedded schema (shortSchema) and writes it using the schema named by schema.name (fullSchema)
  5. Reads the Avro using the embedded schema (now fullSchema) and writes it back to CSV

At the end, the CSV that comes out has the new fields filled in with their default values.

The CSV steps are only there to make it easy to see what is going on; they are not required if you already have Avro data.
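Conceptually, the schema-resolution that steps 3 and 4 perform (reading with the embedded shortSchema, writing with fullSchema) can be sketched in plain Python. This is a deliberate simplification, not how NiFi's record readers/writers are implemented: it handles only flat records and the "default" attribute, not the full Avro union and type-resolution rules:

```python
import json

# The reader (super) schema from the post, as JSON.
full_schema = json.loads("""
{"name": "fullSchema", "namespace": "nifi", "type": "record",
 "fields": [
   {"name": "c", "type": ["null", "string"], "default": "default value for field c"},
   {"name": "d", "type": ["null", "string"], "default": "default value for field d"},
   {"name": "a", "type": "string"},
   {"name": "b", "type": "string"}
 ]}
""")

def resolve(record, reader_schema):
    """Project a record onto the reader schema: keep matching fields
    (regardless of their original order) and fill missing ones from defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# A record that was written with shortSchema (fields a and b only):
print(resolve({"a": "1", "b": "2"}, full_schema))
```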

Here is a template of the flow: convert-avro-short-to-full.xml


3 REPLIES


As of NiFi 1.2.0, you can use ConvertRecord, after configuring an AvroReader with the short schema and an AvroRecordSetWriter with the super schema.

Prior to NiFi 1.2.0, you may be able to use ConvertAvroSchema, using the super-schema as both Input and Output Schema property values (if you use the short schema as Input, the processor complains about the unmapped fields in the super schema). I tried this by adding a single field to my "short schema" to make the "super schema":

{"name": "extra_field", "type": "string", "default": "Hello"}

I'm not sure if this will work with arbitrary super-sets, but it is worth a try 🙂

@Matt Burgess

Thank you so much.
