Read binary Avro data from Kafka using NiFi

Explorer

I want to read binary Avro data from Kafka using the ConsumeKafka processor. That part works, but the content type of the resulting flowfile is "application/octet-stream". I can't view it either; NiFi says "No viewer is registered for this content type". I can't even convert the Avro data to JSON, since the content type is octet-stream. But when I use "kafka-avro-console-consumer" on the console, the data is shown as JSON. How do I get this JSON data into NiFi?


Master Collaborator

@syntax_ ,

 

I believe you have a schema that you can use to parse your Avro data, right?

Instead of using ConsumeKafka, use the ConsumeKafkaRecord processor. In that processor, specify a Record Reader of type AvroReader and provide the correct schema so that the reader can properly deserialize your data.

 

If you want to convert the data to JSON, you can then specify a JsonRecordSetWriter as the Record Writer for that processor, so that the output flowfiles will be in that format and you'll be able to inspect the content of the queues.

 

Cheers,

André

 

--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Explorer

I tried the ConsumeKafkaRecord processor as well, with an AvroReader as the Record Reader. But when I run it, it gives an error saying "invalidMagicException: Not an Avro data file". Is there something I am missing here?

Master Collaborator

Would you be able to save one of these messages to a file and share it with me?


Explorer

Yes, sure. Do you need the data that I am sending to NiFi or the data that has been processed by NiFi?

Master Collaborator

If you could provide both it would help.

 


Explorer

Here is the output for "kafka-avro-console-consumer --from-beginning --bootstrap-server admin:9092 --topic pbs_jobs"

 

{"comment":"Job run at Thu Aug 18 at 04:21 on (node01:ncpus=1)","timestamp":1660963434048,"job_state":"Job Completed","host":"admin","job_id":"65.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:22 on (node01:ncpus=1)","timestamp":1660963434048,"job_state":"Job Completed","host":"admin","job_id":"66.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:22 on (node01:ncpus=1)","timestamp":1660963434048,"job_state":"Job Completed","host":"admin","job_id":"67.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:23 on (node02:ncpus=10)","timestamp":1660963434048,"job_state":"Job Completed","host":"admin","job_id":"68.admin","job_user":"root","job_group":"root"}
{"comment":"Not Running: Insufficient amount of resource: ncpus (R: 10 A: 1 T: 16)","timestamp":1660963434048,"job_state":"Job Queued","host":"admin","job_id":"69.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 07:13 on (node02:ncpus=1)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"70.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:16 on (node01:ncpus=1)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"64.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:21 on (node01:ncpus=1)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"65.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:22 on (node01:ncpus=1)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"66.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:22 on (node01:ncpus=1)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"67.admin","job_user":"root","job_group":"root"}
{"comment":"Job run at Thu Aug 18 at 04:23 on (node02:ncpus=10)","timestamp":1660963494042,"job_state":"Job Completed","host":"admin","job_id":"68.admin","job_user":"root","job_group":"root"}

 

 

 

But when I send this data to NiFi using ConsumeKafkaRecord:

Screenshot (277).png

 

Getting this as output: 

syntax__0-1662030243376.png

 

Master Collaborator

I actually wanted to have a look at the binary Avro data that is in Kafka, not the deserialized content.

Something like this:

kafka-console-consumer --from-beginning --bootstrap-server admin:9092 --topic pbs_jobs --max-messages 1 > message.avro

 

Cheers,

André

 


Explorer

Okay, this is the output. 

syntax__0-1662101229322.png

 

It is in this format

"dJob run at Thu Aug 18 at 04:16 on (node01:ncpus=1)▒▒đ▒`Job Completed
admin64.admirooroot

Master Collaborator

Can you please send me that file in a private message? Copy and paste won't work for binary data 🙂

 

Cheers,

André

 


Explorer

I tried sending it, but there seems to be no option to attach a file; only the insert-link option is there. Can you please help me with that? Sorry, I'm new to this website.

Master Collaborator

@syntax_ ,

 

Please try running this command: xxd message.avro

Then you can copy and paste the output here.

 

Cheers,

André

 


Explorer

Here is the output:


00000000: 0000 0000 2264 4a6f 6220 7275 6e20 6174  ...."dJob run at
00000010: 2054 6875 2041 7567 2031 3820 6174 2030   Thu Aug 18 at 0
00000020: 343a 3136 206f 6e20 286e 6f64 6530 313a  4:16 on (node01:
00000030: 6e63 7075 733d 3129 c4f0 c491 d760 1a4a  ncpus=1).....`.J
00000040: 6f62 2043 6f6d 706c 6574 6564 0a61 646d  ob Completed.adm
00000050: 696e 1036 342e 6164 6d69 6e08 726f 6f74  in.64.admin.root
00000060: 0872 6f6f 740a                           .root.

Master Collaborator

Thanks, do you have the Avro schema for it?


Explorer

Yes. 

{
  "name": "PbsJobData",
  "type": "record",
  "namespace": "pbsjob",
  "fields": [
    {
      "name": "comment",
      "type": "string",
      "doc": "comment associated with the pbs job"
    },
    {
      "name": "timestamp",
      "type": "long",
      "logicalType": "timestamp-millis",
      "doc": "timestamp of scheduler metric"
    },
    {
      "name": "job_state",
      "type": "string",
      "doc": "job status"
    },
    {
      "name": "host",
      "type": "string",
      "doc": "Name of the PBS scheduler server"
    },
    {
      "name": "job_id",
      "type": "string",
      "doc": "The job ID assigned by PBS"
    },
    {
      "name": "job_user",
      "type": "string",
      "doc": "The job user"
    },
    {
      "name": "job_group",
      "type": "string",
      "doc": "The job group"
    }
  ]
}

Master Collaborator

@syntax_ ,

 

It seems to me that your source Kafka is a Confluent Kafka cluster and the producer uses Schema Registry to source the schema from it. In this case, the KafkaAvroSerializer prepends 5 bytes to every message it produces: a magic byte (0x00) followed by the 4-byte ID of the schema that was used (in your case, schema ID 34, which is the 0x22 at the start of your dump). If you try to read this message as a pure Avro payload, the deserialization will fail because those 5 bytes are not part of the Avro payload.
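As a rough illustration of that framing, here is a minimal Python sketch that splits the message from the xxd output earlier in this thread into its schema ID and Avro payload. The function name `split_confluent_frame` is made up for this example, and the layout (one zero byte followed by a big-endian 4-byte ID) is the Confluent wire format as I understand it; NiFi does the equivalent internally.

```python
import struct
from typing import Tuple

# Raw Kafka message value, copied from the xxd output in this thread.
# The final 0x0a is the newline that kafka-console-consumer appends.
MESSAGE_HEX = """
0000 0000 2264 4a6f 6220 7275 6e20 6174
2054 6875 2041 7567 2031 3820 6174 2030
343a 3136 206f 6e20 286e 6f64 6530 313a
6e63 7075 733d 3129 c4f0 c491 d760 1a4a
6f62 2043 6f6d 706c 6574 6564 0a61 646d
696e 1036 342e 6164 6d69 6e08 726f 6f74
0872 6f6f 740a
"""

AVRO_FILE_MAGIC = b"Obj\x01"  # first 4 bytes of an Avro object container file
CONFLUENT_MAGIC = 0           # first byte of a Confluent-framed message


def split_confluent_frame(raw: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_payload) for a Confluent-framed message."""
    if len(raw) < 5:
        raise ValueError("message is shorter than the 5-byte prefix")
    # ">bI" = big-endian: 1 signed byte (magic) + 4-byte unsigned int (ID)
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != CONFLUENT_MAGIC:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, raw[5:]


raw = bytes.fromhex(MESSAGE_HEX)
schema_id, payload = split_confluent_frame(raw)

print(raw.startswith(AVRO_FILE_MAGIC))  # False: not an Avro data file
print(schema_id)                        # 34
```

The first check is why a reader that expects an Avro data file with an embedded schema rejects the message: the container-file magic bytes are simply not there.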

 

So, the best way to handle this in NiFi is to also use Schema Registry to deserialize the Avro messages. With this, NiFi will get the schema ID from the message's 5-byte prefix, use that ID to retrieve the correct schema from Schema Registry, and then correctly deserialize the Avro payload.
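To make the lookup step concrete, the sketch below fetches a schema by ID over the Confluent Schema Registry REST API (`GET /schemas/ids/{id}`). The registry address `http://localhost:8081` and the helper names are assumptions for illustration only, and the sketch ignores authentication and TLS; inside NiFi, the ConfluentSchemaRegistry controller service takes care of this step.

```python
import json
import urllib.request

# Hypothetical address; replace with your own Schema Registry URL.
REGISTRY_URL = "http://localhost:8081"


def schema_url(registry_url: str, schema_id: int) -> str:
    """Build the lookup URL for a schema ID (GET /schemas/ids/{id})."""
    return f"{registry_url.rstrip('/')}/schemas/ids/{schema_id}"


def fetch_schema(registry_url: str, schema_id: int) -> dict:
    """Fetch and parse the Avro schema registered under schema_id."""
    with urllib.request.urlopen(schema_url(registry_url, schema_id)) as resp:
        body = json.load(resp)
    # The registry returns the schema as a JSON-encoded string under "schema".
    return json.loads(body["schema"])


print(schema_url(REGISTRY_URL, 34))
# Against a live registry: schema = fetch_schema(REGISTRY_URL, 34)
```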

 

Assuming my guess is correct and you're using a Confluent Schema Registry, you should create a new ConfluentSchemaRegistry controller service and configure it with the details of your Schema Registry. Once this is done, edit the configuration of the AvroReader controller service and set the following:

araujo_0-1662355081260.png

 

After you do this, your flow should be able to correctly process the messages that you're reading from Kafka.

 

Using NiFi, I read the binary message that you sent me, loaded the schema into my local Schema Registry service (making sure it was assigned the right ID, 34), and was able to successfully convert the message from Avro to JSON using a ConvertRecord processor:
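Outside NiFi, the same conversion can be checked by hand. The sketch below is not what ConvertRecord does internally; it is a stdlib-only decoder, written for this one schema, that follows the Avro binary encoding (zigzag varints for longs, length-prefixed UTF-8 for strings). The field list is taken from the PbsJobData schema posted in this thread, and the helper names are made up for this example.

```python
import json

# Raw Kafka message value, copied from the xxd output in this thread.
MESSAGE_HEX = """
0000 0000 2264 4a6f 6220 7275 6e20 6174
2054 6875 2041 7567 2031 3820 6174 2030
343a 3136 206f 6e20 286e 6f64 6530 313a
6e63 7075 733d 3129 c4f0 c491 d760 1a4a
6f62 2043 6f6d 706c 6574 6564 0a61 646d
696e 1036 342e 6164 6d69 6e08 726f 6f74
0872 6f6f 740a
"""

# (name, type) pairs in schema order, from the PbsJobData schema.
FIELDS = [
    ("comment", "string"), ("timestamp", "long"), ("job_state", "string"),
    ("host", "string"), ("job_id", "string"), ("job_user", "string"),
    ("job_group", "string"),
]


def read_long(buf, pos):
    """Decode one zigzag varint (Avro long); return (value, next_pos)."""
    acc, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos


def read_string(buf, pos):
    """Decode a length-prefixed UTF-8 string; return (value, next_pos)."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode_record(payload):
    """Decode one PbsJobData record from a bare Avro payload."""
    record, pos = {}, 0
    for name, kind in FIELDS:
        reader = read_long if kind == "long" else read_string
        record[name], pos = reader(payload, pos)
    return record


payload = bytes.fromhex(MESSAGE_HEX)[5:]  # drop the 5-byte Confluent prefix
record = decode_record(payload)
print(json.dumps(record))
```

The decoded record is the job with job_id "64.admin", which matches what kafka-avro-console-consumer prints for that job.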

araujo_1-1662355234277.png

araujo_2-1662355261958.png

 

Cheers,

André


Explorer

Yes, it works. Thank you so much!!

 
