
NiFi converting json from Kafka to columnar ORC files - jsonToAvro very slow

New Contributor

I am using a NiFi cluster of 2 x c4.2xlarge machines (8 cores and 15 GB of memory each).

NiFi is set up to use 12 GB of heap:
# JVM memory settings
java.arg.2=-Xms12g
java.arg.3=-Xmx12g

The jsonToAvro processor (ConvertJSONToAvro) is running with 7 concurrent tasks and I get a throughput of 450 messages per second; message size is about 3 KB. The JSON-to-Avro conversion is the only slow part. While the workflow is running, all cores are above 90% utilization.

If I save the data from Kafka to a file and use orc-tools to convert it to an ORC file, I get a throughput of 5000 msg/sec on one machine.
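For reference, this baseline was produced with something like the following. The topic name, jar version, schema, and the exact flag names are from my environment and are assumptions here; check `java -jar orc-tools-*.jar convert --help` for your version:

```shell
# Dump messages from Kafka to a plain file, one JSON record per line.
kafka-console-consumer --bootstrap-server localhost:9092 \
    --topic my-topic --from-beginning > messages.json

# Convert the JSON file to ORC with the orc-tools uber-jar.
# Schema and flags are illustrative -- verify against your orc-tools version.
java -jar orc-tools-1.5.0-uber.jar convert messages.json \
    --schema "struct<id:int,payload:string>" -o messages.orc
```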

I configured NiFi as instructed in the best-practices article: https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.h...

What am I doing wrong?

Thank you.

(Screenshot attached: 40623-jsontoavro-slow.png)

1 ACCEPTED SOLUTION

You could improve the performance significantly by using the record-oriented capabilities introduced in Apache NiFi 1.2.0...

You would use ConsumeKafkaRecord_0_10 with a JsonTreeReader and an AvroRecordSetWriter, and set the batch size to something like 1000 (or more). This produces one flow file out of ConsumeKafkaRecord_0_10 that already contains the Avro records, so you can eliminate ConvertJSONToAvro, and possibly MergeContent as well, since each flow file will already hold a batch of records.
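To see why record batching helps, here is a toy Python sketch (not NiFi code, just an illustration): converting messages one at a time creates one output per message, while a record reader/writer pair batches many records into a single output, amortizing the per-flow-file overhead:

```python
import io
import json

messages = [json.dumps({"id": i, "value": "x" * 100}) for i in range(5000)]

# Per-message conversion (like ConvertJSONToAvro on individual flow files):
# one output "flow file" per input message.
per_message_outputs = [io.BytesIO(m.encode()) for m in messages]

# Record-oriented conversion (like ConsumeKafkaRecord_0_10 with a batch
# size of 1000): many records are written into a single output.
BATCH_SIZE = 1000
batched_outputs = []
for start in range(0, len(messages), BATCH_SIZE):
    buf = io.BytesIO()
    for m in messages[start:start + BATCH_SIZE]:
        buf.write(m.encode() + b"\n")
    batched_outputs.append(buf)

print(len(per_message_outputs))  # 5000 outputs
print(len(batched_outputs))      # 5 outputs
```

The per-message path pays framing and scheduling overhead 5000 times; the batched path pays it 5 times for the same data.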


5 REPLIES


New Contributor

Thank you for the suggestion. This looks very promising. I just need to figure out how the suggested components work. I will let you know how it goes.
Thank you.

New Contributor

By using ConsumeKafkaRecord_0_10 with a JsonTreeReader and an AvroRecordSetWriter, as Bryan suggested, I now get a throughput of 9600 msg/sec on the cluster (4800 msg/sec on each machine).

I could not remove MergeContent, though. If I do, I get very small files, approximately 0.5 MB each.
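For anyone hitting the same small-file issue: MergeContent's Minimum Group Size property is what controls when a bin is full enough to emit. A rough Python sketch of that bin-filling behavior (the 64 MB threshold is just an example value, not a NiFi default, and real MergeContent also honors record counts and timeouts):

```python
def merge_by_min_size(flowfiles, min_bytes):
    """Accumulate flow file payloads into bins, emitting a bin only once
    it reaches min_bytes -- roughly what MergeContent's 'Minimum Group
    Size' does (record-count limits and bin timeouts omitted)."""
    bins, current, current_size = [], [], 0
    for payload in flowfiles:
        current.append(payload)
        current_size += len(payload)
        if current_size >= min_bytes:
            bins.append(b"".join(current))
            current, current_size = [], 0
    if current:  # flush the final partial bin
        bins.append(b"".join(current))
    return bins

# 300 flow files of ~0.5 MB merged with a 64 MB minimum -> a few large files.
small_files = [b"x" * (512 * 1024) for _ in range(300)]
merged = merge_by_min_size(small_files, 64 * 1024 * 1024)
print([len(m) // (1024 * 1024) for m in merged])  # [64, 64, 22]
```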

Thank you

New Contributor

The ConvertAvroToORC processor was using only 2 concurrent tasks even though it was configured to use 4. After restarting the cluster, ConvertAvroToORC started using all 4 concurrent tasks, and the throughput is now 14600 msg/sec on the cluster (7300 msg/sec on each machine).
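The concurrent-tasks effect can be sketched outside NiFi too: with a per-record conversion step, more workers let more records be processed in parallel until the cores are saturated. A minimal Python illustration (a JSON round-trip stands in for the Avro-to-ORC work; this is an analogy, not NiFi's scheduler):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def convert(msg):
    # Stand-in for a per-record conversion step (parse + re-serialize).
    return json.dumps(json.loads(msg))

messages = [json.dumps({"id": i}) for i in range(1000)]

# NiFi's "Concurrent Tasks" is analogous to the pool size here. For truly
# CPU-bound work in Python you would use processes rather than threads,
# but the scaling idea is the same.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(convert, messages))

print(len(results))  # 1000 converted records
```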
