Support Questions


NiFi converting json from Kafka to columnar ORC files - jsonToAvro very slow


I am using a NiFi cluster of 2 x c4.2xlarge machines (8 cores and 15 GB memory each).

NiFi is set up to use 12 GB of heap:
# JVM memory settings
java.arg.2=-Xms12g
java.arg.3=-Xmx12g

The jsonToAvro processor (ConvertJSONToAvro) is running with 7 concurrent tasks and I get a throughput of 450 messages per second; message size is about 3 KB. ConvertJSONToAvro is the only slow part of the flow. When the flow is running, all cores are above 90% utilization.

If I save the data from Kafka to a file and use orc-tools to convert it to an ORC file, I get a throughput of 5000 msg/sec on one machine.

I configured NiFi as instructed in the best-practices article: https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.h...

What am I doing wrong?

Thank you.

(attached screenshot: 40623-jsontoavro-slow.png)

1 ACCEPTED SOLUTION

Master Guru

You could improve the performance significantly by using the record-oriented capabilities introduced in Apache NiFi 1.2.0...

You would use ConsumeKafkaRecord_0_10 with a JsonTreeReader and an AvroRecordSetWriter, and set the batch size to something like 1000 (or more). This produces a single flow file from ConsumeKafkaRecord_0_10 that already contains the Avro records, so you can eliminate ConvertJSONToAvro, and possibly MergeContent as well, since each flow file will already hold a batch of records.
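The record readers and writers work against a schema. As a minimal sketch, an Avro schema for the incoming JSON messages might look like the following (the field names here are hypothetical, since the original post does not show the message structure):

```json
{
  "type": "record",
  "name": "KafkaMessage",
  "namespace": "example",
  "fields": [
    {"name": "id",        "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "payload",   "type": ["null", "string"], "default": null}
  ]
}
```

With this in place, JsonTreeReader parses each incoming JSON object against the schema, and AvroRecordSetWriter serializes the whole batch as one Avro-encoded flow file.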


5 REPLIES



Thank you for the suggestion. This looks very promising. I just need to figure out how the suggested components work. I will let you know how it goes.
Thank you.


By using ConsumeKafkaRecord_0_10 with a JsonTreeReader and an AvroRecordSetWriter as Bryan suggested, I now get a throughput of 9600 msg/sec on the cluster (4800 msg/sec on each machine).

I could not remove the MergeContent processor, though. If I do, I get very small files of about 0.5 MB each.
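For anyone hitting the same small-files issue: MergeContent's bin-packing settings control how much data accumulates before a merged file is emitted. A sketch of the relevant properties (the values below are illustrative assumptions, not taken from this thread):

```
Merge Strategy            = Bin-Packing Algorithm
Merge Format              = Avro
Minimum Number of Entries = 10000
Minimum Group Size        = 64 MB
Maximum Group Size        = 256 MB
Max Bin Age               = 5 min
```

Raising Minimum Group Size forces larger output files, while Max Bin Age caps how long a partially filled bin can wait before being flushed anyway.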

Thank you


The ConvertAvroToORC processor was using only 2 concurrent tasks, although it was configured to use 4. After restarting the cluster, ConvertAvroToORC started using 4 concurrent tasks, and throughput is now 14600 msg/sec on the cluster (7300 msg/sec on each machine).