
Suggestions to handle high volume streaming data in NiFi

Expert Contributor

Hi guys,

I have a use case where we need to load near real-time streaming data into HDFS. The incoming data is high volume, about 1,500 messages per second. I have a NiFi dataflow where a ListenTCP processor ingests the streaming data, but the requirement is to check each incoming message for the required structure. Messages from ListenTCP therefore go to a custom processor that does the structure check, and only messages with the right structure move on to a MergeContent processor and then to PutHDFS. Right now the validation processor has become a bottleneck, and the backpressure from that processor is causing ListenTCP to fall behind, so messages are queuing up at the source system (the one sending the messages).

Since the message validation processor cannot keep up with the incoming data, I'm thinking of writing the messages from ListenTCP to the file system first and then letting the validation processor pick them up from there and continue the flow. Is this the right approach, or are there better alternatives?

Thanks in advance.

1 ACCEPTED SOLUTION

Master Guru

See above comments.

The main thing is to increase your JVM memory. If you add 12-16 GB of heap you should be in good shape.

If it's a VM environment, give the node 16-32 or more cores. If that's not enough, go to multiple nodes in a cluster.

One node should easily scale to 10k messages/sec. How big are these files? Is anything failing? Are there errors in the logs?

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
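
In case it helps, the JVM heap for NiFi is set in conf/bootstrap.conf and takes effect after a restart. A minimal sketch of what a 16 GB heap would look like (java.arg.2/java.arg.3 are the default property names for the min/max heap arguments; adjust to your install):

    # conf/bootstrap.conf (restart NiFi after changing)
    java.arg.2=-Xms16g
    java.arg.3=-Xmx16g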


10 REPLIES

Expert Contributor

Thanks a lot, @Timothy Spann. I'm going to work with our admin on the JVM settings and on the number of cores we have.

The flowfiles are small, about 5 KB each or less. The ListenTCP processor is throwing this error: "Internal queue at maximum capacity, could not queue event", and messages are queuing up on the source system side. Below are the settings I have configured on ListenTCP.

[Screenshot: ListenTCP processor properties (11377-listentcp-properties.png)]
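
For anyone comparing notes, these are the ListenTCP properties that usually matter for throughput; the values here are illustrative examples, not the ones from the screenshot above:

    Max Size of Message Queue:   100000   (the internal queue behind the "maximum capacity" error)
    Max Size of Socket Buffer:   4 MB     (also requires the OS-level socket buffer limit to be raised)
    Max Batch Size:              1000     (messages written per flow file; see Bryan's reply below)
    Batching Message Delimiter:  \n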

Expert Contributor

Also, I'm in the process of getting the socket buffer for ListenTCP increased to 4 MB (the maximum the Unix admins can change it to).
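
For context, the cap the admins are changing is, assuming a Linux host, the kernel's maximum socket receive buffer; ListenTCP's "Max Size of Socket Buffer" property cannot exceed it. A sketch of the change, with 4 MB expressed in bytes:

    sudo sysctl -w net.core.rmem_max=4194304   # apply immediately
    # persist across reboots by adding the same line to /etc/sysctl.conf:
    # net.core.rmem_max=4194304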

Expert Contributor

@Timothy Spann, you were right on the money; increasing the JVM memory did the trick for me. Thanks.

Master Guru

The single biggest performance improvement for ListenTCP will be increasing the "Max Batch Size" from 1 to something like 1000, or maybe even more. The reason is that it drastically reduces the number of flow files produced by ListenTCP, which in turn drastically reduces the amount of I/O against the internal NiFi repositories.

The downside is that you won't have a single message per flow file anymore, so your validation needs to work differently. If you can change your custom processor to stream in a flow file, read each line, and write only the validated lines to the output stream, then it should work well. If the validation processor is still the bottleneck, you could increase the concurrent tasks of this processor slightly so that it keeps up with the batches coming out of ListenTCP.
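
To illustrate the streaming approach, here is a minimal sketch of what that line-by-line filtering could look like in the custom processor's onTrigger(). The isValidStructure() helper and the REL_SUCCESS relationship are stand-ins for whatever the existing processor already defines; this is an assumption about its structure, not its actual code:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.io.StreamCallback;

    // Inside the existing AbstractProcessor subclass:
    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Rewrite the flow file in one streaming pass: read each message (one per line),
        // keep only the lines that pass the structure check, and drop the rest.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                final BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
                final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    if (isValidStructure(line)) {   // hypothetical: the existing per-message check
                        writer.write(line);
                        writer.newLine();
                    }
                }
                writer.flush();
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }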

Expert Contributor

Thanks for the suggestion, @Bryan Bende; I'll try that approach in addition to the JVM and core changes that Timothy mentioned.

I was wondering: if we batch messages at ListenTCP, could we split them back into individual flowfiles with SplitText (the messages are text) before passing them to our custom processor, minimizing the rework in the custom processor? Do you see any performance hit with this idea?

Thanks.

Master Guru

You can, but you are just shifting the problem downstream from ListenTCP to SplitText. SplitText now has to produce thousands/millions of flow files that would have been coming out of ListenTCP. It is slightly better though because it gives ListenTCP a chance to keep up with the source.

It would be most efficient to avoid splitting to the individual flow files if possible. Since you are merging things together before HDFS, it shouldn't matter if you are merging many flow files with one message each, or a few flow files containing thousands of messages each.

It just comes down to whether you want to rewrite some of the logic in your custom processor.
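
For what it's worth, a size- and age-based MergeContent configuration makes the merge step indifferent to whether the incoming flow files hold one message or thousands. A sketch with illustrative values (not recommendations from this thread), aiming for HDFS-friendly file sizes:

    Merge Strategy:      Bin-Packing Algorithm
    Merge Format:        Binary Concatenation
    Minimum Group Size:  128 MB
    Maximum Group Size:  256 MB
    Max Bin Age:         5 min    (flush partially filled bins so data is not stuck waiting)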

Expert Contributor

Thanks for clarifying, @Bryan Bende; it makes sense.