Nifi cluster load balance doesn't work well

New Contributor

Hi, all,

To improve NiFi throughput, I expanded my standalone NiFi into a three-node cluster and configured my data flow accordingly. It works now, but load balancing doesn't behave as expected.

The load balancing is configured as round robin, and at the beginning all FlowFiles appear to be distributed across the three nodes. Later, however, I noticed that the cluster slowed down dramatically: only one node was still processing while the other two sat idle. I expected the unfinished FlowFiles in the queue to be redistributed to the other nodes.

192.168.1.200 is the master node and the slowest node in the cluster. My NiFi version is 1.12.1. Is this the expected behavior of round robin on this version?

I want to maximize throughput. Please tell me how I can do that.

[Attached screenshots: nodes.PNG, status.PNG]

1 ACCEPTED SOLUTION

Master Mentor

@vystar 

Considering the breaking changes that are part of Apache NiFi 2.0/1, there is considerably more work in preparing for an upgrade to that new major release.  So I would recommend upgrading to the latest offering in the Apache NiFi 1.x branch.

You'll want to review all the release notes from 1.13 to the latest release:
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes/#ReleaseNotes-Version1.12.0

You'll want to pay close attention to any mentions of components being moved to optional build profiles.  This means that these NARs and the components they contain are no longer included with the Apache NiFi download and, if needed, must be downloaded from Maven Central and manually added to NiFi.  Deprecated components still exist in the download, but will not exist in NiFi 2.x releases.

Make sure to keep a copy of your flow.xml.gz/flow.json.gz (newer releases).  Newer Apache NiFi 1.x releases load a flow.json.gz instead of the older flow.xml.gz on startup.  However, in the absence of a flow.json.gz and the presence of a flow.xml.gz, NiFi 1.x will load from the flow.xml.gz and produce the new flow.json.gz.
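
As a rough illustration of keeping that copy before an upgrade, here is a minimal Python sketch. It is not part of NiFi; the function name `backup_flow_definitions` and the example paths are hypothetical, and it assumes the flow files sit in NiFi's conf directory (the actual location is whatever `nifi.flow.configuration.file` in nifi.properties points to).

```python
import shutil
import time
from pathlib import Path


def backup_flow_definitions(conf_dir, backup_dir):
    """Copy flow.xml.gz / flow.json.gz (whichever exist) into a timestamped folder.

    Hypothetical helper: conf_dir is assumed to be the directory that holds
    the flow definition files of a stopped NiFi instance.
    """
    conf_dir = Path(conf_dir)
    target = Path(backup_dir) / time.strftime("flow-backup-%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in ("flow.xml.gz", "flow.json.gz"):
        source = conf_dir / name
        if source.is_file():
            # copy2 also preserves timestamps, which helps when comparing copies later.
            shutil.copy2(source, target / name)
            copied.append(target / name)
    return copied


# Example call (paths are assumptions, adjust to your installation):
# backup_flow_definitions("/opt/nifi/conf", "/opt/nifi-backups")
```

Run something like this on every node before upgrading, since each node keeps its own copy of the flow.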

After the upgrade, you'll still need to review your dataflows.  There are some bad practices that are now blocked by Apache NiFi, which may leave some components invalid until you manually resolve the bad configuration (such as using "primary node" execution on any processor that has an inbound connection).

As far as FlowFile distribution goes, do it as early in your dataflows as possible. Use list/fetch style processors instead of get style processors. (Example: use ListSFTP and FetchSFTP in place of GetSFTP.  This allows you to load balance the 0-byte listed FlowFiles before the content is fetched.)

Other options like Remote Process Groups can also be used. They come with some overhead, but they distribute FlowFiles based on the load of the target NiFi cluster nodes, which works well with large volumes of FlowFiles and not so well with low volumes.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt

4 REPLIES

Community Manager

@vystar, Welcome to our community! To help you get the best possible answer, I have tagged our NiFi experts @MattWho @SAMSAL @Shelton who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Mentor

@vystar 

Welcome to the community.

The first observation is that your NiFi version, 1.12.1, was released in 2020.
There have been a lot of bug fixes and improvements made to load-balanced connections since then.  I strongly encourage you to upgrade to a much newer version of Apache NiFi.

Once a NiFi connection has load balanced the FlowFiles in it, it will not redistribute them again.  So if your other two nodes receive their round-robin share and have the capacity to process it faster, the connection will not round robin the FlowFiles still queued on the one slow node a second time.  Doing so would be very expensive, as each node would be trying to redistribute already-distributed FlowFiles over and over again.
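
To make the effect concrete, here is a toy Python model (not NiFi code). It assumes three nodes with hypothetical relative processing rates of 1:2:2 and a batch of 300 FlowFiles that is round robined exactly once, with no later redistribution.

```python
from itertools import cycle


def round_robin_once(num_flowfiles, nodes):
    """Assign each FlowFile to a node exactly once, in round-robin order."""
    queues = {node: 0 for node in nodes}
    for _, node in zip(range(num_flowfiles), cycle(nodes)):
        queues[node] += 1
    return queues


def finish_times(queues, rates):
    """Time each node needs to drain its own queue when nothing is redistributed."""
    return {node: queues[node] / rates[node] for node in queues}


# Hypothetical processing rates in FlowFiles per minute (ratio 1:2:2).
rates = {"node1": 1.0, "node2": 2.0, "node3": 2.0}

queues = round_robin_once(300, list(rates))
times = finish_times(queues, rates)

for node, minutes in times.items():
    print(f"{node}: {queues[node]} FlowFiles, done after {minutes:.0f} min")
print(f"batch completes after {max(times.values()):.0f} min")
```

In this model every node gets 100 FlowFiles. The two faster nodes finish after 50 minutes and then sit idle, while the slow node keeps working until minute 100, which matches the pattern of one busy node and two idle ones.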

Maximizing throughput in NiFi often requires looking at all your dataflows, configurations, designs, and memory and CPU usage data.

Is the ExecuteStreamCommand processor the only slow point in your dataflow?
What is it executing? How is it configured?

Thank you,
Matt




New Contributor

@MattWho 

Hi Matt,

I now understand the load balancing behavior, thanks!

My data flow is a typical ETL process. The heaviest work is counting and analyzing time series data; for example, the ExecuteStreamCommand processor runs a Python script that reads time series data from a database and performs statistical analysis. I configured concurrent tasks to be close to the number of CPU cores in the cluster.

Currently, time is split roughly evenly between I/O and CPU. In the past, a single-node NiFi took about 20 minutes to process a batch of data received in half an hour, which was close to the limit and left little room for more data analysis. I've tried to solve this by using Redis and by scaling the database and the NiFi cluster. It does help, but you can see that the slowest node becomes the bottleneck of the whole system.

Based on the current load balancing logic, if the processing power of the three nodes can be estimated, say 1:2:2, is there any way to distribute FlowFiles to the nodes according to this ratio?
Also, upgrading NiFi might be a good idea. I have thousands of processors in my data flow, so which version would be best to upgrade to from NiFi 1.12.1, balancing the difficulty of the upgrade against the performance of the cluster?

BR,
Sean
