
Garbage Collection Likely Interfering w/ Connectivity Between Zookeeper and Nifi Nodes

Contributor

Hello -

I have a 3-node NiFi cluster running version 1.19.1.

 

I have noticed that communication between ZooKeeper and the NiFi nodes is consistently interrupted, and each of the NiFi nodes disconnects and then reconnects.

 

Another office is running a previous version of NiFi and is experiencing the same behavior (nodes repeatedly disconnect and then reconnect); they have zeroed in on garbage collection as the cause.

 

Based on a review of previous tickets in this system, I believe this interruption may be caused by garbage collection.

 

Can you provide any guidance on how to deal with this issue? Can garbage collection be tuned or reduced so that it has less of an impact?

 

Thank you

1 ACCEPTED SOLUTION

Super Mentor

@davehkd 
When your nodes become disconnected, a reason is logged, and the most recent cluster events are also viewable in the cluster UI within the NiFi interface. So the first question is: what reason is given for the node disconnections? Is it reporting a communication exception with ZooKeeper, or is it reporting disconnection due to lack of heartbeat (more common)?

Within a cluster, one node is elected cluster coordinator via ZooKeeper, and the nodes then send health and status heartbeats to that cluster coordinator. The default interval is every 5 seconds. The elected cluster coordinator expects to receive at least one heartbeat within 8x the configured heartbeat interval, so every 40 seconds. That is a fairly aggressive setting for NiFi clusters under heavy load or under high heap pressure caused by dataflow design. So first make sure that every node in your cluster has the same configured heartbeat interval value (mixed values will definitely cause lots of node disconnections). If the reported reason for disconnection is lack of heartbeat, raise the heartbeat interval to 30 seconds. A heartbeat would then need to be missed within a 4-minute window (8 x 30 seconds) instead of 40 seconds.
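For reference, the heartbeat interval lives in nifi.properties. A sketch of the relevant properties (names as documented in the NiFi Administration Guide; verify them against your 1.19.1 install, and set the same values on every node):

```properties
# nifi.properties -- must match on every node in the cluster
nifi.cluster.protocol.heartbeat.interval=30 sec
# number of missable heartbeats before a node is disconnected (default 8)
nifi.cluster.protocol.heartbeat.missable.max=8
```

Restart each node after changing these for the new values to take effect.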

As far as GC goes, garbage collection is triggered when Java heap utilization reaches roughly 80%. How much memory have you configured your NiFi to use? Setting it very high for no reason results in longer GC stop-the-world events. Generally NiFi is configured with 16 GB to 32 GB of heap for most use cases.
If you find yourself needing more than that, you should take a closer look at your dataflow implementations (dataflows). The NiFi heap holds many things, including the following:
- flow.json.gz is unpacked and loaded into heap memory on startup. The flow.json.gz includes everything you have added and configured via the NiFi UI (flows, controller settings, registry clients, templates, etc.). Templates are a deprecated method of creating flow snippets for reuse. They are held in heap because they are part of the flow.json.gz, even though they are not part of any active dataflow. Downloading them for external storage and then deleting them from within NiFi will reduce heap usage.
- Users and groups synced from LDAP, if you are using the ldap-user-group-provider. Make sure you have configured filters on this provider so that you limit the number of groups and users to only those that will actually be accessing your NiFi.
- FlowFiles are what you see queued between processor components in the UI. A FlowFile consists of metadata/attributes about the content. NiFi has built-in swap settings for how many FlowFiles can exist in a given queue before they start swapping to disk (20,000, set via nifi.queue.swap.threshold in nifi.properties). Swap files always hold 10,000 FlowFiles. By default, a connection has a backpressure object threshold of 10,000, which means a connection is unlikely to generate a swap file, because with these defaults it is unlikely to reach the swap threshold (connection queue limits are soft limits). So if you have lots of connections with queued FlowFiles, you will have more heap usage. Generally speaking, a FlowFile's default metadata attributes amount to very little heap usage, but users can write whatever they want to FlowFile attributes. If you are extracting and writing large amounts of content to FlowFile attributes in your dataflow(s), you'll have high heap usage and should question why you are doing this.

- NiFi processor components - Some processors have resource considerations that users should take into account when using them. The embedded documentation within your NiFi has a resource-considerations section under each processor's docs. Check whether any processors you are using carry a heap/memory consideration.
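To put the FlowFile-attribute point in perspective, here is a rough back-of-the-envelope sketch (this is not a NiFi API; the per-character cost and the per-entry overhead are assumptions about typical JVM string storage) of how queued attributes can add up:

```python
# Illustrative only: estimates heap held by queued FlowFile attributes.
# Assumes ~2 bytes per character (Java UTF-16 backing arrays) plus a
# guessed ~50 bytes of per-entry object overhead; real JVM footprint
# varies by version, flags, and string contents.

def estimated_attr_heap_bytes(num_flowfiles: int,
                              attrs_per_flowfile: int,
                              avg_key_len: int,
                              avg_value_len: int,
                              per_entry_overhead: int = 50) -> int:
    per_entry = (avg_key_len + avg_value_len) * 2 + per_entry_overhead
    return num_flowfiles * attrs_per_flowfile * per_entry

# 100,000 queued FlowFiles, each carrying one 10 KB extracted-content
# attribute, versus the same FlowFiles with one small routing attribute:
big = estimated_attr_heap_bytes(100_000, 1, 16, 10_000)
small = estimated_attr_heap_bytes(100_000, 1, 16, 32)
print(f"large attribute: {big / 1024**2:.0f} MiB")
print(f"small attribute: {small / 1024**2:.1f} MiB")
```

Even with these crude assumptions, the gap (gigabytes versus a few megabytes) shows why writing extracted content into attributes is the first place to look when heap pressure drives GC pauses.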

Often heap usage can be reduced through dataflow design modifications. I hope these details help you dig into your heap usage and make adjustments that improve your cluster stability.
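If you do decide to adjust the heap itself, it is set in conf/bootstrap.conf. A minimal sketch, assuming the stock argument numbering NiFi ships with (the 16g figure is only an example starting point, not a recommendation for every workload):

```properties
# conf/bootstrap.conf
# Keep initial (-Xms) and max (-Xmx) heap equal to avoid resize pauses.
java.arg.2=-Xms16g
java.arg.3=-Xmx16g
```

A restart is required, and each node should use the same heap settings.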

If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click Accept as Solution below each response that helped.

Thank you,

Matt




