Support Questions

Find answers, ask questions, and share your expertise

NiFi high jvm heap utilization on primary node

Rising Star

Hi there,

Maybe you can shed some light on an issue we’ve been having since we went live with one of our flows, which uses a lot of processors running on the primary node.

We are using a 2-node cluster where we run a couple of GetSQS processors (timer driven, 1 min) and ExecuteSQL processors (cron scheduling, spread out during the day), both on the primary node. We do this so only one task runs, resulting in just one trigger FlowFile. Otherwise we get 2 SQS events for the same event and 2 ExecuteSQL queries.

What we have been seeing is that the JVM heap utilization on the primary node is above 70%. This increased gradually over the past few days from 25%, so it didn’t jump straight to 70%. The secondary node, on the other hand, is stable below 10%. So there is quite clearly a correlation between the processors running only on the primary node and the heap used.

Our questions are:
1. When a primary node processor triggers, will the FlowFile be processed by the following processors in the flow only on the primary node? Or can this also be done by the secondary? Other processors are running on all nodes.

2. Could ExecuteSQL on cron cause high heap utilization? The cron schedules are spread out during the day and volumes are low: around 1,300 queries a day.

3. Could GetSQS cause high heap utilization? We get a lot of events; later in the flow we filter out the ones we need and terminate those not needed. That’s 8k events spread out during the day, of which we only process about 3k. We are working on fine-tuning the SQS events so we only receive the ones that really need to be processed.

Hopefully you can advise on the challenge we are having.

By the way, when restarting the primary node, the secondary becomes primary and we see the same heap issue there.

Thank you in advance.

Kind regards,

Dave

2 ACCEPTED SOLUTIONS

Master Mentor

@Dave0x1 

Some general related information:

1. Java uses heap as needed, but for efficiency it does not run Garbage Collection (GC) to free unused heap until allocated heap usage typically exceeds 80%. So it is not unexpected to see 70% heap utilization even once data has been processed out of your dataflow(s); there is nothing unexpected or alarming about that 70% figure in itself. You probably want to look at the GC events (partial and full GC) to see how many there are and how often they happen. What are your current Xms and Xmx heap settings for your NiFi? Heap is requested during execution of NiFi components; NiFi does not manage the heap or its clean-up, that is a process handled by Java.
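For reference, the heap settings live in NiFi's conf/bootstrap.conf, and you can add GC logging there too so the GC events are easy to count. A sketch of the relevant lines (the java.arg numbers and sizes here are illustrative; match whatever numbering and sizing your own bootstrap.conf already uses):

```properties
# conf/bootstrap.conf -- heap sizing (Xms = initial, Xmx = maximum)
java.arg.2=-Xms4g
java.arg.3=-Xmx4g

# Optional: GC logging so you can count partial/full GC events over time
# (Java 11+ unified-logging syntax; on Java 8 use -XX:+PrintGCDetails -Xloggc:gc.log)
java.arg.20=-Xlog:gc*:file=./logs/gc.log:time,uptime:filecount=5,filesize=10m
```

After editing bootstrap.conf, a NiFi restart is needed for the new JVM arguments to take effect.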

2. When a component is configured with "Primary node" execution, it will only be scheduled on the currently elected primary node. The FlowFiles generated will then only exist on the primary node unless you design redistribution into your dataflow(s) (typically done via load balance configuration on the downstream connection of the primary-node-execution processor) so those FlowFiles are spread across all your nodes for further downstream processing. Even with distribution, there will be some deviation in resource usage since you are still doing some additional work on just the primary node.

3. The primary node and cluster coordinator are elected via Zookeeper (ZK) and can change. Commonly there is some event that triggers a change (the current primary node stops heart-beating to ZK, disconnects from the cluster, is restarted, is shut down, etc.). You could look at the individual node events in the cluster UI to see when the primary node changed and whether it aligns with any of these event types. But even a primary node change would not shift heap usage to another node.

While I see nothing of concern in what was shared in your post, the things you want to watch for are memory-related log messages. Java out-of-memory (OOM) errors indicate a problem that must be addressed. OOM can happen when your designed dataflow(s) try to consume more memory than is allocated to your JVM, or it is a sign that GC cannot keep up with the memory demand, or that heap usage exceeded 80% utilization and a GC run was unable to free enough unused heap to get back below that 80%. Even short of running out of memory, this indicates your dataflow(s) use a lot of active heap (common offenders are merge- or split-based processors with an excessively high number of FlowFiles being merged in a single transaction, or a single split producing an excessively large number of output FlowFiles in a single transaction). The embedded documentation (usage docs) for the various components indicates whether a component has the potential for high heap or high CPU usage in its "System Resource Considerations" section.
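As a concrete starting point, those memory errors show up in logs/nifi-app.log. A minimal shell sketch for spotting them (the sample log lines below are fabricated purely so the commands run as-is; point LOG at your real nifi-app.log instead):

```shell
#!/bin/sh
# Sketch: scan nifi-app.log for the memory problems described above.
LOG="logs/nifi-app.log"

# Fabricated sample log so this sketch is self-contained; on a real node,
# skip this block and set LOG to your actual nifi-app.log.
mkdir -p logs
cat > "$LOG" <<'EOF'
2024-06-14 10:00:01,101 INFO  [Timer-Driven Process Thread-4] o.a.n.p.standard.MergeContent merged 10000 FlowFiles
2024-06-14 10:02:17,542 ERROR [Timer-Driven Process Thread-9] o.a.n.p.standard.MergeContent java.lang.OutOfMemoryError: Java heap space
EOF

# Count OOM events -- any non-zero count must be addressed
grep -c "OutOfMemoryError" "$LOG"

# Show each occurrence with its line number for further digging
grep -n "OutOfMemoryError" "$LOG"
```

On a live node you could also watch GC activity directly with `jstat -gcutil <nifi-pid> 5s` to see how often partial and full GC runs happen.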

Here is an example from MergeContent:

MattWho_0-1718371912131.png

Hope you find this information useful for your query.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt


Master Mentor

@Dave0x1 
Typically the MergeContent processor will utilize a lot of heap when the number of FlowFiles being merged in a single execution is very high and/or the FlowFiles' attributes are very large. While FlowFiles queued in a connection have their attributes/metadata held in NiFi heap, there is a swap threshold at which NiFi swaps FlowFile attributes to disk. When it comes to MergeContent, FlowFiles are allocated to bins (they will still show in the inbound connection count), and FlowFiles allocated to bin(s) cannot be swapped. So if you set the min/max number of FlowFiles or min/max size to a large value, it results in large amounts of heap usage.
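To get a feel for the scale, here is a rough back-of-envelope sketch (plain Python, not a NiFi API; all numbers are illustrative assumptions) of the heap held by the attributes of binned FlowFiles:

```python
def binned_attribute_heap_bytes(flowfiles_per_bin: int,
                                active_bins: int,
                                avg_attr_bytes_per_flowfile: int) -> int:
    """FlowFiles allocated to MergeContent bins cannot be swapped to disk,
    so their attributes stay on-heap until the bin is merged."""
    return flowfiles_per_bin * active_bins * avg_attr_bytes_per_flowfile

# Illustrative: 100,000 FlowFiles per bin, 5 active bins, ~2 KB of attributes each
heap = binned_attribute_heap_bytes(100_000, 5, 2_048)
print(round(heap / 1024**3, 2))  # prints 0.95 -- nearly a gigabyte of heap
```

The point of the arithmetic: heap grows with the product of bin size, bin count, and attribute size, so any one of those being large can be enough to cause trouble.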

Note: FlowFile content is not held in heap by MergeContent.

So the way to create very large merged files while keeping heap usage lower is to chain multiple MergeContent processors together in series: you merge a batch of FlowFiles in the first MergeContent and then merge those into a larger merged FlowFile in a second MergeContent.
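The chained-merge arithmetic can be sketched like this (illustrative numbers, not NiFi configuration syntax):

```python
# Two-stage merge: each stage only holds a modest bin on-heap at a time,
# yet the final output covers the product of the two bin sizes.
stage1_bin = 1_000   # FlowFiles per bin in the first MergeContent
stage2_bin = 100     # stage-1 outputs per bin in the second MergeContent

originals_per_final_file = stage1_bin * stage2_bin   # 100,000 originals
peak_binned_flowfiles = max(stage1_bin, stage2_bin)  # only 1,000 binned at once

print(originals_per_final_file, peak_binned_flowfiles)  # prints: 100000 1000
```

So a two-stage chain reaches 100,000 originals per final file while never binning more than 1,000 FlowFiles' worth of attributes at once.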

Also, to help minimize heap usage, be mindful of extracting content into FlowFile attributes or generating FlowFile attributes with large values.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.

Thank you,
Matt




8 REPLIES

Rising Star

Here’s some extra info:

JVM: high utilization is on the primary node

Dave0x1_0-1718363140189.png

core load is normal

Dave0x1_1-1718363183897.png

 

Rising Star

Correction: the primary node switched itself and the high heap utilization stayed on the original node. So the problem doesn’t seem to follow the current primary node.

Super Guru

Hi @Dave0x1 ,

Not sure if this is related, but if you are using release 2.0.0-M1/M2 and deploying Python extensions, please see this: https://community.cloudera.com/t5/Support-Questions/Apache-Nifi-Release-2-0-M1-amp-M2-High-CPU-Utili...

 

Rising Star

Hi there @SAMSAL , we are using version 1.24 and no Python scripting. We’re using all standard processors: UpdateAttribute, RouteOnAttribute, LogAttribute, and some specials like GetSQS, ExecuteSQL, LookupAttribute, plus some Groovy scripting for creating custom metrics.

We’ve just restarted the node with high heap. It is now averaging 10%. We’ll be monitoring progress this weekend and get back with the results on Monday. Hopefully the restart helped.

Thank you for now and have a nice weekend 🙂 


Rising Star

Hi Matt, thank you for the extensive reply. This is a lot to think about. We’ll go through this Monday morning with the team to see if we can get those metrics on our dashboards 🤓 There was a MergeContent temporarily on the canvas to gather analysis data, which might have caused this issue. That template was already removed yesterday.

@SAMSAL @MattWho thank you for your replies. I’ll get back to you on Monday with our findings.

Rising Star

Update from our side: from the looks of things, stopping and removing the MergeContent used to create CSV files has solved the JVM heap issue. We will watch the MEMORY resource consideration on processors when implementing new things.

Thank you all for the great advice and fast replies!
