
NiFi node offloading and content storage

Explorer

Hi, everyone.

We have a three-node Apache NiFi 1.18.0 cluster.
Content archiving is enabled (15 minutes deep), and the content repository is on its own disk.
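
For reference, the relevant nifi.properties settings on our nodes look roughly like this (paths and values simplified):

nifi.content.repository.directory.default=/data/nifi/content_repository
nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=15 mins
nifi.content.repository.archive.max.usage.percentage=50%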

Today we noticed that the 2nd node's content repository was using a lot of disk space (>80%). I checked the archive files and they occupy only about 500 MB; the rest of the space was used by current content data.

The other nodes were using their content repos as usual: about 25%.

Because we started receiving the following errors and the 2nd node's performance was degrading, I began offloading the 2nd node (~13:10 on the graph below):

Unable to write flowfile content to content repository container repo0 due to archive file size constraints; waiting for archive cleanup. Total number of files currently archived = 63

After the 2nd node was successfully offloaded, I connected it to the cluster again and now it works fine: data is being processed, and content repo usage is back to normal.

But while offloading was in progress, I tracked the disk space utilization graphs and noticed that the 1st and 3rd nodes did not gain disk usage proportional to what the 2nd node freed:

[Screenshot: disk space utilization graphs for all three nodes]

As you can see in the screenshot, the 2nd node freed about 160 GB of disk space, but the 1st gained only about +15 GB and the 3rd about +6 GB.

Did we lose data? Or did the other nodes simply process the data from the 2nd node in place as it arrived during offloading?

 

1 ACCEPTED SOLUTION

Master Mentor

@asand3r 

It is important to understand what offloading does and does not do.

  • Offloading moves queued NiFi FlowFiles from the offloading node to the other nodes in the NiFi cluster. A NiFi FlowFile consists of two parts: FlowFile attributes/metadata (stored in the flowfile_repository) and FlowFile content (stored within claims inside the content_repository). A single content claim can contain the content for one to many individual FlowFiles.
  • Offloading does not move content claims (what you see in the content_repository) from one node to the other. It moves FlowFiles, which means it moves the FlowFile attributes and that FlowFile's content into new FlowFiles on the node they are transferred to.

I suspect that the ~21 GB of content your other nodes gained (~15 GB + ~6 GB) equated to the content of the FlowFiles actually queued on node 2. The cumulative size of the active FlowFiles queued on a NiFi node does not equal the cumulative size of active content claims in the content_repository. Every content claim has a claimant count, which tracks how many FlowFiles have a claim against some bytes of data in that content claim. As a FlowFile reaches a point of termination in a NiFi dataflow, the claimant count on its content claim is decremented. Only once the claimant count reaches 0 does that content claim become eligible to be moved to an archive sub-directory and considered for removal under the archive configurations set in the nifi.properties file. So it is possible for a 1-byte FlowFile still queued on the NiFi canvas to prevent a much larger content claim from being archived. So I do not suspect you lost any data during offload.
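
To make the claimant-count mechanics concrete, here is a minimal, hypothetical Python model (the class and method names are illustrative only, not NiFi's actual internals):

# Simplified, illustrative model of a NiFi content claim and its claimant count.
class ContentClaim:
    def __init__(self, claim_id):
        self.claim_id = claim_id
        self.size_bytes = 0       # total bytes of content appended to this claim file
        self.claimant_count = 0   # how many live FlowFiles reference this claim

    def add_flowfile_content(self, num_bytes):
        # A FlowFile's content is appended; the FlowFile records (claim, offset, length).
        offset = self.size_bytes
        self.size_bytes += num_bytes
        self.claimant_count += 1
        return offset

    def release(self):
        # Called when a referencing FlowFile terminates.
        self.claimant_count -= 1

    @property
    def archivable(self):
        # Only a claim no FlowFile references can move to archive/ and be cleaned up.
        return self.claimant_count == 0

claim = ContentClaim("repo0/claim-0001")
claim.add_flowfile_content(1024**3)   # a ~1 GB FlowFile
claim.add_flowfile_content(1)         # a 1-byte FlowFile left queued "forever"
claim.release()                       # the large FlowFile terminates
print(claim.archivable)               # False: the 1-byte FlowFile still pins ~1 GB

Terminating the large FlowFile alone is not enough; the claim stays pinned until every FlowFile referencing it has terminated.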

1. Are you working with many very small FlowFiles?
2. Are you leaving FlowFiles queued in connections that never get processed through to termination?

You are also correct that your other nodes would have begun processing the transferred FlowFiles as they were being placed in the queues on those nodes.

The large difference in FlowFile count and disk usage between nodes is often the direct result of dataflow design choices. I suspect node 2 was your elected primary node prior to the disconnect and offload operations. The elected primary node is the only node that will schedule processors configured for "primary node" only execution. So unless you use a load-balanced configuration on the connection following a "primary node" only scheduled ingest processor, all those FlowFiles will be processed only on the primary node by the downstream processors in the dataflow.

Also keep in mind that you are using a fairly old version of Apache NiFi. The latest and final release of the Apache NiFi 1.x line is 1.28, so I recommend upgrading to that latest 1.28 version and starting to plan for an upgrade to the NiFi 2.x release line.

Apache NiFi has also released the new 2.x major version. There is a fair amount of work that must be done before you can migrate from Apache NiFi 1.28 to a 2.x version.

Please help our community grow. If any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on the ones that helped.

Thank you,
Matt


4 REPLIES 4


Explorer

@MattWho, thanks for such a detailed answer.

You asked whether we work with very small files. I can't say for sure, but it probably is so: we often read data from Kafka, and the Data Provenance page shows content sizes of just a few kilobytes, almost always less than 200 KB.

As for the second question, about leaving FlowFiles queued in connections: yes, that is the case.

Also, thanks for drawing our attention to the software version; we are planning updates. 😃

Master Mentor

@asand3r 

I have seen many times where users connect the "Failure" relationship to another component that is not running, allowing failures to queue up unaddressed. These "forever" queued FlowFiles can keep content claims from being moved to archive and thus result in high content_repository usage.

Another common observation is test flows left on the canvas with queued FlowFiles in them. The many FlowFiles that can be part of one content claim file can come from anywhere in any dataflow on the canvas.
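
If it helps with hunting these down, here is a rough sketch that uses the NiFi REST API status endpoint to list connections that still have queued FlowFiles. Treat it as a starting point: the endpoint exists in NiFi 1.x, but the exact JSON field names and the authentication handling below are assumptions you should verify against your version.

# Hypothetical helper: list connections with queued FlowFiles via the REST API.
# Assumes anonymous access; add your own Authorization header if you use auth.
import requests

NIFI = "https://nifi-host:8443/nifi-api"   # adjust to your environment

def queued_connections():
    status = requests.get(
        f"{NIFI}/flow/process-groups/root/status",
        params={"recursive": "true"},
        verify=False,                      # lab use only; keep TLS checks in prod
    ).json()["processGroupStatus"]

    def walk(snapshot):
        # Field names assumed from the 1.x ConnectionStatusSnapshot DTOs.
        for conn in snapshot.get("connectionStatusSnapshots", []):
            s = conn["connectionStatusSnapshot"]
            if s.get("flowFilesQueued", 0) > 0:
                yield s.get("name"), s["flowFilesQueued"], s.get("queuedSize")
        for pg in snapshot.get("processGroupStatusSnapshots", []):
            yield from walk(pg["processGroupStatusSnapshot"])

    yield from walk(status["aggregateSnapshot"])

for name, count, size in queued_connections():
    print(f"{name}: {count} FlowFiles queued ({size})")

Connections that show queued FlowFiles day after day are the ones worth emptying or routing through to termination.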

If you found any of my responses helpful, please take a moment to click "Accept as Solution" on them to help our community members.

Thanks,
Matt

avatar
Explorer

Thank you, @MattWho. It's a bit clearer to me now.