Created 03-06-2025 05:56 AM
Hi All,
We configured a three-node NiFi cluster (node1, node2, and node3) with an external ZooKeeper ensemble (z1, z2, and z3) on Azure virtual machines.
We now need to add storage resources to these NiFi nodes. We are considering the options below, but need your help choosing the best fit:
1. Azure Blob Storage
2. Azure NFS Storage
3. Azure managed disks (Standard HDD, Standard SSD)
Our use case is to store the following NiFi repositories:
a. NiFi provenance repository
b. NiFi content repository
c. NiFi FlowFile repository
Please suggest the best fit for this case.
Data is expected to grow 3 TB to 5 TB.
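For reference, these three repositories map to local directories configured in nifi.properties on each node; a minimal sketch showing the standard property keys with their stock default values:

    # nifi.properties (per node) - repository locations
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository

Whichever storage option is chosen would be mounted at these paths on each node.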
Created 03-06-2025 06:27 AM
@vg27
When it comes to NiFi's content, FlowFile, and provenance repositories, it is about performance.
Since all three of these repos have constant I/O, NFS storage or Standard HDD would not be my first recommendation (NFS relies on network I/O, and Standard HDDs will likely create a performance bottleneck at your data volumes). I am not familiar enough with the performance characteristics of Azure Blob Storage to make a recommendation there. SSDs are a good choice, but make sure there is data protection for your content and FlowFile repositories; you don't want a disk failure to result in data loss.
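If a single disk becomes an I/O bottleneck, NiFi can also spread the content and provenance repositories across multiple disks by defining additional directory properties; a sketch, assuming hypothetical /disk1 and /disk2 mount points:

    # nifi.properties - striping repositories across disks
    # (the repo1/repo2 suffixes and /diskN paths are illustrative)
    nifi.content.repository.directory.repo1=/disk1/content_repository
    nifi.content.repository.directory.repo2=/disk2/content_repository
    nifi.provenance.repository.directory.prov1=/disk1/provenance_repository
    nifi.provenance.repository.directory.prov2=/disk2/provenance_repository

NiFi spreads content claims across all configured content repository directories, so aggregate throughput scales with the number of disks.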
I am not clear on this: "Data expected to grow 3TB to 5TB."
Is that per hour, per day, etc.? Is it spread evenly over the day, or does it arrive at specific heavy times each day? Take this into consideration when selecting for storage throughput performance.
Please help our community grow and thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-07-2025 08:45 AM
Hi @MattWho,
Thanks for the update. We are going to use Premium SSD as per your suggestion.
Now coming to the storage side, we are going to attach a 500 GB SSD to each of the three NiFi cluster nodes, partitioned as follows (see the sketch after this list):
Provenance repository: 300 GB partition
Content repository: 100 GB partition
FlowFile repository: 100 GB partition
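A minimal sketch of how those partitions could be wired up in nifi.properties on each node, assuming hypothetical mount points /mnt/prov, /mnt/content, and /mnt/flowfile:

    # nifi.properties - same keys on every node; each node writes only to its own local disks
    nifi.provenance.repository.directory.default=/mnt/prov/provenance_repository
    nifi.content.repository.directory.default=/mnt/content/content_repository
    nifi.flowfile.repository.directory=/mnt/flowfile/flowfile_repository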
Assuming nifinode1 is the cluster coordinator and data is written to node1 for one week, then the cluster is restarted in the second week and nifinode2 is elected cluster coordinator (and node3 the week after), how will the data on node1 be read and retrieved in the next cycle?
Do we need to enable replication between the SSDs of the three virtual machines, or is there some other process we should use?
Can you please suggest?
Created 03-11-2025 05:58 AM
@vg27
Do you plan on retaining a lot of provenance data?
I don't know your expected daily volumes/sizes, but 100 GB for the NiFi content repository seems a bit small.
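For sizing those partitions, the retention properties in nifi.properties are worth reviewing; a sketch with the standard keys and purely illustrative values to tune against your actual volumes:

    # Provenance events are aged out by time or size, whichever limit is hit first
    nifi.provenance.repository.max.storage.time=30 days
    nifi.provenance.repository.max.storage.size=250 GB
    # Archived content is purged by retention period or disk-usage percentage
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%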
Since each node runs its own copy of the flow.json.gz and has its own repositories, you can't replicate the repositories between nodes. In your scenario the primary node change happens when your NiFi cluster restarts, but in reality a primary node change can happen at other times as well. The cluster coordinator has nothing to do with which node runs the "primary node only" scheduled processor components.
I am also trying to understand why you have a NiFi cluster setup if you only ever intend to have the primary node do all the work.
I really don't follow your use case here.
Is your plan to ingest data on NiFi's primary node and hold it within the dataflows built on the NiFi canvas? How do you plan to do that (have it all queue up in some connection until someone starts the downstream processor)?
When NiFi is started, it loads the flow.json.gz into NiFi heap memory, loads the FlowFile repository local to the node into heap memory (except any swap files), and each node continues processing those FlowFiles through the dataflows. So a change in which node is elected the primary node has no impact on the above. A change in elected primary node only impacts the few processors (only those with no inbound connection can be configured for primary node only scheduling) that are configured for primary node only scheduling.

So let's say node1 is the current primary node and has ingested data into FlowFiles that are now queued in a connection. Then some event occurs that results in node2 being elected as primary node. All the FlowFiles originally ingested by node1 are still on node1 and continue to be processed through the dataflows on node1. Node2 is now the primary node and thus the only node now scheduling the "primary node" scheduled processors; the FlowFiles they ingest from then on are processed through the dataflows on node2.
Please help our community grow and thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-11-2025 09:36 PM
Hi Matt,
Thanks for all your valuable inputs. We configured SAN disks for the NiFi cluster based on the above recommendations.
Regards
Girish V G