Created 03-06-2025 05:56 AM
Hi All,
We configured a three-node NiFi cluster (node1, node2, and node3) with an external ZooKeeper ensemble (z1, z2, and z3) on Azure virtual machines.
We now need to add storage resources to these NiFi nodes. We are considering the options below, but need your help choosing the best fit:
1. Azure Blob Storage
2. Azure NFS Storage
3. Azure managed disks (Standard HDD, Standard SSD)
Our use case is to store the following NiFi repositories:
a. NiFi provenance repository
b. NiFi content repository
c. NiFi FlowFile repository
Please suggest the best fit for this case.
Data is expected to grow 3 TB to 5 TB.
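For reference, these three repositories map to local directories configured in nifi.properties on each node; a minimal sketch showing the standard property keys with their stock default values:

    # nifi.properties (per node) - repository locations
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository

Whichever storage option is chosen would be mounted at these paths on each node.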
Created 03-06-2025 06:27 AM
@vg27
When it comes to NiFi's content, FlowFile, and provenance repositories, it is about performance.
Since all three of these repos have constant I/O, NFS storage or Standard HDD would not be my first recommendation (NFS relies on network I/O, and Standard HDDs will likely create a performance bottleneck at your data volumes). I am not familiar enough with the performance characteristics of Azure Blob Storage to make a recommendation there. SSDs are a good choice, but make sure there is data protection for your content and FlowFile repositories; you don't want a disk failure to result in data loss.
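If a single disk becomes an I/O bottleneck, NiFi can also spread the content and provenance repositories across multiple disks by defining additional directory properties; a sketch, assuming hypothetical /disk1 and /disk2 mount points:

    # nifi.properties - striping repositories across disks
    # (the repo1/repo2 suffixes and /diskN paths are illustrative)
    nifi.content.repository.directory.repo1=/disk1/content_repository
    nifi.content.repository.directory.repo2=/disk2/content_repository
    nifi.provenance.repository.directory.prov1=/disk1/provenance_repository
    nifi.provenance.repository.directory.prov2=/disk2/provenance_repository

NiFi spreads content claims across all configured content repository directories, so aggregate throughput scales with the number of disks.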
I am not clear on this: "Data expected to grow 3TB to 5TB."
Is that per hour, per day, etc.? Is it spread evenly over the day, or does it arrive at specific heavy times each day? Take this into consideration when selecting for storage throughput performance.
Please help our community grow and thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-07-2025 08:45 AM
Hi @MattWho,
Thanks for the update. We are going to use Premium SSD as per your suggestion.
Now coming to the storage side, we are going to attach a 500 GB SSD to each of the three NiFi cluster nodes, partitioned as follows (see the sketch after this list):
Provenance repository: 300 GB partition
Content repository: 100 GB partition
FlowFile repository: 100 GB partition
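A minimal sketch of how those partitions could be wired up in nifi.properties on each node, assuming hypothetical mount points /mnt/prov, /mnt/content, and /mnt/flowfile:

    # nifi.properties - same keys on every node; each node writes only to its own local disks
    nifi.provenance.repository.directory.default=/mnt/prov/provenance_repository
    nifi.content.repository.directory.default=/mnt/content/content_repository
    nifi.flowfile.repository.directory=/mnt/flowfile/flowfile_repository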
Assuming nifinode1 is the cluster coordinator and data is written to node1 for one week, then the cluster is restarted in the second week and nifinode2 is elected cluster coordinator (and node3 the week after), how will the data on node1 be read and retrieved in the next cycle?
Do we need to enable replication between the SSDs of the three virtual machines, or is there some other process we should use?
Can you please suggest?
Created 03-11-2025 05:58 AM
@vg27
Do you plan on retaining a lot of provenance data?
I don't know your expected daily volumes/sizes, but 100 GB for the NiFi content repository seems a bit small.
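For sizing those partitions, the retention properties in nifi.properties are worth reviewing; a sketch with the standard keys and purely illustrative values to tune against your actual volumes:

    # Provenance events are aged out by time or size, whichever limit is hit first
    nifi.provenance.repository.max.storage.time=30 days
    nifi.provenance.repository.max.storage.size=250 GB
    # Archived content is purged by retention period or disk-usage percentage
    nifi.content.repository.archive.enabled=true
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%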
Since each node runs its own copy of the flow.json.gz and has its own repositories, you can't replicate the repositories between nodes. In your scenario the primary node change happens when your NiFi cluster restarts, but in reality a primary node change can happen at other times as well. The cluster coordinator has nothing to do with which node runs the "primary node only" scheduled processor components.
I am also trying to understand why you have a NiFi cluster setup if you only ever intend to have the primary node do all the work.
I really don't follow your use case here.
Is your plan to ingest data on NiFi's primary node and hold it within the dataflows built on the NiFi canvas? How do you plan to do that (have it all queue up in some connection until someone starts the downstream processor)?
When NiFi is started, it loads the flow.json.gz into NiFi heap memory, loads the FlowFile repository local to the node into heap memory (except any swap files), and each node continues processing those FlowFiles through the dataflows. So a change in which node is elected the primary node has no impact on the above. A change in elected primary node only impacts the few processors (only those with no inbound connection can be configured for primary node only scheduling) that are configured for primary node only scheduling.

So let's say node1 is the current primary node and has ingested data into FlowFiles that are now queued in a connection. Then some event occurs that results in node2 being elected as primary node. All the FlowFiles originally ingested by node1 are still on node1 and continue to be processed through the dataflows on node1. Node2 is now the primary node and thus the only node now scheduling the "primary node" scheduled processors; the FlowFiles they ingest from then on are processed through the dataflows on node2.
Please help our community grow and thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped.
Thank you,
Matt
Created 03-11-2025 09:36 PM
Hi Matt,
Thanks for all your valuable inputs. We configured SAN disks for the NiFi cluster based on the above recommendations.
Regards
Girish V G