We want to deploy NiFi in cluster mode in our production environment and would like to know best practices and guidelines for using NiFi in a large-scale deployment where there are multiple sources and the total traffic volume is in the range of 300 TB. Can someone advise what NiFi configuration, disk I/O, and memory would be needed to handle that kind of capacity?
Here is a sizing guide, which seems to address exactly your questions:
Still, I personally wouldn't start with 8 GB of RAM per node but with at least 16 GB (2 GB per core). In any case, you will have to be clear on the throughput needed (Gb/sec), not only on the overall volume.
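As a back-of-the-envelope check (a sketch only — it assumes the 300 TB arrives evenly over one day, which is rarely true; size for your peak rate, not the average):

```python
# Rough throughput estimate: convert a daily ingest volume into sustained Gbit/s.
# Assumption: volume is spread evenly over the window; real traffic is bursty.

def sustained_gbit_per_sec(volume_tb: float, window_hours: float = 24.0) -> float:
    bits = volume_tb * 10**12 * 8          # decimal TB -> bits
    seconds = window_hours * 3600
    return bits / seconds / 10**9          # bits/s -> Gbit/s

print(round(sustained_gbit_per_sec(300), 1))  # -> 27.8 (Gbit/s average)
```

So even as a flat average, 300 TB/day is close to 28 Gbit/s across the cluster, which is why the per-node throughput target matters as much as the total volume.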
Both answers already provided are good. Let me explain why:
NiFi is a flow-based programming tool. While NiFi's core itself requires very little CPU and memory to run, every user of NiFi builds their own unique dataflow(s) on the NiFi canvas, and each of those will have its own unique resource impact/requirements.
Even knowing exactly which processors you will be using, how many of each, and the volume/rate of data passing through each would not allow anyone to calculate the exact resource footprint of your dataflow(s).
The configuration of these components (processors, connections, controller services, reporting tasks, etc.) and the core (connection swap thresholds, status history retention, etc.) will also impact resource utilization.
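On the disk I/O side, one common starting point (a sketch of a `nifi.properties` excerpt — the mount points here are hypothetical, substitute your own) is to place the three repositories on separate physical disks so content, flowfile, and provenance I/O don't compete:

```properties
# nifi.properties (excerpt) -- /disk1..3 are hypothetical mount points
nifi.flowfile.repository.directory=/disk1/flowfile_repository
nifi.content.repository.directory.default=/disk2/content_repository
nifi.provenance.repository.directory.default=/disk3/provenance_repository
```

At 300 TB of traffic, the content repository in particular should sit on the fastest storage you can give it.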
It is best to design your dataflow(s) and test the resource impact yourself. NiFi provides processors such as "GenerateFlowFile" that can help you test your flows under realistic load volumes.