Hardware recommendation for HDF/NiFi cluster

Expert Contributor

Hi All,

- This seems like an obvious question, so forgive me if it is redundant: what hardware configurations would be suitable for setting up HDF 2.x on VMs for an 8-node cluster?

- I found an old document which does help: link

- It seems like NiFi might need to be weighted more toward cores than RAM. My current setup of 12GB RAM and 6 cores per node is not working (note: the master has 6GB RAM, which seems to be a bottleneck).

- After going through the link, I am thinking of the following, but I'm not sure if it is optimal:

24 cores, 20GB RAM, 250-500GB disk.

Does this seem like an optimal configuration (considering the ratio: more cores vs. more RAM)? To give more context, I currently don't have any specific throughput requirements, and I am using NiFi for some batch jobs, log processing, etc. However, I do want a stable cluster setup that we could also use in the future if usage increases.

Thanks

Obaid

1 ACCEPTED SOLUTION

Super Mentor
@Obaid Salikeen

The hardware requirements, in terms of CPU count and RAM, are very dependent upon the nature of the dataflow you design and implement, as well as the throughput you want to achieve. While the "Hardware Sizing Recommendations" you linked is a good starting point, I do believe the memory allocations suggested there are low, but again that is subject to your dataflow design intentions. Some processor components are CPU-intensive, disk I/O-intensive, memory-intensive, or all of the above. Some of those components may even exhibit different load characteristics depending on how they are configured.

Have you considered using the latest HDF 2.0.1? It gets rid of the NCM (master) in your cluster. If not, the NCM for an 8-node cluster will likely need more memory for heap than your nodes.
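
For reference, a zero-master NiFi 1.x / HDF 2.x cluster is configured per node in conf/nifi.properties. Here is a minimal sketch; the hostname, port, and ZooKeeper connect string are placeholders for illustration only, not values from this thread:

    # conf/nifi.properties (per node) - zero-master clustering, illustrative values only
    nifi.cluster.is.node=true
    # this node's own hostname (placeholder)
    nifi.cluster.node.address=nifi-node1.example.com
    nifi.cluster.node.protocol.port=11443
    # ZooKeeper quorum used for cluster coordination and primary node election (placeholder)
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181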

My suggestion would be to do your development work based upon the recommendations above (with additional memory, min 8-16 GB). After you have a designed and tested flow, you will be able to see the impact on your hardware and adjust accordingly for your testing phase before production.
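
To make the memory suggestion concrete, the NiFi JVM heap is set in conf/bootstrap.conf. The sketch below uses 8 GB purely as an example at the lower end of that range; size it to your own flow:

    # conf/bootstrap.conf - NiFi JVM heap (example sizes only)
    # initial heap
    java.arg.2=-Xms8g
    # maximum heap
    java.arg.3=-Xmx8g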

Thanks,

Matt

3 REPLIES

Expert Contributor

Great, thanks for your response.

Do you think there is a relationship between cores and RAM, meaning if you have X cores then you should have a proportional amount of RAM? Is there any dependency or good practice? We can think in terms of minimum requirements, assuming we will be running a lot of lightweight flows (batch, scheduled).

I mean, more cores will let us run more flows, so I'm just wondering whether 32GB RAM will be enough for 20 cores if I go for HDF 2.x.x. Say in the future all 20 cores become busy; would RAM then be an issue?

Thanks

Super Mentor
@Obaid Salikeen

There is no direct correlation between CPU and heap memory usage. Heap usage is more processor- and flow-implementation-specific. Processors that do things like splitting or merging of FlowFiles can end up using more heap. FlowFile attributes live in heap memory. NiFi does swap FlowFile attributes to disk per connection based on FlowFile queue count; the default of 20,000 will trigger swapping to start on a connection. But there is no swap threshold based on FlowFile attribute map size. If a user writes large attribute values to FlowFile attributes, that FlowFile's heap usage is going to be higher. You see this in scenarios where large parts of the FlowFile content are extracted to a FlowFile attribute. So when it comes to heap/memory usage, it comes down to flow design more than any correlation to the number of CPUs.
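
For reference, that per-connection swap threshold is controlled in conf/nifi.properties; the value below is the default mentioned above, shown only for illustration:

    # conf/nifi.properties - FlowFile swapping
    # once a single connection queues more than this many FlowFiles,
    # NiFi starts swapping the queued FlowFiles (and their attributes) out of heap to disk
    nifi.queue.swap.threshold=20000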

Thanks,
Matt