
Best practices for NiFi cluster installation

Expert Contributor

Hi,

I have two servers (Memory 8 GB, Disk 146 GB, and 4 CPUs) for HDF 2.1.2. I followed the best practices provided in the link below: https://community.hortonworks.com/content/kbentry/7882/hdfnifi-best-practices-for-setting-up-a-high-...

I would like to know if there are any more best practices I need to follow.

Also, how should the /var, /log, etc. directories be sized based on the total allocated disk?

Thanks,

SJ

1 ACCEPTED SOLUTION

Master Mentor
@Sanaz Janbakhsh

In addition to the guide you mentioned, I strongly recommend you avoid using the embedded ZooKeeper for your NiFi cluster. HDF 2.x is highly dependent on ZooKeeper for things like cluster coordinator and primary node elections, as well as cluster state management. NiFi itself can put considerable strain on your server's resources. So for cluster stability reasons, you should configure your NiFi cluster to use an external ZooKeeper ensemble.
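As a sketch, pointing NiFi at an external ZooKeeper ensemble typically involves nifi.properties settings like the following (the hostnames are placeholders; the same connect string also needs to go in the cluster-provider entry of state-management.xml):

```properties
# Disable the embedded ZooKeeper server on this node
nifi.state.management.embedded.zookeeper.start=false

# Point cluster coordination at the external ensemble (placeholder hosts)
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
nifi.zookeeper.root.node=/nifi
```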

The largest repo should always be your content repository (it holds the content of the files currently being processed as well as any archived data). The FlowFile repository is your most crucial repo (corruption of this repo means data loss). You can control how much disk space is used by the provenance repository (how much you need depends on the size of your dataflow, the volume of data, and the number of events you want to be able to retain). The database repository also stays relatively small (flow configuration history and the user DB live there).
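For reference, each repository's location is configured in nifi.properties, which makes it straightforward to place the repos on separate disks sized to their roles (the paths below are placeholders, not recommendations):

```properties
# One repo per disk so content I/O cannot starve the FlowFile repo
nifi.flowfile.repository.directory=/disk1/nifi/flowfile_repository
nifi.database.directory=/disk1/nifi/database_repository
nifi.content.repository.directory.default=/disk2/nifi/content_repository
nifi.provenance.repository.directory.default=/disk3/nifi/provenance_repository
```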

As far as log directory goes, again this is highly dependent on your log retention policies and the amount of logging you have enabled in NiFi.

Since the sizing of most of the above depends on the complexity of your dataflow, data volumes, and data sizes, it would be impossible to say your system should have disks of size x. The suggested approach is to set up a development/test environment where you can model your dataflow and volumes, and then use that as input to the sizing requirements for your production environment.

Thanks, Matt




Expert Contributor

Thanks, Matt, for the complete advice. I understand that sizing depends on the complexity of the dataflow, but I'm wondering if there is any percentage relationship between those repositories that I can follow, for example 10% to provenance and 50% to content, or some similar calculation for /var, /logs, etc.

Master Mentor
@Sanaz Janbakhsh

Unfortunately, a formula for what percentage of your disk should be allocated to each repo does not exist, and one would frankly be impossible to establish considering how many dynamic inputs come into play. But to establish a starting point from which to adjust, I would suggest the following:

10% - 15% --> FlowFile Repository

5% - 10% --> Database repository

50% - 60% --> Content Repository

? --> Provenance Repository (Depends on your retention policies, but the provenance repo can be restricted to a maximum size in the NiFi configs. The default is 1 GB of disk usage or 24 hours of retention. These are soft limits, so the repo may temporarily exceed the size threshold until clean-up occurs; don't set the limit to the exact size of the partition it is configured to use.)

10% - 15% --> /logs (This is very subjective as well. How much log history do you need to retain? What default log levels have you set? While the /logs directory may stay relatively small during good times, an outage can result in a logging explosion. Consider a downstream system outage: every NiFi processor trying to push data to that downstream system will be producing ERROR logs during that time.)
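On the provenance limits mentioned above, the cap is set in nifi.properties; the values below are the defaults referenced in the reply, and you would adjust them to your own retention needs:

```properties
# Soft limits: clean-up runs periodically, so actual usage can
# temporarily exceed these values -- leave headroom on the partition.
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
```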

The above assumes your OS and applications are installed on a different disk. If not, you will need to adjust accordingly.
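As a rough illustration only (this is not an official formula; the percentages are just the midpoints of the ranges suggested above, with provenance taking the remainder), a quick sketch of how the split might look for a given data disk:

```python
def repo_allocation(total_gb, reserve_gb=0):
    """Split a NiFi data disk across repos using the rough starting
    percentages above (midpoints of each suggested range)."""
    usable = total_gb - reserve_gb
    # Midpoint shares from the ranges in the reply above.
    shares = {
        "flowfile": 0.125,   # 10% - 15%
        "database": 0.075,   # 5% - 10%
        "content": 0.55,     # 50% - 60%
        "logs": 0.125,       # 10% - 15%
    }
    alloc = {name: round(usable * pct, 1) for name, pct in shares.items()}
    # Provenance gets whatever remains; cap it further via nifi.properties.
    alloc["provenance"] = round(usable - sum(alloc.values()), 1)
    return alloc

# Example: the 146 GB disk from the original question.
print(repo_allocation(146))
```

Treat the output purely as a first cut to adjust from after modeling your actual flow in a test environment, as suggested above.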

Thanks,

Matt

Expert Contributor

Thanks a lot Matt. It is really helpful.