
Nifi cluster production configuration

New Contributor

Folks,

 

I have set up a secured NiFi cluster in a development environment, and there are a few things on my mind with respect to production use.

 - There are a few configurations related to the database, FlowFile repository, content repository, provenance repository, components, etc., and I'm wondering what the best practices are for managing these. Should I use a persistent volume/storage on Kubernetes to centralize them for the whole cluster?

                     If yes, would it interfere with internal replication? Wouldn't it be a SPOF?

                     If not, what happens if my cluster goes down? Will I lose all state and data, or should I use a replica set explicitly?

 

Can someone help me understand the best practices for production with respect to the above scenarios, and anything else in general?

 

Thanks,

1 ACCEPTED SOLUTION

Super Mentor

@SachinMehndirat 
There is NO replication of data from the four NiFi repositories across the NiFi nodes in a NiFi cluster.  Each NiFi node in the cluster is only aware of, and only executes against, the FlowFiles on that specific node.

As such, NiFi nodes cannot share a common set of repositories.  Each node must have its own repositories, and it is important to protect those repositories from data loss (the flowfile_repository and content_repository being the most important). 

- flowfile_repository - contains metadata/attributes about FlowFiles actively processing through your NiFi dataflow(s). This includes metadata on the location of the content of queued FlowFiles.

- content_repository - contains content claims that can hold the content for one to many FlowFiles actively being processed, or temporarily archived after termination at the end of your dataflow(s).
- provenance_repository - contains historical lineage information about FlowFiles currently or previously processed through your NiFi dataflows.

- database_repository - contains flow configuration history, which is a record of changes made via the NiFi UI (adding, modifying, deleting, stopping, starting, etc...).  Also contains info about users currently authenticated to the NiFi node.
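Since each node keeps its own copies, on Kubernetes each NiFi pod would typically get its own persistent volume (e.g., via a StatefulSet volumeClaimTemplate) rather than one shared volume. As a rough sketch, the per-node repository locations are set in nifi.properties (the values below are the stock defaults; adjust them to your own volume mount paths):

```properties
# nifi.properties (excerpt) - per-node repository locations, defaults shown.
# In Kubernetes, point each at a per-pod persistent volume so repositories
# are never shared between nodes.
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.database.directory=./database_repository
```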

Processors that record cluster-wide state use ZooKeeper to store and retrieve the state needed by all nodes.  Processors that use local state write that state to NiFi's locally configured state directory.  So in addition to protecting the repositories mentioned above from data loss, you'll also want to make sure the local state directory (unique to each node in the NiFi cluster) is also protected.
The embedded documentation in NiFi for each component has a "State management:" section that will tell you whether that component uses local and/or cluster state.
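The split between local and cluster state described above is configured in state-management.xml (referenced from nifi.properties). A minimal sketch, assuming the stock provider ids; the ZooKeeper connect string is a placeholder for your own ensemble:

```xml
<!-- conf/state-management.xml (excerpt) -->
<stateManagement>
    <!-- Local state: a per-node directory; protect it like the repositories -->
    <local-provider>
        <id>local-provider</id>
        <class>org.apache.nifi.controller.state.providers.local.WriteAheadLocalStateProvider</class>
        <property name="Directory">./state/local</property>
    </local-provider>
    <!-- Cluster-wide state: stored in ZooKeeper, shared by all nodes -->
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">zk1:2181,zk2:2181,zk3:2181</property>
        <property name="Root Node">/nifi</property>
    </cluster-provider>
</stateManagement>
```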


You may find some of the info found in the following articles useful:
https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-hi...
https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Arc...
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped.

Thank you,

Matt

