Created 01-12-2023 03:51 AM
Folks,
I have set up a secured NiFi cluster in a development environment, and there are a few things on my mind with respect to production use.
- There are several configurations related to the database, flowfile repository, content repository, provenance, components, etc., and I'm wondering what the best practices are for managing these files. Should I use persistent volumes/storage on Kubernetes to centralize them for the whole cluster?
If yes, would it interfere with internal replication? Wouldn't it be a single point of failure?
If not, and my cluster goes down, will I lose all the state and data, or should I use a replica set explicitly?
Can someone help me understand the best practices for production with respect to the above scenarios, and anything else in general?
Thanks,
Created 01-12-2023 01:30 PM
@SachinMehndirat
There is NO replication of data from the four NiFi repositories across the NiFi nodes in a NiFi cluster. Each NiFi node in the cluster is only aware of, and only executes against, the FlowFiles on that specific node.
As such, NiFi nodes cannot share a common set of repositories. Each node must have its own repositories, and it is important to protect those repositories from data loss (the flowfile_repository and content_repository being the most important). The corresponding nifi.properties entries are sketched after the list below.
- flowfile_repository - contains metadata/attributes about FlowFiles actively processing through your NiFi dataflow(s). This includes metadata on the location of the content of queued FlowFiles.
- content_repository - contains content claims that can hold the content for one to many FlowFiles actively being processed or temporarily archived after termination at the end of the dataflow(s).
- provenance_repository - contains historical lineage information about FlowFiles currently or previously processed through your NiFi dataflows.
- database_repository - contains the flow configuration history, which is a record of changes made via the NiFi UI (adding, modifying, deleting, stopping, starting, etc.). It also contains info about users currently authenticated to the NiFi node.
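For reference, these repository locations are configured per node in nifi.properties. A minimal sketch using the stock single-directory defaults (on Kubernetes, each of these paths would point at that pod's own persistent volume, never a volume shared across nodes):
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.database.directory=./database_repository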
Processors that record cluster-wide state use ZooKeeper to store and retrieve the state needed by all nodes. Processors that use local state write that state to the node's locally configured state directory. So in addition to protecting the repositories mentioned above from data loss, you'll also want to make sure the local state directory (unique to each node in the NiFi cluster) is protected.
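As a sketch of where that is configured (assuming stock defaults), nifi.properties points at the state management configuration file and names the local and cluster providers:
nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
In state-management.xml, the local provider's "Directory" property (default ./state/local) is the node-local state directory to keep on protected, per-node storage, while the cluster provider's "Connect String" points at your ZooKeeper ensemble.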
The embedded documentation in NiFi for each component has a "State management:" section that will tell you whether that component uses local and/or cluster state.
You may find some of the information in the following articles useful:
https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-hi...
https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Arc...
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click Accept as Solution below each response that helped.
Thank you,
Matt