NiFi cluster: durability, aggregation

Expert Contributor

I have some basic questions about NiFi clusters.

I understand that NiFi uses multiple nodes to speed up the processing of data.

But how do we ensure durability of the data being processed? Suppose we distribute n FlowFiles in a queue to several nodes, with each node getting a subset of the n files. Does NiFi assign the same subset to more than one node, so that if one of the processing nodes goes down, we still have others working on the same subset?

Second, how are the results of processing done on different nodes aggregated? Is it done on the primary node, or is there some other mechanism?

1 ACCEPTED SOLUTION

Master Mentor

@manishg 

NiFi provides high availability at the control plane, not at the data level.
For HA, NiFi uses a zero-master structure at the NiFi controller level through the use of ZooKeeper (ZK). An external ZK cluster is used to elect one NiFi node as the cluster coordinator. If the currently elected cluster coordinator goes down, ZK elects a new cluster coordinator from the remaining nodes still communicating with ZK. This eliminates a single point of failure for accessing your NiFi cluster: any node in the cluster can be used for access.
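
For reference, a minimal sketch of the clustering-related entries in nifi.properties, assuming an external three-node ZK ensemble (hostnames, ports, and timings here are placeholders, not values from this thread):

    # Run this instance as a cluster node rather than standalone
    nifi.cluster.is.node=true
    nifi.cluster.node.address=nifi-node1.example.com
    nifi.cluster.node.protocol.port=11443

    # External ZooKeeper ensemble used for cluster coordinator / primary node election
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
    nifi.zookeeper.root.node=/nifi

    # Flow election when the cluster starts up
    nifi.cluster.flow.election.max.wait.time=5 mins
    nifi.cluster.flow.election.max.candidates=3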

The individual nodes in a NiFi cluster load and execute their own local copy of the flow (flow.xml.gz in older versions, flow.json.gz in newer releases). Each NiFi node also maintains its own set of repositories (database, flowfile, content, and provenance). The flowfile and content repositories only contain the metadata/attributes and content for FlowFiles that traverse the dataflows on that specific node, so node 1 has no information about the data being processed on nodes 2, 3, etc.
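
The per-node repositories are configured in each node's nifi.properties; a sketch with the default relative paths (directories are illustrative and are local to each node):

    # Each node keeps its own copies of these; they are never shared between nodes
    nifi.database.directory=./database_repository
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository
    nifi.provenance.repository.directory.default=./provenance_repository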

When a node is down, the FlowFiles queued on that node remain in its content and flowfile repositories until that node is brought back online or a new node is built to which these repositories can be moved (you cannot merge existing repositories from different nodes). So it is always best to protect the data stored in these repositories (especially content and flowfile) via RAID to prevent data loss.

As far as your last question about aggregating processing done on different nodes, your question is not clear to me. Each node operates independently, with the exception of some cluster-wide state that may be stored in ZK. Cluster-wide state is primarily used by processors to prevent different nodes from consuming the same data (for example, a ListSFTP processor running on the primary node only: if a change in election results in a different node being elected as primary node, the new primary node starts the primary-node-only processors, which retrieve the last recorded cluster state and pick up where the old primary node's processors left off).
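
That cluster-wide state goes through the cluster state provider configured in nifi.properties and state-management.xml; a sketch of the relevant nifi.properties entries, assuming the default provider ids shipped with NiFi:

    # Per-node (local) state and cluster-wide state shared via ZooKeeper
    nifi.state.management.configuration.file=./conf/state-management.xml
    nifi.state.management.provider.local=local-provider
    nifi.state.management.provider.cluster=zk-provider
    # zk-provider is defined in state-management.xml and points at the same ZK connect string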

It is the responsibility of the dataflow design engineer to construct dataflow(s) on the NiFi canvas that distribute data across the NiFi cluster for proper processing.
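
A common pattern for that (a sketch of one typical design, not something from this thread) is a List/Fetch pair with a load-balanced connection:

    ListSFTP   - scheduled on the primary node only; listing state is kept in cluster state (ZK)
        |
        |  connection with Load Balance Strategy = Round robin
        v
    FetchSFTP  - runs on every node; each node fetches only the listings routed to it
        |
        v
    downstream processing happens independently on each node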

If any of the suggestions/solutions provided helped you with your issue, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

Expert Contributor

So basically all nodes execute the entire flow, each on its own data. There is no divide-and-conquer by default; the flow designer has to introduce any such parallelism herself.