
HDF 2.0 Cluster: How are dataflow and failover managed?

Rising Star

Hi all,

I've read the HDF 2.0 documentation concerning dataflows and clustering.

http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.0.0/bk_administration/content/clustering.html
Why Cluster?
NiFi Administrators or Dataflow Managers (DFMs) may find that using one instance of NiFi on a single server is not enough to process the amount of data they have. So, one solution is to run the same dataflow on multiple NiFi servers. However, this creates a management problem, because each time DFMs want to change or update the dataflow, they must make those changes on each server and then monitor each server individually. By clustering the NiFi servers, it's possible to have that increased processing capability along with a single interface through which to make dataflow changes and monitor the dataflow. Clustering allows the DFM to make each change only once, and that change is then replicated to all the nodes of the cluster. Through the single interface, the DFM may also monitor the health and status of all the nodes.
NiFi Clustering is unique and has its own terminology. It's important to understand the following terms before setting up a cluster.

My questions :

- "Each node in the cluster performs the same tasks on the data, but each operates on a different set of data"

- "To run same dataflow on multiple Nifi Servers"

==> What exactly happens on the NiFi nodes? Is there an example use case?

==> What happens if a node fails?

1 ACCEPTED SOLUTION

Master Mentor

@mayki wogno

Nodes in a NiFi cluster do not share data. Each node works on the specific data it has received through some ingest-type NiFi processor. As such, each node has its own repositories for storing that specific data (FlowFile content) and the metadata (FlowFile attributes) about that node-specific data.

As a cluster, every node loads and runs the exact same dataflow. One and only one node in the cluster can be the "primary node" at any given time. Some NiFi processors are not cluster friendly and should only run on one node in the cluster at any given time (GetSFTP is a good example). NiFi allows you to configure those processors with an "on primary node" only scheduling strategy. While these processors will still exist on every node in the cluster, they will only run on the primary node. If the primary node designation in the cluster changes at any time, the cluster takes care of stopping the "on primary node" scheduled processors on the original primary node and starting them on the new primary node.

When a node goes down, the other nodes in the cluster will not pick up work on the data that was queued on that down node. That node, as Bryan pointed out, will pick up where it left off on its queued data once restored to an operational state, provided there was no loss or corruption of either the content or FlowFile repositories on that specific node.

Thanks,

Matt


11 REPLIES

Master Guru

Yes, each node runs the same flow, and it is the data that is divided across the nodes, with the exception of processors that can be scheduled to run on the primary node only. The data on each node is stored in three repositories: the FlowFile repository, the content repository, and the provenance repository. If a node goes down, the FlowFile repository has the state of every FlowFile and will restore to that state when the node comes back, so all FlowFiles will be in the same queue where they were before.
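For reference, the three repositories mentioned above live on each node's local disk and are configured per node in `nifi.properties`. A sketch with the default paths (yours may differ):

```properties
# Each node stores its own data locally (default paths shown)
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
```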

Master Mentor

If the node that goes down happens to be the "primary node", the cluster will automatically elect a new "primary node" from the remaining available nodes and start those "primary node only" processors.
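For context, in HDF 2.0 (NiFi 1.x) the primary node and cluster coordinator are elected through ZooKeeper, so this failover is automatic. The relevant per-node entries in `nifi.properties` look roughly like this (host names and ports below are placeholders):

```properties
# Mark this instance as a cluster node
nifi.cluster.is.node=true
nifi.cluster.node.address=node1.example.com
nifi.cluster.node.protocol.port=11443
# ZooKeeper quorum used for primary node / cluster coordinator election
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```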

Rising Star

The content repository has its own data, so what happens to this local data?

You said the nodes do not share data... What exactly happens? Is each node connected to a specific processor?



Master Guru

In addition to what Matt said, consider this example...

Let's say there is a three-node cluster like this:

node1

  • content repo on node1
  • flow file repo on node1
  • provenance repo on node1

node2

  • content repo on node2
  • flow file repo on node2
  • provenance repo on node2

node3

  • content repo on node3
  • flow file repo on node3
  • provenance repo on node3

Let's say the flow has GetHttp -> PutFile, which is running on all three nodes.

When GetHttp runs on node1 it creates a flow file that is only on node1. If node1 goes down while that flow file is in the queue between the two processors, it will still be there when the node comes back up.

When GetHttp runs on node2 it creates a flow file that is only on node2, same thing for when GetHttp runs on node3.
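To make this behavior concrete, here is a small Python sketch. This is not NiFi code; the `Node` class and its methods are illustrative only, modeling the idea that each node keeps its own queue, persists it in its local FlowFile repository, and restores it after a failure while the other nodes leave that data alone:

```python
# Illustrative model (not NiFi code): each cluster node has its own queue,
# persisted in its local FlowFile repository. Nodes never share queues.

class Node:
    def __init__(self, name):
        self.name = name
        self.queue = []          # in-memory queue between two processors
        self.flowfile_repo = []  # persisted state (survives a restart)
        self.up = True

    def get_http(self, payload):
        # GetHttp runs on every node; each fetch lands only on this node
        flowfile = {"node": self.name, "content": payload}
        self.queue.append(flowfile)
        self.flowfile_repo.append(flowfile)  # persisted locally

    def crash(self):
        # in-memory view is lost, but the on-disk repository remains
        self.up = False
        self.queue = []

    def restart(self):
        # on restart, the queue is rebuilt from the FlowFile repository
        self.up = True
        self.queue = list(self.flowfile_repo)

nodes = [Node("node1"), Node("node2"), Node("node3")]
for i, n in enumerate(nodes):
    n.get_http(f"data-{i}")

nodes[0].crash()
# node2 and node3 do NOT pick up node1's queued flowfile
assert all(ff["node"] != "node1" for n in nodes[1:] for ff in n.queue)

nodes[0].restart()
# node1 resumes exactly where it left off
assert nodes[0].queue == [{"node": "node1", "content": "data-0"}]
```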

Rising Star

So many thanks for your answers. As I understand it, the term 'cluster' does not imply redundancy or HA of the dataflow's data. The NiFi nodes share the same dataflow definition but have separate data.

In the GetHttp case, each web server sends its data to one NiFi node at a time.

So web1 sends data to node1, web2 sends to node2, etc.

If node1 goes down, web1 needs to wait until node1 comes back up.

Master Guru

There are several different aspects of HA, and a cluster does provide some of them...

For example, the primary node and cluster coordinator roles can automatically fail over to another node in the cluster, so that is HA.

Also, in the case where each node in the cluster has a listening processor like ListenHTTP, you can put a load balancer in front of them, so that is HA.

The part that it doesn't provide yet is HA of the data itself, meaning replicating data from node1 to node2; that is something the community is working towards.

Clarifying what you said about GetHTTP... GetHTTP pulls data; it does not have data pushed to it. ListenHTTP would be what you described, with web1 sending data to node1, web2 sending to node2, etc.

This article might help give you an idea of how things are working in a cluster:

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

Rising Star

Thanks for your explanations.