Created 09-21-2016 03:26 PM
Hi all,
I've read the documentation about HDF 2.0 concerning dataflows and clustering.
http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.0.0/bk_administration/content/clustering.html
Why Cluster? NiFi Administrators or Dataflow Managers (DFMs) may find that using one instance of NiFi on a single server is not enough to process the amount of data they have. So, one solution is to run the same dataflow on multiple NiFi servers. However, this creates a management problem, because each time DFMs want to change or update the dataflow, they must make those changes on each server and then monitor each server individually. By clustering the NiFi servers, it's possible to have that increased processing capability along with a single interface through which to make dataflow changes and monitor the dataflow. Clustering allows the DFM to make each change only once, and that change is then replicated to all the nodes of the cluster. Through the single interface, the DFM may also monitor the health and status of all the nodes. NiFi Clustering is unique and has its own terminology. It's important to understand the following terms before setting up a cluster.
My questions:
- "Each node in the cluster performs the same tasks on the data, but each operates on a different set of data"
- "To run the same dataflow on multiple NiFi servers"
==> What exactly happens on the NiFi nodes? Could you give an example use case?
==> What happens if a node fails?
Created 09-21-2016 03:30 PM
Yes, each node runs the same flow, and it is the data that is divided across the nodes, with the exception of processors that can be scheduled to run on the primary node only. The data on each node is stored in three repositories - the FlowFile repository, the content repository, and the provenance repository. If a node goes down, the FlowFile repository has the state of every FlowFile and will restore that state when the node comes back, so all FlowFiles will be in the same queues they were in before.
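If you want to confirm where a given node keeps those repositories, here is a minimal sketch that reads them out of that node's nifi.properties. The property names are the NiFi 1.x defaults; the path to nifi.properties below is just a placeholder for your install:

```python
from pathlib import Path

NIFI_PROPERTIES = Path("/opt/nifi/conf/nifi.properties")  # placeholder path -- adjust for your install

# NiFi 1.x property names for the three per-node repositories
REPO_KEYS = [
    "nifi.flowfile.repository.directory",            # FlowFile metadata/attributes (state restored on restart)
    "nifi.content.repository.directory.default",     # FlowFile content
    "nifi.provenance.repository.directory.default",  # provenance events
]

# Very small .properties parser: skip comments/blank lines, split on the first '='
props = {}
for line in NIFI_PROPERTIES.read_text().splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

for key in REPO_KEYS:
    print(f"{key} = {props.get(key, '<not set>')}")
```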
Created 09-21-2016 03:39 PM
If the node that goes down happens to be the "primary node", the cluster will automatically elect a new "primary node" from the remaining available nodes and start those "primary node only" processors.
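As an illustration, here is a rough sketch that asks the cluster which node currently holds the Primary Node and Cluster Coordinator roles. It assumes an unsecured NiFi 1.x cluster and the /nifi-api/controller/cluster REST endpoint; the hostname and port are placeholders:

```python
import requests  # third-party: pip install requests

NIFI_URL = "http://node1.example.com:8080"  # placeholder -- any connected node's address

# Ask the cluster for its node list; each entry should include the node's
# connection status and any roles it currently holds (e.g. "Primary Node").
resp = requests.get(f"{NIFI_URL}/nifi-api/controller/cluster", timeout=10)
resp.raise_for_status()

for node in resp.json()["cluster"]["nodes"]:
    print(f"{node['address']}:{node['apiPort']}  "
          f"status={node['status']}  roles={node.get('roles', [])}")
```

Re-running this after the current primary node goes down should show the roles move to one of the remaining nodes.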
Created 09-21-2016 03:41 PM
The content repository has its own data, so what happens to this local data?
You mentioned the data being divided across the nodes... what exactly happens? Is each node connected to a specific processor?
Created 09-21-2016 03:53 PM
Nodes in a NiFi cluster do not share data. Each works on the specific data it has received through some ingest-type NiFi processor. As such, each node has its own repositories for storing that specific data (FlowFile content) and the metadata (FlowFile attributes) about that node-specific data.
As a cluster, every node loads and runs the exact same dataflow. One and only one node in the cluster can be the "primary node" at any given time. Some NiFi processors are not cluster friendly and as such should only run on one node in the cluster at any given time (GetSFTP is a good example). NiFi allows you to configure those processors with an "on primary node" only scheduling strategy. While these processors will still exist on every node in the cluster, they will only run on the primary node. If the primary node designation in the cluster should change at any time, the cluster takes care of stopping the "on primary node" scheduled processors on the original primary node and starting them on the new primary node.
When a node goes down, the other nodes in the cluster will not pick up work on the data that was queued on that down node at the time. That node, as Bryan pointed out, will pick up where it left off on its queued data once restored to an operational state, provided there was no loss/corruption to either the content or FlowFile repositories on that specific node.
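To see that each node owns its own queues, here is a rough sketch that prints queued FlowFiles per node instead of the cluster-wide aggregate. It assumes an unsecured NiFi 1.x cluster and that the root process group status endpoint accepts a nodewise=true parameter; the hostname and port are placeholders:

```python
import requests  # third-party: pip install requests

NIFI_URL = "http://node1.example.com:8080"  # placeholder

# Request the root process group status broken down per node rather than the
# cluster-wide aggregate (assumes the endpoint honours nodewise=true).
resp = requests.get(
    f"{NIFI_URL}/nifi-api/flow/process-groups/root/status",
    params={"nodewise": "true"},
    timeout=10,
)
resp.raise_for_status()

status = resp.json().get("processGroupStatus", {})
for snapshot in status.get("nodeSnapshots", []):
    queued = snapshot.get("statusSnapshot", {}).get("queued")
    print(f"{snapshot.get('address')}: queued={queued}")
```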
Thanks,
Matt
Created 09-21-2016 03:59 PM
In addition to what Matt said, consider this example...
Let's say there is a three-node cluster like this:
node1
node2
node3
Let's say the flow has GetHttp -> PutFile, which is running on all three nodes.
When GetHttp runs on node1 it creates a flow file that is only on node1. If node1 goes down while that flow file is in the queue between the two processors, it will still be there when the node comes back up.
When GetHttp runs on node2 it creates a flow file that is only on node2, same thing for when GetHttp runs on node3.
Created 09-21-2016 05:15 PM
Many thanks for your answers. As I understand it, the term 'cluster' does not mean redundancy or an HA dataflow. It just means the NiFi nodes share the same dataflow but have separate data.
With GetHttp, each web server sends its data to one NiFi node at a time.
So web1 sends data to node1, web2 sends to node2, etc.
If node1 goes down, web1 needs to wait until node1 comes back up.
Created 09-21-2016 05:27 PM
There are several different aspects of HA, and a cluster does provide some of them...
For example, the primary node and cluster coordinator can automatically fail over to another node in the cluster, so that is HA.
Also, in the case where each node in the cluster has a listening processor like ListenHTTP, you can put a load balancer in front of them, so that is HA.
The part that it doesn't provide yet is HA of data, meaning replicating data from node1 to node2; that is something the community is working towards.
Clarifying what you said about GetHTTP... GetHTTP is pulling data, not having it pushed from somewhere. ListenHTTP would be what you described with web1 sending data to node1, web2 sending to node2, etc.
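For example, here is a minimal sketch of pushing data through a load balancer sitting in front of ListenHTTP on every node. The load balancer address and port are placeholders, and "contentListener" is assumed to be the processor's default Base Path; a secured cluster would additionally need TLS and authentication:

```python
import requests  # third-party: pip install requests

# Placeholder load-balancer address and port; "contentListener" is assumed to be
# the ListenHTTP processor's default Base Path.
LB_URL = "http://nifi-lb.example.com:9999/contentListener"

resp = requests.post(
    LB_URL,
    data=b'{"event": "example"}',
    headers={"Content-Type": "application/json"},
    timeout=10,
)
# Expect an HTTP 2xx response once whichever node received the request has
# created a FlowFile from the body.
print(resp.status_code)
```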
This article might help give you an idea of how things are working in a cluster:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
Created 09-22-2016 08:00 AM
Thanks for your explanations.