
HDF 2.0 Cluster: how to manage dataflow and failover

Rising Star

Hi all,

I've read the documentation about HDF 2.0 concerning dataflow and cluster.

http://docs.hortonworks.com/HDPDocuments/HDF2/HDF-2.0.0/bk_administration/content/clustering.html
Why Cluster?
NiFi Administrators or Dataflow Managers (DFMs) may find that using one instance of NiFi
on a single server is not enough to process the amount of data they have. So, one solution is to run the same dataflow on multiple NiFi servers. However, this creates a management problem, because each time DFMs want to change or update the dataflow, they must make those changes on each server and then monitor each server individually. By clustering the NiFi servers, it's possible to have that increased processing capability along with a single interface through which to make dataflow changes and monitor the dataflow. Clustering allows the DFM to make each change only once, and that change is then replicated to all the nodes of the cluster. Through the single interface, the DFM may also monitor the health and status of all the nodes.
NiFi Clustering is unique and has its own terminology. It's important to understand the following terms before setting up a cluster.

My questions:

- "Each node in the cluster performs the same tasks on the data, but each operates on a different set of data"

- "To run the same dataflow on multiple NiFi servers"

==> What exactly happens on the NiFi nodes? Could you give an example use case?

==> What happens if a node fails?

1 ACCEPTED SOLUTION

Super Mentor

@mayki wogno

Nodes in a NiFi cluster do not share data. Each node works only on the specific data it has received through some ingest-type NiFi processor. As such, each node has its own repositories for storing that node-specific data (FlowFile content) and the metadata about it (FlowFile attributes).
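For context, the per-node repositories Matt mentions are configured in each node's nifi.properties. A minimal sketch, using the illustrative default relative paths (your install may place them elsewhere, e.g. on dedicated disks):

```properties
# nifi.properties (excerpt) - every node keeps its OWN copies of these;
# repositories are never shared between cluster nodes.
nifi.flowfile.repository.directory=./flowfile_repository
nifi.content.repository.directory.default=./content_repository
nifi.provenance.repository.directory.default=./provenance_repository
```

Because these directories are local to each node, the FlowFiles queued on one node are invisible to the others, which is exactly why another node cannot take over a failed node's queued data.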

As a cluster, every node loads and runs the exact same dataflow. One and only one node in the cluster can be the "primary node" at any given time. Some NiFi processors are not cluster friendly and should only run on one node in the cluster at any given time (GetSFTP is a good example). NiFi allows you to configure those processors with an "on primary node" only scheduling strategy. While these processors will still exist on every node in the cluster, they will only run on the primary node. If the primary node designation changes at any time, the cluster takes care of stopping the "on primary node" scheduled processors on the original primary node and starting them on the new primary node.
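For reference, joining a node to an HDF 2.0 / NiFi 1.x cluster is done in nifi.properties; primary-node election is handled through ZooKeeper. A minimal sketch with hypothetical hostnames and ports (adjust to your environment):

```properties
# nifi.properties (excerpt) - illustrative values for one cluster node
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443

# ZooKeeper quorum used for cluster coordination and primary-node election
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```

The "on primary node" behavior itself is not set in this file: you choose it per processor in the UI, on the processor's Scheduling tab, so a non-cluster-friendly processor like GetSFTP runs on only one node at a time.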

When a node goes down, the other nodes in the cluster will not pick up work on the data that was queued on the downed node. That node, as Bryan pointed out, will pick up where it left off on its queued data once restored to an operational state, provided there was no loss or corruption of either the content or FlowFile repositories on that specific node.

Thanks,

Matt



Rising Star

I've got another question, about multiple dataflows in the same cluster.

I have two projects (Project1 and Project2) with two DFM teams (team1 and team2).

Is it possible to use the same cluster to share resources (CPU/memory/disks), with each DFM having their own web UI to manage their own dataflow?

Or does each cluster have only one web UI with a single dataflow?

Super Mentor
@mayki wogno

When asking new questions unrelated to the current thread, please start a new Community Connection question. This benefits the community at large, since others may be searching for answers to the same question.