Support Questions

SandyClouds · ‎06-07-2023

Hi Team,

I am new to NiFi servers.

I am currently working on a development project in which I built a job that fetches changes from MySQL databases and applies them to a warehouse.

I have 70 similar but distinct databases, and therefore 70 NiFi jobs. The final aggregated database is around 100 GB, with roughly 1 GB of changes per day across all databases.

I am currently working in a development environment on a single server.

When I move to production with 70 NiFi jobs running all the time, do I need a NiFi cluster, or is a single server fine? What factors decide cluster vs. standalone?

Also, how should I handle NiFi upgrades in the future? Is it better to use Docker, or is a normal instance better?

Please help me with your valuable knowledge.

cotopaul · ‎06-07-2023

@SandyClouds,

The answer is in any case not an easy one, as it mostly depends on what you are planning to do, how, how often, and so on 🙂

First things first: a cluster of 5 strong machines is much better than a cluster of 20 small machines. NiFi is recommended to be scaled vertically rather than horizontally.

Now, regarding your questions: if your single node has enough resources to sustain all those workflows plus 10% (as headroom for failover, if necessary), then you do not need a cluster to perform your tasks.
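To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes the 1 GB of daily changes from the question is spread evenly over 24 hours (real change streams are usually bursty, so treat the result as a floor, not a sizing guide), and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_kb_per_second(daily_gb):
    """Average rate (decimal KB/s) needed to move daily_gb gigabytes in one day."""
    return daily_gb * 1_000_000 / SECONDS_PER_DAY


def with_headroom(baseline, headroom=0.10):
    """Baseline load plus a safety margin; 0.10 is the 10% mentioned above."""
    return baseline * (1 + headroom)


baseline = average_kb_per_second(1.0)   # 1 GB/day, taken from the question
print(round(baseline, 1))                # average KB/s with no margin
print(round(with_headroom(baseline), 1)) # same rate with 10% headroom
```

The same `with_headroom` calculation can be applied to whatever CPU, memory, or disk figures you measure on the development server while all 70 flows are running.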

There are pros and cons to using a standalone instance as well as a cluster. To name a few:
Single Node:
PROs:
- easy to manage.
- easy to configure.

- no https required.

CONs:

- in case of issues with the node, your NiFi instance is down.

- it uses plenty of resources when it needs to process data, as everything is done on a single node.

Cluster:
PROs:
- redundancy and failover --> when a node goes down, the others will take over and process everything, meaning that you will not be affected.

- the used resources will be split among all the nodes, meaning that you can cover more use cases than on a single node.

CONs:

- complex setup, as it requires ZooKeeper + plenty of other config files.

- complex to manage --> analysis will be done on X nodes instead of a single node.

Regarding the Docker question, it is up to you. I am not really a big fan of Docker, so my personal opinion is that you should use a separate physical server with SSDs and good CPU and RAM, especially if you want to process an analytical workload (billions of actions per hour/day).

So, in conclusion, both standalone and cluster are good options for running NiFi, but you will have to choose based on your project requirements and your project schedule (for example, if new flows are coming, you will need to increase the resources, and so on).

MattWho · ‎06-07-2023

@SandyClouds

Some clarifications and additions to @cotopaul's pros and cons:

Single Node:
PROs:
- easy to manage. <-- Setting up and managing configuration is easier since you only need to do it on one node. But in a cluster, the configuration files on all nodes will be almost the same (with some variations in hostname properties, and in certificates if you secure your cluster).
- easy to configure. <-- More configuration is needed in a cluster setup, but once it is set up, nothing changes in the user experience when it comes to interacting with the UI.

- no https required. <-- Not sure how this is a PRO. I would not recommend running an unsecured NiFi, as doing so allows anyone to access your dataflows and the data being processed. You can also run an unsecured NiFi cluster, although I do not recommend that either.

CONs:

- in case of issues with the node, your NiFi instance is down. <-- Very true, single point of failure.

- it uses plenty of resources when it needs to process data, as everything is done on a single node.

Cluster:
PROs:
- redundancy and failover --> when a node goes down, the others will take over and process everything, meaning that you will not be affected. <-- Not completely accurate. Each node in a NiFi cluster is only aware of the data (FlowFiles) queued on that specific node, so each node works only on the FlowFiles present on that node. It is the responsibility of the dataflow designer/builder to build dataflows in such a way that FlowFiles are distributed across all nodes. When a node goes down, any FlowFiles currently queued on that node will not be processed by the other nodes. However, the other nodes will continue processing their own data and all new data coming into your cluster.

- the used resources will be split among all the nodes, meaning that you can cover more use cases than on a single node. <-- Nodes do not share or pool resources across the cluster. If your dataflows are built correctly, the volume of data (FlowFiles) being processed will be distributed across all your nodes, allowing each node to process a smaller subset of the overall FlowFile volume. This means more resources are available across your cluster to handle more volume.

NEW -- A NiFi cluster can be accessed via any one of the member nodes. No matter which node's UI you access, you will be presented with stats for all nodes. There is a cluster UI, accessible from the global menu, that shows a breakdown for each node. Any changes you make from the UI of any member node will be replicated to all nodes.

NEW -- Since all nodes run their own copy of the flow, a catastrophic node failure does not mean the loss of all your work, since the same flow.json.gz (which contains everything related to your dataflows) can be retrieved from any of the other nodes in your cluster.
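The point about per-node queues can be illustrated with a toy model. This is a sketch in plain Python, not NiFi code: the `Node` class and both functions are hypothetical, and the round-robin step only stands in for whatever distribution the flow designer builds into the dataflow.

```python
from collections import deque
from itertools import cycle


class Node:
    """Hypothetical stand-in for one cluster node with its own local queue."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.queue = deque()


def distribute(flowfiles, nodes):
    """Spread incoming FlowFiles round-robin over the nodes that are up."""
    live = [n for n in nodes if n.up]
    if not live:
        raise RuntimeError("no node is up to receive data")
    targets = cycle(live)
    for ff in flowfiles:
        next(targets).queue.append(ff)


def process_all(nodes):
    """Each live node drains only its own queue; down nodes process nothing."""
    done = []
    for n in nodes:
        while n.up and n.queue:
            done.append(n.queue.popleft())
    return done


nodes = [Node("node1"), Node("node2"), Node("node3")]
distribute(range(6), nodes)      # two FlowFiles land on each node
nodes[1].up = False              # node2 fails with FlowFiles still queued
distribute([6, 7], nodes)        # new data goes only to node1 and node3

done = process_all(nodes)
stranded = list(nodes[1].queue)
print(sorted(done))   # FlowFiles handled by the surviving nodes
print(stranded)       # FlowFiles waiting until node2 comes back
```

In this model the FlowFiles queued on node2 are not lost, but nothing else picks them up: they stay put until that node is running again, while new data keeps flowing through the other two nodes.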

CONs:

- complex setup, as it requires ZooKeeper + plenty of other config files. <-- A NiFi cluster requires a multi-node ZooKeeper setup. A ZooKeeper quorum is required for cluster stability, and ZooKeeper also stores the cluster-wide state needed by your dataflows. ZooKeeper is responsible for electing nodes in your cluster to the Cluster Coordinator and Primary Node roles. If a node that holds one of these roles goes down, ZooKeeper will elect one of the nodes that are still up to that role.

- complex to manage --> analysis will be done on X nodes instead of a single node. <-- Not clear. Yes, you have multiple nodes, and each of them produces its own set of NiFi logs. However, if a component within your dataflow is producing bulletins (exceptions), the bulletin will report all nodes, or the specific node(s), on which it was produced. Cloudera offers centralized management of your NiFi cluster deployment via Cloudera Manager. It makes deploying and managing a NiFi cluster across multiple nodes easy, sets up and configures ZooKeeper for you, and makes securing your NiFi easy as well by generating the needed certificates/keystores for you.
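For a feel of what "quorum" means in practice, here is a small Python sketch of the majority arithmetic a ZooKeeper ensemble relies on: a strict majority of its servers must be up. The function names are made up for illustration, and this covers the ZooKeeper ensemble only, not the number of NiFi nodes.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of a ZooKeeper ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many ZooKeeper servers can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (1, 3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

Note that going from 3 to 4 servers does not let you lose any more of them, which is why odd ensemble sizes such as 3 or 5 are the usual choice.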

Hope this helps,

Matt

steven-matison · ‎06-07-2023

@SandyClouds You should really check out DataFlow. Running 70 jobs in one NiFi, in many NiFis, or in containerized NiFi is going to be a big job to manage, not only the setup but also the operation over time. That's not even getting into sizing, performance, etc. These types of activities are eliminated when you deploy and operate flows in DataFlow. There you can deploy multiple copies of the same flow, operate them with auto-scaling, and fully CI/CD the entire process to create, start, restart, etc. That last concept is how you achieve smooth operation of 70+ flows without ever touching or administering NiFi.

Happy to demo it for you if you want to take a look.
