Created 10-17-2017 03:33 PM
@Matt.Clarke I read an interesting reply of yours to a question dated 3rd March https://community.hortonworks.com/questions/86732/failover-mechanism-in-nifi.html . You mentioned that data HA across NiFi nodes is a future roadmap item. Has this been implemented? If not, is there a release version in which it will be implemented? Also, what is the process for retrieving lost data from a NiFi node that can be restarted?
Created 10-17-2017 03:46 PM
Hi @Ben Morris
This feature is still on the roadmap and it's not available yet: https://cwiki.apache.org/confluence/display/NIFI/High+Availability+Processing
What are you trying to achieve? Would RAID disks be an acceptable solution for you?
Created 10-18-2017 08:16 AM
The data is time critical and must be as near to real time as possible. If a node goes down and there is a delay in that node's queued data reaching its destination, this would not be acceptable, as alerts could potentially be delayed. Could you please describe the process for migrating a failed node's queued data to a new node? My thoughts are that this could potentially be automated and might fit within an acceptable time frame for the data to reach its destination.
Created 10-18-2017 08:22 AM
Also... is there an estimate for when High Availability Processing will be available, or is there a workaround that could be put in place?
Created 10-18-2017 08:58 AM
Hi @Ben Morris
I understand the requirement; I have the same need for a few use cases. Unfortunately, there's no ETA for this feature yet. This is something the community is aware of. Getting it done depends on priorities as well as the complexity of the feature.
Regarding migration, data queued on a node can be used again if the node is brought back up. If this is not possible, you can spin up a new node and configure it to use the existing repositories from the old node (they are not specific to a NiFi node). In my opinion, this migration process will depend on your infrastructure. If you are on a bare-metal node with RAID local storage, it will take time, since you need to bring up a new physical node with the old disks (if node recovery is not possible). If you are on virtual infrastructure, the task is easier: you can create a new VM, install NiFi, and point it at the existing repositories. Here too, time and complexity will depend on your storage type (local or network).
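If you wanted to script that re-pointing step, a minimal sketch could look like the one below. It only assumes the standard repository keys from nifi.properties; the install path and the mount points for the salvaged disks are example values you would replace with your own.

```python
# Sketch: point a replacement NiFi node's nifi.properties at repositories
# salvaged from a failed node. The property keys are the standard NiFi
# repository keys; the paths are examples and must match wherever you
# re-attached the old disks/volumes.
from pathlib import Path

NIFI_PROPERTIES = Path("/opt/nifi/conf/nifi.properties")  # example install path

# Example mount points for the salvaged repositories (adjust to your setup).
SALVAGED_REPOS = {
    "nifi.flowfile.repository.directory": "/mnt/old-node/flowfile_repository",
    "nifi.content.repository.directory.default": "/mnt/old-node/content_repository",
    "nifi.provenance.repository.directory.default": "/mnt/old-node/provenance_repository",
}

def repoint_repositories(props_file: Path, overrides: dict) -> None:
    """Rewrite the repository directory properties to the salvaged locations."""
    lines = props_file.read_text().splitlines()
    updated = []
    for line in lines:
        key = line.split("=", 1)[0].strip()
        updated.append(f"{key}={overrides[key]}" if key in overrides else line)
    props_file.write_text("\n".join(updated) + "\n")

if __name__ == "__main__":
    repoint_repositories(NIFI_PROPERTIES, SALVAGED_REPOS)
    print("Repository paths updated; restart NiFi so it picks up the old repos.")
```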
Working on HA/fault tolerance with real-time constraints is not an easy task; you have a lot of things to consider around data duplication. I am thinking out loud here, but if you can afford an at-least-once strategy, you may be able to design your flow to achieve it (using a state backend). There's no easy, standard solution though. It will depend on your data source, your ability to deduplicate data, and so on. This is something I am currently working on.
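For the deduplication piece, the core idea is just an idempotent "have I seen this key before?" check against some persistent state. A toy sketch of that check follows; SQLite is only a stand-in for whatever state backend you choose, and record_id is a hypothetical unique identifier carried by each record.

```python
# Toy deduplication check against a persistent state backend.
# SQLite is a stand-in here; in practice the state could live in a distributed
# cache, HBase, Redis, etc. "record_id" is a hypothetical unique key per record.
import sqlite3

conn = sqlite3.connect("dedup_state.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (record_id TEXT PRIMARY KEY)")

def is_duplicate(record_id: str) -> bool:
    """Return True if record_id was already processed, otherwise remember it."""
    with conn:  # commits the insert on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen (record_id) VALUES (?)", (record_id,)
        )
    # rowcount == 0 means the key already existed, i.e. a replayed duplicate.
    return cur.rowcount == 0

# Example: only the first occurrence gets forwarded to the destination.
for rid in ["evt-1", "evt-2", "evt-1"]:
    if is_duplicate(rid):
        print(f"dropping duplicate {rid}")
    else:
        print(f"forwarding {rid}")
```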
Created 10-18-2017 09:53 AM
Interesting...
In regards to flow design, do you mean that if a node goes down, you replay everything that has not been acknowledged as processed at your data source (JMS, Kafka) and then remove duplicates before pushing to the data's storage destination (Kafka, HDFS)? I think the Notify, Wait and DetectDuplicate processors would be useful in this case.
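In plain Python terms (just to illustrate the pattern I have in mind, not a NiFi flow), it would be something like the sketch below: auto-commit disabled, and the offset committed only after the record is written to the destination, so anything uncommitted when a node dies gets replayed and then filtered by the duplicate check. The topic, broker and write_to_destination() helper are placeholders.

```python
# At-least-once replay sketch with kafka-python (illustration only, not a NiFi flow).
# Auto-commit is disabled and the offset is committed only after the record has
# been deduplicated and written to the destination, so anything in flight when a
# node dies is re-delivered on restart and dropped by the duplicate check.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "alerts",                              # placeholder topic
    bootstrap_servers="broker:9092",       # placeholder broker
    group_id="alerts-consumer",            # placeholder consumer group
    enable_auto_commit=False,              # commit manually after a successful write
    auto_offset_reset="earliest",
)

def write_to_destination(payload: bytes) -> None:
    """Placeholder for the push to Kafka/HDFS/etc."""
    print(f"stored {len(payload)} bytes")

seen = set()  # stand-in for the persistent dedup state discussed above

for record in consumer:
    key = f"{record.topic}-{record.partition}-{record.offset}"
    if key not in seen:
        write_to_destination(record.value)
        seen.add(key)
    # Only acknowledge once the record is safely stored; an uncommitted offset
    # means the record will be replayed (and deduplicated) after a failure.
    consumer.commit()
```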
Created 10-18-2017 09:59 AM
Exactly. This is what I am looking for.
Created 01-29-2019 08:28 PM
Hey,
I see that the feature is still listed as pending on the wiki. Does that mean this HA feature is still not available today?
@Ben can you please share the solution Abdelkrim proposed with Notify, Wait and DetectDuplicate? How do you store the latest offset you think you have read from Kafka? How does Notify work in this case?
Thanks !
Created 04-19-2019 05:17 AM
I am not able to implement the cluster coordinator concept for NiFi UI HA.
Is there any link for that?
And is this (NiFi HA) still not implemented by Hortonworks?
Created 09-07-2020 12:02 AM
We have similar requirements. We ran a small POC which worked very well, but when we started assessing the NFRs we got stuck on this bottleneck.
The issue is:
1. We have a clustered environment with three nodes.
2. To verify the behaviour, we set up a group of processors and ran a flow. Then we stopped one of them from the primary node. The change was reflected correctly on all nodes.
3. We scaled down the primary node from which we had run the flow.
4. Earlier we were able to see the stuck/queued message replicated on all non-primary nodes. As soon as the primary node was down, the other nodes no longer showed that queued message.
5. When we brought the former primary node back up, everything looked fine again.
Is there any plan to support this HA scenario in Apache NiFi?
https://cwiki.apache.org/confluence/display/NIFI/High+Availability+Processing
Please suggest.