Support Questions


Has Data HA across NiFi nodes been implemented?

Explorer

@Matt.Clarke I read an interesting reply to a question dated 3rd March https://community.hortonworks.com/questions/86732/failover-mechanism-in-nifi.html . You mentioned Data HA across NiFi nodes is a future roadmap item. Has this been implemented? If not, is there a release version in which it will be implemented? Also, what is the process for retrieving lost data from a NiFi node that can be restarted?

9 REPLIES


Hi @Ben Morris

This feature is still on the roadmap and it's not available yet: https://cwiki.apache.org/confluence/display/NIFI/High+Availability+Processing

What are you trying to achieve? Would a RAID disk setup be an acceptable solution for you?

Explorer

The data is time-critical and must be as near to real time as possible. If a node goes down and there is a delay in that node's queued data getting to its destination, this would not be acceptable, as alerts could potentially be delayed. Could you please describe the process for migrating the queued data of a failed node to a new node? My thinking is that this could potentially be automated and might fit into an acceptable time frame for the data to get to its destination.

Explorer

Also, is there an estimate for when High Availability Processing will be available, or is there a workaround that could be put in place?


Hi @Ben Morris

I understand the requirement; I have the same needs for a few use cases. Unfortunately, there's no ETA for this feature yet. This is something the community is aware of. Getting this done depends on priorities as well as the complexity of this feature.

Regarding migration, data queued on a node can be used again if the node is brought back online. If this is not possible, you can spin up a new node and configure it to use the existing repositories from the old node (they are not specific to a particular NiFi node). IMO this migration process will depend on your infrastructure. If you are on a bare-metal node with RAID local storage, this will take time, as you need to bring up a new physical node with the old disks (if node recovery is not possible). If you are on virtual infrastructure, the task will be easier since you can create a new VM, install NiFi, and point it at the existing repositories. Here also, time and complexity will depend on your storage type (local or network).
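For example (illustrative only; the mount path is hypothetical), the replacement node's nifi.properties could simply point at the repositories recovered from the failed node, assuming its disks are mounted under /mnt/old-nifi:

    # nifi.properties on the replacement node (illustrative paths)
    nifi.flowfile.repository.directory=/mnt/old-nifi/flowfile_repository
    nifi.content.repository.directory.default=/mnt/old-nifi/content_repository
    nifi.provenance.repository.directory.default=/mnt/old-nifi/provenance_repository

You would also want to reuse the old node's flow.xml.gz so the queued FlowFiles still map to the same connections; NiFi then restores the queues from the FlowFile and content repositories on startup.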

Working on HA/fault tolerance with real-time requirements is not an easy task. You have a lot of things to consider around data duplication. I am thinking out loud here, but if you can afford an at-least-once strategy, you may be able to design your flow to achieve it (using a state backend). There's no easy standard solution though. It will depend on your data source, your ability to deduplicate data, and so on. This is something I am currently working on.

Explorer

Interesting...

Regarding flow design, do you mean that if a node goes down, you replay everything that has not been acknowledged as processed at your data source (JMS, Kafka) and then remove duplicates before pushing to the data's storage destination (Kafka, HDFS)? I think the Notify, Wait and DetectDuplicate processors would be useful in this case.
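For example (my own illustration, not from this thread; it assumes the data is consumed with ConsumeKafka, which writes kafka.topic, kafka.partition and kafka.offset attributes on each FlowFile), DetectDuplicate could key off those attributes using a cache shared across the cluster:

    # DetectDuplicate - illustrative settings
    Cache Entry Identifier    : ${kafka.topic}-${kafka.partition}-${kafka.offset}
    Distributed Cache Service : DistributedMapCacheClientService (backed by a DistributedMapCacheServer)
    Age Off Duration          : 1 hour

FlowFiles routed to the 'duplicate' relationship are dropped, while 'non-duplicate' FlowFiles continue on to the destination (Kafka, HDFS).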


Exactly. This is what I am looking for.

New Contributor

Hey,

I see that the feature is still pending on the wiki. Does that mean that even today this HA feature is not yet available?

@Ben can you please share the solution proposed by Abdelkrim with Notify, Wait and DetectDuplicate? How do you store the latest offset you think you have read from Kafka? How does Notify work in this case?

Thanks!

New Contributor

I am not able to implement the cluster coordinator concept for NiFi UI HA.

Is there any link for that?

And is this (NiFi HA) still not implemented by Hortonworks?

New Contributor

We have similar requirements. We had a small POC which worked very well. When we started assessing the NFRs, we got stuck with this bottleneck.

 

The issue is:

1. We have a clustered environment with three nodes.
2. To test it, we built a flow with a set of processors and ran it. Then we stopped one of the processors from the primary node. All information was reflected correctly on all nodes.
3. We scaled down the primary node from which we had run the flow.
4. Earlier, we could see the replicated stuck/queued messages on all non-primary nodes. As soon as the primary node was down, the other nodes no longer showed those queued messages.
5. When we brought the former primary node back up, everything looked fine again.

 

Is there any plan to support this HA scenario in Apache NiFi?

https://cwiki.apache.org/confluence/display/NIFI/High+Availability+Processing

 

Please suggest.