
How do I remember the last index for an incremental fetch using NiFi in a multinode cluster?

I am having a hard time understanding how to build a flow for a simple use case: doing an incremental fetch from a REST API. I just need to store the last index I have retrieved and use that index as the starting index for my next fetch. I can think of a few ways of doing this, but all of them seem to have problems. Here are my ideas and thoughts; how are other people tackling this use case?

UpdateAttribute Stored State

Use UpdateAttribute with stored state to store my index.

Problem: UpdateAttribute only stores state locally, so if my node goes down or there is a primary-node switch, that state is lost.

Store State in Flowfiles

Loop the output of my GetHTTP through some UpdateAttribute stages that do something like ${new_beginning}=${last_end}

Problem: the node holding my state-storing flowfiles could go down, and there is no way to "drain" a node if it's got these long-lived flowfiles.

Store State in an external RDBMS

Just store my index in some external DB.

Problem: none really, just an extra operational burden and an external dependency.
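For illustration, the external-RDBMS idea boils down to a read-state / fetch / write-state cycle around the API call. Here is a minimal sketch using SQLite; the table name, key, and `fetch_page` callback are all hypothetical stand-ins, not anything NiFi provides:

```python
import sqlite3

def get_last_index(conn):
    """Read the last stored index, defaulting to 0 on the first run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fetch_state (key TEXT PRIMARY KEY, last_index INTEGER)"
    )
    row = conn.execute(
        "SELECT last_index FROM fetch_state WHERE key = 'my_feed'"
    ).fetchone()
    return row[0] if row else 0

def save_last_index(conn, index):
    """Upsert the new high-water mark after a successful fetch."""
    conn.execute(
        "INSERT INTO fetch_state (key, last_index) VALUES ('my_feed', ?) "
        "ON CONFLICT(key) DO UPDATE SET last_index = excluded.last_index",
        (index,),
    )
    conn.commit()

def incremental_fetch(conn, fetch_page):
    """One cycle: read state, fetch records after it, persist the new state."""
    start = get_last_index(conn)
    records = fetch_page(start)  # e.g. GET /api/items?after=<start> (hypothetical endpoint)
    if records:
        save_last_index(conn, max(r["id"] for r in records))
    return records
```

Because the state lives outside the cluster, any node can run the next cycle, which is exactly the property the other approaches struggle with.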

Distributed Cache Controllers

Talk to a controller service that stores state.

Problem: from researching, I found that DistributedMapCache isn't actually cluster-wide. Do I understand that correctly?

Am I missing anything? How are others solving this use case that I imagine is very common?


Super Guru

@David Miller

You can use a PutDistributedMapCache processor to store your last index, and on the next run use FetchDistributedMapCache to get the last index value and use it in your next API call. The DistributedMapCache does work cluster-wide.

I have answered a similar use case before; refer to this link, get the .xml template, and upload it to your NiFi instance. You just need to replace the ExecuteSql processor with an InvokeHTTP processor and configure the InvokeHTTP processor with all the required properties.

The distributed map cache is cluster-wide, but the standard DistributedMapCacheServer only runs on one node of the cluster. If you are concerned about that node going down, you can look at the other DistributedMapCacheClient implementations that talk to Redis or HBase, both of which offer high availability.
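The Fetch/Put flow described above is just a get-before, put-after pattern around the HTTP call. A dict-backed stand-in for the cache client (the class and function names here are illustrative, not NiFi APIs) sketches the shape of it:

```python
class MapCacheClient:
    """Stand-in for a DistributedMapCache client: a shared key/value store."""
    def __init__(self):
        self._store = {}

    def fetch(self, key, default=None):
        """Analogous to FetchDistributedMapCache."""
        return self._store.get(key, default)

    def put(self, key, value):
        """Analogous to PutDistributedMapCache."""
        self._store[key] = value

def run_fetch_cycle(cache, invoke_http):
    """One flow cycle: fetch last index, call the API, store the new index."""
    last_index = cache.fetch("last.index", default=0)  # FetchDistributedMapCache
    new_index, records = invoke_http(last_index)       # InvokeHTTP with the index
    cache.put("last.index", new_index)                 # PutDistributedMapCache
    return records
```

Since every node talks to the same cache service, the index survives a primary-node switch; the remaining single point of failure is the cache server itself, which is why a Redis- or HBase-backed client helps.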