Clarifications on state management within NiFi processors

I haven't gone through the code for the DistributedCache yet. After a first try today, a few questions came up.

1) Do we have to use a different Distributed Cache Server/Client for each processor that has state management? For example, can we use the same Distributed Cache Server/Client for both ListSFTP and ListHDFS within the Process Group?

2) If we specify a value for the "persistence directory" in the DistributedCacheServer, the assumption is that the cache is kept both in memory and on disk. Is that understanding correct?

3) The expectation when using the DistributedCachingService is that state is maintained across the cluster, which means that even if we lose the node that was running the processor, we still do not duplicate the listing. But when I read the thread at http://mail-archives.apache.org/mod_mbox/nifi-users/201611.mbox/%3CCA%2BWJ-%2B%2Bqdkg-qRzP-7gUAX%2BA..., specifically the line "it does not implement any coordination logic that work nicely with NiFi cluster", I am not sure I follow the issue. Please clarify.

@Wynner Thanks much for the response. A follow-up on Question 3: say I have 4 nodes in the cluster, Node 1 is the primary node, and the DistributedMapCacheClient controller service is configured to use the cache server on Node 2. If Node 2 goes down, the state information is lost, and the next scheduled List will return everything that has already been processed along with any new files. Is that understanding right? If yes, this doesn't really seem to be distributed behavior (which is probably what the other thread is talking about).

@Prakash Ravi

Basically correct, which is why NiFi uses ZooKeeper for state information now.

I wouldn't use the DistributedMapCache unless I absolutely had to. Which processors are you using?
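To make the failure mode concrete, here is a minimal sketch of why losing the stored state causes a full re-listing. It is a simplified model, not NiFi's actual implementation: the `ListingState` class, the `list_new_files` function, and the file names and timestamps are all hypothetical, and the only assumption is that a List-style processor remembers the newest timestamp it has already emitted.

```python
from dataclasses import dataclass


@dataclass
class ListingState:
    """Hypothetical stand-in for the state a List-style processor keeps."""
    last_seen_ts: int = 0


def list_new_files(files, state):
    """Return names of files newer than the stored timestamp, then advance it."""
    new = sorted((ts, name) for name, ts in files.items() if ts > state.last_seen_ts)
    if new:
        state.last_seen_ts = new[-1][0]
    return [name for _, name in new]


files = {"a.csv": 100, "b.csv": 200}
state = ListingState()

first = list_new_files(files, state)    # nothing listed before: both files
files["c.csv"] = 300
second = list_new_files(files, state)   # state intact: only the new file

lost = ListingState()                   # state lost, e.g. the cache node went down
third = list_new_files(files, lost)     # everything is listed again
```

With the state intact, only `c.csv` comes back on the second run. With a fresh state, all three files are listed again, which is the duplication described above.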

@Wynner I am using ListSFTP. But the behavior should not change based on the processor, right?

@Prakash Ravi

Correct, but the ListSFTP processor does not require a Distributed Cache Service to maintain state. So, don't create one. It will use ZooKeeper by default.

@Wynner I am unable to reply directly to your last response. Thanks for the answers again. Is there a list of processors that use ZooKeeper by default for state management (or do all of them use it)? I assume ListHDFS works the same way.

When, and in what scenario, would one use the DistributedCacheService? I tested ListSFTP by bringing down the primary node and by restarting all nodes, and the listing works as expected without any DistributedCacheService configuration.

@Wynner Also, how do we clear the state if we want to re-list the files (for example, because there was an issue with the processing of the data)?

@Prakash Ravi

There are only three processors which require a Distributed Map Cache server: DetectDuplicate, FetchDistributedMapCache and PutDistributedMapCache. The rest will use ZooKeeper where applicable.

To clear the state of a processor, do the following:

Right-click the processor and select View state from the menu.

Then just click Clear state and the files will be listed again.
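
If you need to script this instead of using the UI, the sketch below builds the equivalent REST call. It assumes the NiFi REST API exposes `POST /processors/{id}/state/clear-requests`, which is my understanding of the API but should be checked against the documentation for your NiFi version. The base URL, the processor ID, and the `build_clear_state_request` helper are placeholders for illustration, and the processor most likely needs to be stopped first, as in the UI.

```python
import urllib.request


def build_clear_state_request(base_url, processor_id, token=None):
    """Build (but do not send) a POST request that asks NiFi to clear a processor's state."""
    # Assumed endpoint; verify against the REST API docs for your NiFi version.
    url = f"{base_url.rstrip('/')}/processors/{processor_id}/state/clear-requests"
    headers = {}
    if token:
        # Only needed on secured instances; the token value is a placeholder.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=b"", method="POST", headers=headers)


# Placeholder base URL and processor ID.
req = build_clear_state_request("http://localhost:8080/nifi-api", "PROCESSOR-ID")

# To actually send it against a running NiFi:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes it easy to inspect the URL before pointing it at a real cluster.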