Created 05-24-2016 01:50 PM
In the NiFi WebCrawler template located here:
https://github.com/hortonworks-gallery/nifi-templates/tree/master/templates
There is a "remove duplicates" processor that uses a DistributedMapCacheClientService. I tried to google/bing that, but I couldn't find out exactly what it is. Is it something I have to install, configure, or enable? If someone could point me to information on the Distributed Cache Service, what it is used for, and how to use it, I would greatly appreciate it (as you can probably guess, I'm pretty new to Hadoop).
Created 05-24-2016 01:58 PM
The DistributedMapCache is a NiFi concept used to store information for later retrieval, either by the current processor or by another processor. There are two components: the DistributedMapCacheServer, which runs on one node if you are in a cluster, and the DistributedMapCacheClientService, which runs on all nodes if you are in a cluster and communicates with the server. Both of these are Controller Services, configured in NiFi through the controller section in the top right toolbar. Processors use the client service to store and retrieve data from the cache server. In this case, DetectDuplicate uses the cache to store information about what it has seen and to determine whether an incoming item is a duplicate.
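To make the idea concrete, here is a minimal sketch in plain Python of how a key/value cache can be used to spot duplicates. This is not NiFi code and not the NiFi API: the `MapCache` class, its `put_if_absent` method, the `route` function, and the example URLs are all made up for illustration. The in-memory dictionary stands in for the cache server, and `route` stands in for a processor that talks to it through a client.

```python
class MapCache:
    """Hypothetical in-memory stand-in for the cache server: a key/value map."""

    def __init__(self):
        self._entries = {}

    def put_if_absent(self, key, value):
        """Store value only if key is new; return True when it was stored."""
        if key in self._entries:
            return False
        self._entries[key] = value
        return True

    def get(self, key):
        """Return the cached value for key, or None if it was never stored."""
        return self._entries.get(key)


def route(cache, identifier, description):
    """Return 'non-duplicate' the first time an identifier is seen, else 'duplicate'."""
    stored = cache.put_if_absent(identifier, description)
    return "non-duplicate" if stored else "duplicate"


# Example: the third URL repeats the first, so it is flagged as a duplicate.
cache = MapCache()
urls = ["http://a.example/", "http://b.example/", "http://a.example/"]
results = [route(cache, url, "seen by crawler") for url in urls]
print(results)  # ['non-duplicate', 'non-duplicate', 'duplicate']
```

The key point is that the "have I seen this before?" state lives in the shared cache rather than in the processor itself, which is why every node needs a client pointing at the same server.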
Created 05-24-2016 01:54 PM
I believe the information you are looking for is here:
You have links to three pages talking about distributed cache.
In short, it gives you a key/value map for storing information along your flow. The service is what you reference in a processor to link that processor to this map.
Hope this helps.
Created 12-05-2016 09:56 AM
Any thoughts on how to clear the DMC (DistributedMapCache)? Suppose I have 4 entries in the DEPT_LKP table, DEPT_NO 10, 20, 30, and 40, and they get loaded into the DMC. If I later delete the DEPT_NO 20 entry from the source table, the DMC won't delete it from the cache. Worse, it will keep using the cached value for DEPT_NO 20.
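The behaviour described above can be reproduced with a plain Python dictionary standing in for the cache. This is only an illustration, not NiFi code: the `load` and `reconcile` functions and the `dept-NN` values are hypothetical. Whether and how a given cache setup lets you remove keys is a separate question this sketch does not answer; it only shows why a load that inserts or overwrites can never drop a deleted row, and what a reconciling pass would have to do.

```python
def load(cache, source):
    """Copy every source row into the cache (insert or overwrite only)."""
    for key, value in source.items():
        cache[key] = value


def reconcile(cache, source):
    """Remove cached keys that no longer exist in the source; return them."""
    stale = sorted(key for key in cache if key not in source)
    for key in stale:
        del cache[key]
    return stale


# Hypothetical lookup rows keyed by DEPT_NO.
source = {10: "dept-10", 20: "dept-20", 30: "dept-30", 40: "dept-40"}
cache = {}
load(cache, source)

# Delete DEPT_NO 20 from the source and reload: the cache still holds it.
del source[20]
load(cache, source)
stale_before = 20 in cache

# A reconciling pass compares keys and drops the ones that disappeared.
removed = reconcile(cache, source)
```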