How can we configure controller services for the "DetectDuplicate" processor in Apache NiFi?
Labels: Apache NiFi, Cloudera DataFlow (CDF)
Created ‎08-01-2016 10:57 AM
I want to use the "DetectDuplicate" processor to remove duplicate JSON content or duplicate tweets, and then merge the remaining records into a single file.
Can someone help me with this? @Jeremy Dyer, @Matt Burgess
Thanks in advance.
Created ‎08-01-2016 11:10 AM
Hi @Yogesh Sharma,
First you need to extract a field of your JSON that uniquely identifies the content, so that it can be used as the key for duplicate detection.
Let's say you have:
{"id":"myId", "name":"foo", ...}
You can use an EvaluateJsonPath processor to extract the value of "id" into a FlowFile attribute: set Destination to flowfile-attribute, then add a property to the processor with name = id and value = $.id
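To make the extraction step concrete, here is a minimal Python sketch of what the `$.id` expression does to the sample document above. It is only an illustration of the idea, not NiFi code: the function name `extract_id` and the `attributes` dictionary are hypothetical stand-ins for the processor and the FlowFile attributes.

```python
import json


def extract_id(flowfile_content: str, key: str = "id"):
    """Return the top-level value of `key` as a string, like the JsonPath $.id.

    Returns None when the key is absent (hypothetical helper, not a NiFi API).
    """
    document = json.loads(flowfile_content)
    value = document.get(key)
    return None if value is None else str(value)


# The extracted value ends up as an attribute named "id" on the FlowFile.
attributes = {"id": extract_id('{"id": "myId", "name": "foo"}')}
print(attributes)  # prints {'id': 'myId'}
```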
Then you can route the FlowFiles to your DetectDuplicate processor. This processor needs a map cache service, so go to the controller services panel and create two controller services:
- a DistributedMapCacheServer with the default settings
- a DistributedMapCacheClientService with Server Hostname set to localhost, so that it uses the DistributedMapCacheServer you created.
Then start the two services and, in your DetectDuplicate processor, reference the DistributedMapCacheClientService you defined. Also set the processor's Cache Entry Identifier property to ${id} so that it compares the attribute you extracted (the default is ${hash.value}).
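Conceptually, the duplicate check works like the sketch below: each identifier is looked up in a shared cache, unseen identifiers are stored and routed to "non-duplicate", and repeated ones go to "duplicate". This is a simplified Python illustration under stated assumptions, not how NiFi is implemented: a plain dictionary stands in for the DistributedMapCacheServer, and the names `route_records` and `merge_records` are hypothetical.

```python
import json


def route_records(json_lines, cache=None):
    """Split records into (non_duplicate, duplicate) lists by their "id" field.

    `cache` is a plain dict standing in for the distributed map cache; pass the
    same dict across calls to remember identifiers seen earlier.
    """
    cache = {} if cache is None else cache
    non_duplicate, duplicate = [], []
    for line in json_lines:
        entry_id = str(json.loads(line)["id"])
        if entry_id in cache:
            duplicate.append(line)
        else:
            cache[entry_id] = True  # remember the identifier for later records
            non_duplicate.append(line)
    return non_duplicate, duplicate


def merge_records(records):
    """Join the surviving records into one newline-delimited payload."""
    return "\n".join(records)


tweets = [
    '{"id": "1", "text": "hello"}',
    '{"id": "2", "text": "world"}',
    '{"id": "1", "text": "hello"}',
]
unique, repeated = route_records(tweets)
merged = merge_records(unique)
```

In the real flow, the "non-duplicate" relationship would be connected to a merge step, and the "duplicate" relationship could simply be auto-terminated.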
Hope this helps.
Created ‎08-01-2016 11:29 AM
Thanks Pierre Villard. My NiFi is installed as a cluster, so which settings do I need to specify in "DistributedMapCacheClientService"? I also read somewhere that we need to set "nifi.controller.service.configuration.file" in the "nifi.properties" file.
Can you shed some light on this as well?
Created ‎08-01-2016 11:35 AM
Have a look here: https://community.hortonworks.com/articles/9203/how-to-migrate-a-standalone-nifi-into-a-nifi-clust.h...
It is advised to run the DistributedMapCacheServer on the NCM (NiFi Cluster Manager). Then, in the DistributedMapCacheClientService, use the IP address of your NCM instead of localhost.
