Created 08-01-2016 10:57 AM
I want to use the "DetectDuplicate" processor to remove duplicate JSON content (duplicate tweets) and merge the result into a single file.
Can someone help me with this? @Jeremy Dyer, @Matt Burgess
Thanks in advance.
Created 08-01-2016 11:10 AM
Hi @Yogesh Sharma,
First you need to extract a field of your JSON that can serve as a unique identifier of the content.
Let's say you have:
{"id":"myId", "name":"foo", ...}
You can use an EvaluateJsonPath processor to extract the value of "id" into a FlowFile attribute: set Destination to flowfile-attribute and add a property with name = id and value = $.id
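To illustrate what that EvaluateJsonPath property does, here is a rough Python sketch (not NiFi code). It reads the top-level "id" field from the JSON content and stores it as an attribute. The plain dict standing in for FlowFile attributes and the function name extract_id_attribute are assumptions made up for this example.

```python
import json


def extract_id_attribute(content, attributes=None):
    """Mimic an EvaluateJsonPath property id = $.id (illustration only)."""
    # Copy the incoming attributes so the caller's dict is not modified.
    attributes = dict(attributes or {})
    doc = json.loads(content)
    # The JsonPath $.id selects the top-level "id" field of the document.
    attributes["id"] = str(doc["id"])
    return attributes


attrs = extract_id_attribute('{"id": "myId", "name": "foo"}')
print(attrs)
```

In NiFi itself no code is needed; the processor property does the same job declaratively.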
Then you can route FlowFiles to your DetectDuplicate processor. This processor needs a map cache service, so go into the controller services panel and create two controller services:
- a DistributedMapCacheServer with the default settings
- a DistributedMapCacheClientService with the server hostname set to localhost so that it uses the DistributedMapCacheServer you created.
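Then start the two services and, in your DetectDuplicate processor, reference the DistributedMapCacheClientService you defined. Also set the Cache Entry Identifier property to ${id} so that duplicates are detected on the attribute you extracted (the default is ${hash.value}).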
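Conceptually, DetectDuplicate asks the cache whether it has already seen the identifier: the first FlowFile with a given identifier goes to the non-duplicate relationship, and later ones go to duplicate. A minimal Python sketch of that idea, assuming one JSON document per line and a plain dict standing in for the distributed map cache (split_duplicates is a hypothetical helper, not part of NiFi):

```python
import json


def split_duplicates(json_lines, cache=None):
    """Split JSON documents into (non_duplicate, duplicate) lists by their "id"."""
    # The dict plays the role of the map cache shared by the processor.
    cache = {} if cache is None else cache
    non_duplicate, duplicate = [], []
    for line in json_lines:
        key = str(json.loads(line)["id"])
        if key in cache:
            # Identifier already cached: this document is a duplicate.
            duplicate.append(line)
        else:
            # First time this identifier is seen: remember it and keep the document.
            cache[key] = True
            non_duplicate.append(line)
    return non_duplicate, duplicate


tweets = [
    '{"id": "1", "text": "hello"}',
    '{"id": "2", "text": "world"}',
    '{"id": "1", "text": "hello"}',
]
unique, dupes = split_duplicates(tweets)
print(len(unique), len(dupes))
```

Only the non-duplicate output would then be sent on to be merged into a single file.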
Hope this helps.
Created 08-01-2016 11:29 AM
Thanks Pierre Villard. My NiFi is installed in a cluster, so what settings do I need in "DistributedMapCacheClientService"? I also read somewhere that we need to set "nifi.controller.service.configuration.file" in the "nifi.properties" file.
Can you shed some light on this as well?
Created 08-01-2016 11:35 AM
Have a look here: https://community.hortonworks.com/articles/9203/how-to-migrate-a-standalone-nifi-into-a-nifi-clust.h...
It is advised to run the DistributedMapCacheServer on the NCM (NiFi Cluster Manager). Then, in the DistributedMapCacheClientService, use the IP address of your NCM instead of localhost.
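If the client service cannot reach the cache, it may help to check from each node that the server port is open. Here is a small Python sketch, assuming the DistributedMapCacheServer listens on its default port 4557. The function name cache_server_reachable is made up for this example, and you would pass your own NCM address.

```python
import socket


def cache_server_reachable(host, port=4557, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling cache_server_reachable with the NCM IP address from a worker node should return True once the DistributedMapCacheServer is started. This only confirms that the port accepts connections, not that the cache itself is working.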