Created on 06-14-2017 12:12 PM - edited 08-17-2019 09:11 PM
Hi everybody,
I use NiFi 1.0.0 on an AIX server.
My ListFile processor emits the same file into two different dataflows. It is scheduled to run every 15 seconds.
The file O27853044.1135 starts being written at 11:35 and finishes at 11:45.
Is it normal that the processor creates a dataflow at 11:42?
How can I prevent the ListFile processor from creating a dataflow before the file has finished being written?
Thanks for your help
Created 06-14-2017 12:30 PM
Created 06-14-2017 12:59 PM
I'm going to try number 2.
And could you give me an example of properties for number 3 and the DetectDuplicate processor?
Thanks, TV
Created on 06-14-2017 01:59 PM - edited 08-17-2019 09:10 PM
With number 3, I am assuming that every file has a unique filename, from which we can determine whether the same filename has ever been listed more than once. If that is not the case, you would need to use DetectDuplicate after fetching the actual data (less desirable, since you will have wasted the resources to potentially fetch the same file twice before deleting the duplicate).
Let's assume every file has a unique filename. If so, the detect-duplicate flow would look like this:
with the DetectDuplicate configured as follows:
You will also need to add two controller services to your NiFi:
- DistributedMapCacheServer
- DistributedMapCacheClientService
The value of the "filename" attribute on the FlowFile is checked against the entries in the DistributedMapCacheServer. If the filename does not exist, it is added. If it already exists, the FlowFile is routed to the duplicate relationship.
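As a concrete starting point, the filename-based flow could be configured roughly as follows (the property names are those of the standard NiFi components; the specific values shown, such as the age-off duration and the hostname, are illustrative assumptions, not requirements):

```
# DetectDuplicate processor properties
Cache Entry Identifier    : ${filename}
FlowFile Description      : ${filename}
Age Off Duration          : 1 hour        # optional; allows the same filename to pass again later
Distributed Cache Service : DistributedMapCacheClientService

# Controller services to add
DistributedMapCacheServer
  Port            : 4557                  # default port
DistributedMapCacheClientService
  Server Hostname : localhost             # assumption: cache server runs on the same NiFi node
  Server Port     : 4557
```

With this setup, the first FlowFile seen with a given ${filename} goes to the non-duplicate relationship and the entry is cached; any later FlowFile with the same filename is routed to duplicate.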
In scenario 2, where filenames may be reused, we need to detect whether the content is a duplicate after fetching it. In this case the flow may look like this:
After fetching the content of a FlowFile, the HashContent processor is used to create a hash of the content and write it to a FlowFile attribute (hash.value by default). The DetectDuplicate processor is then configured to look for FlowFiles with the same hash.value to determine whether they are duplicates.
FlowFiles whose content hash already exists in the DistributedMapCacheServer are routed to the duplicate relationship, where you can delete them if you like.
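For the content-hash variant, a minimal configuration sketch might look like this (again using the standard property names; the algorithm choice is an illustrative assumption, and the same DistributedMapCacheServer/Client controller services as in the filename case are assumed to be in place):

```
# HashContent processor properties
Hash Attribute Name : hash.value          # default attribute name
Hash Algorithm      : MD5                 # default; a stronger algorithm can be selected

# DetectDuplicate processor properties
Cache Entry Identifier    : ${hash.value} # this is DetectDuplicate's default value
FlowFile Description      : ${filename}
Distributed Cache Service : DistributedMapCacheClientService
```

Note that because hashing happens after the fetch, duplicate content is downloaded before being discarded, which is why the filename-based approach is preferable when filenames are unique.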
If this answer addressed your original question, please mark it as accepted under the answer.
Thanks,
Matt
Created 12-18-2017 02:00 PM
Thank You Matt, I too was facing similar issue and your suggestion worked.
Created 06-14-2017 02:35 PM
The second suggestion works as well.
I'll keep the third one for future use.
Thanks for all Matt
TV.