Created 01-26-2017 03:07 PM
Hi
I'm looking to read the most recent file from a directory exactly once using the NiFi GetFile processor and push it to KafkaProducer. The Kafka consumer should receive the messages on the topic only once.
Current config:
GetFile connected to KafkaProducer in NiFi. I'm using pykafka consumer to read the messages for further processing.
I am using the GetFile processor and experimenting with Minimum File Age, Maximum File Age and the scheduling settings to ensure I see only one set of messages in the Kafka consumer. In every experiment so far, the same content has appeared twice.
Thanks
Chetan
Created 01-26-2017 04:32 PM
You may want to use ListFile -> FetchFile rather than GetFile. ListFile will keep track of the files it has found and will not list them again unless they have been updated (and still satisfy the other filters you specify in the properties).
Can you describe your use case a bit more? Is it the case that many files may be placed in the directory "at once" but you only want the latest one? Also do the files need to remain in that directory? If so, I think ListFile -> FetchFile is your best bet, but if not, you can set GetFile to remove the file on read. Then only "new" files will be found by GetFile (because any files processed would be removed).
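To make the idea concrete, here is a rough Python sketch of the kind of bookkeeping a listing step does. It is not NiFi's actual implementation; `ListingTracker` and `latest_only` are hypothetical names made up for illustration, and the sketch assumes a file counts as "updated" when its modification time changes.

```python
from pathlib import Path


class ListingTracker:
    """Hypothetical stand-in for the state a ListFile-style step keeps."""

    def __init__(self):
        # path -> modification time (ns) recorded when the file was last listed
        self.seen = {}

    def list_new(self, directory):
        """Return files that are new or modified since the previous call, oldest first."""
        fresh = []
        for entry in Path(directory).iterdir():
            if not entry.is_file():
                continue
            mtime = entry.stat().st_mtime_ns
            # Skip files already listed with this exact modification time.
            if self.seen.get(str(entry)) != mtime:
                self.seen[str(entry)] = mtime
                fresh.append((mtime, entry))
        fresh.sort(key=lambda pair: pair[0])
        return [entry for _, entry in fresh]


def latest_only(paths):
    """Keep just the most recently modified file from an oldest-first list."""
    return paths[-1:]
```

Calling `list_new` twice on an unchanged directory returns an empty list the second time, which is why the same file is not picked up again even though it stays in place. Wrapping the result in `latest_only` covers the case where several files arrive together but only the newest should be processed.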
Created 01-29-2017 02:39 PM
Thanks Matt. That worked. Yes, I need to ensure the files remain in the directory. I could do a PutFile to a temp/backup dir and do a GetFile with remove-on-read. Many files will not be placed in the directory at once. By default we would need to process only the latest file.
Cheers!