
How do you use GetFile to read the latest file exactly once and push it to KafkaProducer?


New Contributor

Hi

I'm looking to read the most recent file in a directory exactly once using the NiFi GetFile processor and push it to KafkaProducer. The Kafka consumer should receive the messages on the topic only once.

Current config:

GetFile is connected to KafkaProducer in NiFi. I'm using a pykafka consumer to read the messages for further processing.

I am using the GetFile processor and experimenting with Minimum File Age, Maximum File Age, and the scheduling settings to ensure I see only one set of messages in the Kafka consumer. In every experiment so far, the same content has appeared twice.

Thanks

Chetan

Accepted Solution

Re: How do you use GetFile to read the latest file exactly once and push it to KafkaProducer?

Super Guru

You may want to use ListFile -> FetchFile rather than GetFile. ListFile will keep track of the files it has found and will not list them again unless they have been updated (and still satisfy the other filters you specify in the properties).

Can you describe your use case a bit more? Is it the case that many files may be placed in the directory "at once" but you only want the latest one? Also, do the files need to remain in that directory? If so, I think ListFile -> FetchFile is your best bet. If not, you can set GetFile to remove the file on read (its Keep Source File property set to false). Then GetFile will find only "new" files, because any file it has already processed will have been removed.
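To picture the ListFile behavior described above, here is a minimal Python sketch of "list a file once, and again only if it has been updated". It is a toy model, not NiFi code: `FileLister` and `list_new` are hypothetical names, and it assumes a file counts as "updated" when its modification time changes. NiFi's real state handling differs in detail.

```python
import os
import tempfile


class FileLister:
    """Toy stand-in for the listing behavior described above (not NiFi code)."""

    def __init__(self, directory):
        self.directory = directory
        # path -> modification time (ns) recorded when the file was last listed
        self._seen = {}

    def list_new(self):
        """Return paths that are new or modified since the previous call."""
        fresh = []
        for entry in sorted(os.scandir(self.directory), key=lambda e: e.name):
            if not entry.is_file():
                continue
            mtime = entry.stat().st_mtime_ns
            # Emit the file only if we have never seen it, or its mtime changed.
            if self._seen.get(entry.path) != mtime:
                self._seen[entry.path] = mtime
                fresh.append(entry.path)
        return fresh


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "data.txt")
        with open(path, "w") as handle:
            handle.write("first version")

        lister = FileLister(directory)
        print(lister.list_new())  # first poll: the file is listed
        print(lister.list_new())  # second poll: nothing new to list

        # Simulate an update by moving the modification time forward.
        stat = os.stat(path)
        os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 5_000_000_000))
        print(lister.list_new())  # the updated file is listed again
```

A poller with no such memory, and which leaves the source file in place, would return the same file on every poll. That is one plausible reason for the duplicated content seen with GetFile.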



Re: How do you use GetFile to read the latest file exactly once and push it to KafkaProducer?

New Contributor

Thanks Matt. That worked. Yes, I need to ensure the files remain in the directory. Alternatively, I could do a PutFile to a temp/backup dir and do a GetFile with remove-on-read. Many files will not be placed in the directory at once. By default we would need to process only the latest file.
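The "process only the latest file" rule can be stated precisely with a small Python sketch. It assumes "latest" means the greatest modification time; `latest_file` is a hypothetical helper for illustration, not a NiFi processor or property.

```python
import os
from typing import Optional


def latest_file(directory: str) -> Optional[str]:
    """Return the path of the most recently modified regular file, or None."""
    files = [entry for entry in os.scandir(directory) if entry.is_file()]
    if not files:
        return None
    # Pick the entry with the largest modification time (nanosecond precision).
    return max(files, key=lambda entry: entry.stat().st_mtime_ns).path
```

If two files share the same modification time, this sketch returns whichever one the directory scan yields first, so a real flow may want a tie-breaker such as the file name.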

Cheers!
