Created 11-18-2016 07:48 AM
I am trying to copy files from my local machine to a remote HDFS. I am using GetFile -> PutHDFS processors.
My exact use case is:
- I want to know as soon as the copy is done (currently I am tracking bytes transferred via the REST API to know this; see the sketch after this list)
- Copy just once
- Keep the source files
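For reference, the status check I mentioned is roughly the following (assuming an unsecured NiFi on the default port; I read the transferred byte counts out of the returned JSON):

    curl http://localhost:8080/nifi-api/flow/process-groups/root/status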
Problems I am getting:
- If I configure it to keep the source files and set the scheduling time to 0 secs, the GetFile processor creates flow files again and again for the same files
- I don't think I should set the scheduling time to a large value, as each task processes only one file and then waits for the next schedule
Please help.
Open to trying other approaches.
Thanks.
Created 11-18-2016 08:29 AM
To achieve what you are looking for, I'd replace the GetFile processor with the combination of ListFile and FetchFile processors. The first one will list files according to your conditions and will emit an empty flow file for each listed file, with an attribute containing the path of the file to retrieve. The second one will actually fetch the content of the file at the given path. The first processor keeps "state" about already-listed files, so it won't consume the same file multiple times. Besides, this approach is also recommended for better load distribution when you have a NiFi cluster.
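As a rough illustration (the input directory is hypothetical; property names as in recent NiFi versions), the flow could look like:

    ListFile -> FetchFile -> PutHDFS

    ListFile
        Input Directory     = /path/to/local/files          <- hypothetical
        Run Schedule        = 0 sec                         (safe here: the processor's state prevents re-listing)

    FetchFile
        File to Fetch       = ${absolute.path}/${filename}  (the default; uses the attributes set by ListFile)
        Completion Strategy = None                          (leaves the source files in place)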
Hope this helps.
Created 11-18-2016 12:10 PM
In addition to @Pierre Villard 's suggestion: PutHDFS transfers flow files that have been successfully written to HDFS to its "success" relationship, so you can put a processor downstream of PutHDFS (along the "success" relationship); at that point you can be sure the file has been written to HDFS, and can proceed accordingly.
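For example (LogAttribute is just a placeholder for whatever downstream step suits you):

    PutHDFS
        success -> LogAttribute   (or PutEmail, an UpdateAttribute counter, etc. -- your "copy done" signal)
        failure -> route back to PutHDFS for retry, or to an alerting path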
Created 11-19-2016 09:22 AM
@Matt Burgess Yes, using the "success" relationship I would only know that the current (single) flow file has been written successfully to HDFS. How would I know that all my files have finished processing exactly once?