How to know if files are copied to HDFS?
Labels:
- Apache NiFi
Created 11-18-2016 07:48 AM
I am trying to copy files from my local machine to a remote HDFS, using GetFile -> PutHDFS processors.
My exact use case is:
- I want to know as soon as the copy is done (currently I am tracking bytes transferred through the REST API; see the sketch after this post)
- Copy each file just once
- Keep the source files
Problems I am getting:
- If I configure GetFile to keep the source files and set the scheduling time to 0 secs, the processor creates flow files again and again for the same files
- I don't think I should set the scheduling time to a large value, because each task processes only one file and then waits for the next schedule
Please help. I am open to trying other approaches.
Thanks.
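For reference, a minimal sketch of the REST-API tracking mentioned above, assuming an unsecured NiFi 1.x instance on localhost:8080; the base URL and processor id are placeholders, and the status counters are rolling 5-minute windows, so this only approximates "copy finished":

```python
import time
import requests

NIFI_API = "http://localhost:8080/nifi-api"  # assumed unsecured NiFi instance
PUTHDFS_ID = "<processor-uuid>"              # placeholder: id of the PutHDFS processor

def puthdfs_snapshot():
    """Return the current status snapshot for the PutHDFS processor."""
    resp = requests.get(f"{NIFI_API}/flow/processors/{PUTHDFS_ID}/status")
    resp.raise_for_status()
    return resp.json()["processorStatus"]["aggregateSnapshot"]

# Wait for activity to start, then for it to stop. The snapshot counters
# cover the last five minutes, so "stopped" here means "no flow files
# written out in the last window", which is only an approximation.
seen_activity = False
while True:
    snap = puthdfs_snapshot()
    print("flow files out:", snap["flowFilesOut"], "bytes written:", snap["bytesWritten"])
    if snap["flowFilesOut"] > 0:
        seen_activity = True
    elif seen_activity:
        break
    time.sleep(30)
print("no PutHDFS activity in the last window; copy likely complete")
```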
Created 11-18-2016 08:29 AM
To achieve what you are looking for, I'd replace the GetFile processor with the combination of the ListFile and FetchFile processors. The first one lists files according to your conditions and emits an empty flow file for each listed file, with an attribute containing the path of the file to retrieve. The second one actually fetches the content of the file at that path. ListFile keeps state about which files it has already listed, so it won't consume the same file multiple times. Besides, this approach is also recommended because it allows better load distribution when you have a NiFi cluster.
Hope this helps.
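To illustrate the idea (this is not NiFi code, just the list-then-fetch pattern): persist the newest modification timestamp already listed, and on each run emit only files newer than it, so every file is listed exactly once even though the source files stay in place. The paths below are placeholders:

```python
import json
from pathlib import Path

STATE_FILE = Path("liststate.json")  # hypothetical persistent state store
SOURCE_DIR = Path("/data/in")        # assumed source directory

def list_new_files():
    """Emit only files modified since the last recorded timestamp."""
    last_seen = json.loads(STATE_FILE.read_text())["ts"] if STATE_FILE.exists() else 0.0
    new = sorted(p for p in SOURCE_DIR.iterdir()
                 if p.is_file() and p.stat().st_mtime > last_seen)
    if new:
        STATE_FILE.write_text(json.dumps({"ts": max(p.stat().st_mtime for p in new)}))
    return new

# A FetchFile-like step would then read each emitted path's content.
for path in list_new_files():
    print("would fetch:", path)
```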
Created 11-18-2016 12:10 PM
In addition to @Pierre Villard's suggestion: PutHDFS transfers flow files that have been successfully written to HDFS to its "success" relationship, so you can put a processor downstream of PutHDFS along the "success" relationship. At that point you can be sure the file has been successfully written to HDFS, and can proceed accordingly.
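If you also want an out-of-band check that a file actually landed, WebHDFS can confirm it; a minimal sketch, assuming WebHDFS is enabled on the NameNode (host, port, and path below are placeholders):

```python
import requests

NAMENODE = "http://namenode:9870"  # assumed NameNode; use port 50070 on Hadoop 2.x
HDFS_PATH = "/landing/myfile.csv"  # hypothetical target path

# GETFILESTATUS returns 200 with a FileStatus object if the path exists,
# 404 otherwise.
resp = requests.get(f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
                    params={"op": "GETFILESTATUS"})
if resp.ok:
    print("file present, size:", resp.json()["FileStatus"]["length"])
else:
    print("file not found, HTTP", resp.status_code)
```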
Created 11-19-2016 09:22 AM
@Matt Burgess Yes, using the "success" relationship I would only know whether the current (single) flow file has been written successfully to HDFS. How would I know that all my files have finished processing exactly once?