Support Questions


How to know if files are copied to HDFS?

Contributor

I am trying to copy files from my local machine to a remote HDFS cluster. I am using GetFile -> PutHDFS processors.

My exact use case is:

- I want to know as soon as the copy is done (currently I am using the REST API to track bytes transferred to know this)

- Copy just once

- Keep the source files
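
For reference, the REST-based byte tracking mentioned above could look roughly like the sketch below. The base URL and processor id are placeholders, and the field path (`status.aggregateSnapshot.bytesWritten`) is my recollection of the NiFi 1.x REST API, so verify it against your version:

```python
import json
import urllib.request

# Placeholder endpoint; substitute your NiFi host and processor id.
NIFI_PROCESSOR_URL = "http://localhost:8080/nifi-api/processors/{id}"

def bytes_written(status_json):
    """Pull the bytes-written figure out of a processor-status
    payload (aggregated across the cluster, rolling window)."""
    snap = status_json["status"]["aggregateSnapshot"]
    return snap["bytesWritten"]

def fetch_status(processor_id):
    # Plain GET against the NiFi REST API (no auth/TLS shown).
    with urllib.request.urlopen(
            NIFI_PROCESSOR_URL.format(id=processor_id)) as resp:
        return json.loads(resp.read())

# Sample payload shaped like the API response, so the parsing
# can be exercised here without a live NiFi instance:
sample = {"status": {"aggregateSnapshot": {"bytesWritten": 1048576}}}
print(bytes_written(sample))  # -> 1048576
```

Polling a counter like this only tells you bytes are flowing, not that every file has landed, which is part of why it feels unsatisfying as a completion check.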

Problems I am getting:

- If I configure GetFile to keep the source files with a run schedule of 0 secs, it creates flow files for the same files again and again

- I don't think I should set the run schedule to a large value, as each task processes only one file and then waits for the next schedule

Please help.

Open to trying other approaches.

Thanks.

1 ACCEPTED SOLUTION


Hi @Karthik Manchala,

To achieve what you are looking for, I'd replace the GetFile processor with a combination of the ListFile and FetchFile processors. The first one lists files according to your conditions and emits an empty flow file for each listed file, with an attribute containing the path of the file to retrieve. The second one actually fetches the content of the file at the given path. ListFile maintains state and keeps track of already-processed files, so it won't consume the same file multiple times. This approach is also recommended because it allows better load distribution when you have a NiFi cluster.
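
The stateful listing behaviour can be illustrated outside NiFi with a small sketch. This is not NiFi code, just an analogue that remembers the newest modification timestamp it has seen, so a second listing pass emits nothing for files it already reported:

```python
import os
import tempfile

def list_new_files(directory, state):
    """ListFile analogue: emit only files modified after the
    timestamp recorded in `state`, then advance the timestamp.
    (Real ListFile also handles ties on identical timestamps.)"""
    latest = state.get("last_mtime", 0.0)
    new_files = [os.path.join(directory, name)
                 for name in sorted(os.listdir(directory))
                 if os.path.getmtime(os.path.join(directory, name)) > latest]
    if new_files:
        state["last_mtime"] = max(os.path.getmtime(p) for p in new_files)
    return new_files

def fetch_file(path):
    """FetchFile analogue: read the content for a listed path."""
    with open(path, "rb") as f:
        return f.read()

# Demo: the file is listed exactly once; the source file is untouched.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("hello")
    state = {}
    first = list_new_files(d, state)   # one path emitted
    second = list_new_files(d, state)  # empty: already seen
    content = fetch_file(first[0])
    print(len(first), len(second))     # -> 1 0
```

This is the property that solves the "flow files again and again" problem in the question: the source files stay in place, but the listing step never re-emits them.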

Hope this helps.


4 REPLIES


Master Guru

In addition to @Pierre Villard's suggestion: PutHDFS transfers flow files that have been successfully written to HDFS to its "success" relationship, so you can put a processor downstream of PutHDFS along the "success" relationship. At that point you can be sure the file has been written to HDFS, and can proceed accordingly.
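
To picture that guarantee, here is a toy sketch. It calls neither NiFi nor HDFS; `put_hdfs` is a made-up stand-in that just returns a relationship name. The point is that a downstream step wired to "success" fires only for confirmed writes:

```python
def put_hdfs(flowfile):
    """Stand-in for PutHDFS: pretend the write succeeds unless the
    flow file is marked bad (purely illustrative)."""
    return "failure" if flowfile.get("bad") else "success"

def run_flow(flowfiles, on_success):
    """Route each flow file on the relationship put_hdfs returns;
    on_success runs only for flow files confirmed written."""
    confirmed = []
    for ff in flowfiles:
        if put_hdfs(ff) == "success":
            on_success(ff)
            confirmed.append(ff)
    return confirmed

done = run_flow([{"name": "a"}, {"name": "b", "bad": True}],
                on_success=lambda ff: print(ff["name"], "is in HDFS"))
print(len(done))  # -> 1
```

In a real flow the `on_success` step would be whatever notification or bookkeeping processor you choose; failures would typically be routed and retried separately.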

Contributor

@Matt Burgess Yes, using the "success" relationship I would only know whether the current (single) flow file has been written successfully to HDFS. How would I know that all my files have finished processing exactly once?

New Contributor

@Matt Burgess Did you find any solution to check whether all files are copied successfully?