How to know if files are copied to hdfs?

I am trying to copy files from my local machine to a remote HDFS. I am using GetFile -> PutHDFS processors.

My exact use case is:

- I want to know as soon as the copy is done (currently I am using the REST API to track bytes transferred to know this)

- Copy just once

- Keep the source files

Problems I am getting:

- If I configure GetFile to keep the source files and set the scheduling time to 0 secs, the GetFile processor creates flowfiles again and again for the same files

- I don't think I should set the scheduling time to a large value, as each task processes only one file and then waits for the next schedule

Please help.

Open to try other approaches,

Thanks.

1 ACCEPTED SOLUTION

Re: How to know if files are copied to hdfs?

Hi @Karthik Manchala,

To achieve what you are looking for, I'd replace the GetFile processor with the combination of the ListFile and FetchFile processors. The first one lists files according to your conditions and emits an empty flow file for each listed file, with an attribute containing the path of the file to retrieve. The second one actually fetches the content of the file at the given path. The first processor has a "state" and keeps information about already processed files, so it won't consume the same file multiple times. Besides, this approach is also recommended because it allows better load distribution when you have a NiFi cluster.
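To make the list-then-fetch idea concrete, here is a minimal Python sketch of the pattern. It is only a conceptual model, not NiFi code: the `Lister` class and `fetch` function are hypothetical names, and the real ListFile state handling differs in detail. It shows that the listing step remembers what it has already emitted, while the fetch step reads content and leaves the source file in place.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class Lister:
    """Hypothetical stand-in for a stateful listing step (not NiFi code)."""
    directory: str
    # State: (path, modification time) pairs that were already listed.
    seen: set = field(default_factory=set)

    def list_new(self):
        """Return paths not listed before; file content is not read here."""
        new_paths = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if not os.path.isfile(path):
                continue
            key = (path, os.path.getmtime(path))
            if key not in self.seen:
                self.seen.add(key)
                new_paths.append(path)
        return new_paths


def fetch(path):
    """Hypothetical fetch step: read the content, keep the source file."""
    with open(path, "rb") as fh:
        return fh.read()
```

Calling `list_new()` a second time on an unchanged directory returns an empty list, which is the "copy just once, keep the source files" behaviour asked for in the question.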

Hope this helps.

Re: How to know if files are copied to hdfs?

In addition to @Pierre Villard's suggestion: PutHDFS transfers flow files that have been successfully written to HDFS to the "success" relationship. You can therefore put a processor downstream from PutHDFS (along the "success" relationship); at that point you can be sure that the file has been written to HDFS and can proceed accordingly.
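As a rough illustration of routing on a "success" relationship, here is a small Python sketch. It is a conceptual model only: `put_file`, `run`, and the dict used as a stand-in for the target store are all hypothetical and have nothing to do with the actual PutHDFS implementation. The point is that the downstream step only ever sees items whose write was confirmed.

```python
def put_file(store, name, content):
    """Hypothetical write step: returns the relationship to route to."""
    try:
        # bytes() raises TypeError for content that cannot be written (e.g. None).
        store[name] = bytes(content)
        return "success"
    except TypeError:
        return "failure"


def run(files, store):
    """Write each (name, content) pair; collect names routed to "success"."""
    confirmed = []
    for name, content in files:
        if put_file(store, name, content) == "success":
            # Downstream of "success": the write is known to have completed.
            confirmed.append(name)
    return confirmed
```

This confirms each file individually as it lands; it does not by itself tell you when the whole batch is finished.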

Re: How to know if files are copied to hdfs?

@Matt Burgess Yes, but using the "success" relationship I would only know that the current (single) flowfile has been written successfully to HDFS. How would I know when all my files have finished processing, exactly once?

Re: How to know if files are copied to hdfs?

@Matt Burgess Did you find any solution to check whether all files are copied successfully?