GETHDFS recrawling problem


Expert Contributor

Hello

Every time I hit an error in my NiFi workflow, the GetHDFS processor recrawls the HDFS directory right from the beginning. I want to keep the files where they are in HDFS (Keep Source File = true).

How can I have the GetHDFS processor continue from where it stopped?

Thanks

1 ACCEPTED SOLUTION


Re: GETHDFS recrawling problem

Master Guru
@Ahmad Debbas

The GetHDFS processor is deprecated in favor of the ListHDFS and FetchHDFS processors. GetHDFS does not retain state, so, as you noted, it starts over from the beginning when an error occurs. ListHDFS does maintain state, so the listing picks up where it left off even across NiFi restarts or processor restarts. The zero-byte FlowFiles it produces are then passed to FetchHDFS, which retrieves the content and inserts it into the existing FlowFile.
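To make the idea of a stateful listing concrete, here is a minimal Python sketch. It is not NiFi's implementation: it uses a local directory in place of HDFS, and the function name `list_new_files`, the JSON state file, and the `latest_mtime_ns` key are all hypothetical names chosen for illustration. It only shows why a persisted watermark lets a listing resume instead of starting over.

```python
import json
from pathlib import Path


def list_new_files(directory: Path, state_file: Path) -> list[Path]:
    """Return files modified after the stored watermark, then advance it.

    Hypothetical helper: a local directory stands in for HDFS and a JSON
    file stands in for the processor's persisted state.
    """
    # Load the last-seen modification time; 0 means "never listed before".
    watermark = 0
    if state_file.exists():
        watermark = json.loads(state_file.read_text())["latest_mtime_ns"]

    # Collect (mtime, path) pairs for regular files only.
    entries = [
        (p.stat().st_mtime_ns, p) for p in directory.iterdir() if p.is_file()
    ]

    # Keep only files newer than the watermark, oldest first.
    new = sorted((mtime, p) for mtime, p in entries if mtime > watermark)

    # Persist the newest timestamp so the next call skips these files,
    # even if the process was restarted in between.
    if new:
        state_file.write_text(json.dumps({"latest_mtime_ns": new[-1][0]}))

    return [p for _, p in new]
```

Calling `list_new_files` twice on an unchanged directory returns the files the first time and an empty list the second time. A stateless "get" has no watermark to consult, which is why it recrawls everything. Keep the state file outside the listed directory so it is not picked up as data.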

Another advantage of the list/fetch design is that the listed zero-byte FlowFiles can be distributed across a NiFi cluster before the content is fetched. This improves performance by spreading the fetch work over all nodes instead of straining the single NiFi node that would run GetHDFS.
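The distribution step can be pictured with a small Python sketch. This is only an illustration of the pattern, not how NiFi does it internally: the `distribute` function, the node names, and the round-robin strategy are assumptions. The point is that a listing contains only paths, so it is cheap to split before any content is read.

```python
from itertools import cycle


def distribute(listing: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Assign listed paths to nodes round-robin (hypothetical strategy)."""
    if not nodes:
        raise ValueError("at least one node is required")

    # Each node starts with an empty work queue.
    assignments: dict[str, list[str]] = {node: [] for node in nodes}

    # Hand out paths one at a time, cycling through the nodes.
    for path, node in zip(listing, cycle(nodes)):
        assignments[node].append(path)

    return assignments
```

Each node would then fetch only the paths in its own queue, so no single node has to read every file.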

Thanks,

Matt



Re: GETHDFS recrawling problem

Expert Contributor

I am trying the ListHDFS processor, but for some reason it is only retrieving around 5000 files.
