Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

GETHDFS recrawling problem

avatar
Expert Contributor

Hello

Everytime i face an error in my nifi workflow, the gethdfs processor recrawls the hdfs directory right from the beginning. I want to keep the files where they are in hdfs (keep source file = true)

How can i have the gethdfs processor continue from where it stopped?

Thanks

1 ACCEPTED SOLUTION

avatar
Super Mentor
@Ahmad Debbas

The GetHDFS processor is deprecated in favor of using ListHDFs and FetchHDFS processors. The GetHDFS processor does not retain state and therefore will start over from the beginning as you noted when an error occurs. The ListHDFS processor does maintain state, so even through NiFi restarts or processor restarts, the listing picks up where it left off. The zero byte FlowFiles produced are then passed to a FetchHDFS that actually retrieves the content and inserts it into the existing FlowFile.

Another advantage to the list/fetch design model is the ability to distribute those listed zero byte files across a Nifi cluster before fetching the content. This improves performance by reducing resource strain caused by GetHDFS on a single NiFi node.

Thanks,

Matt

View solution in original post

2 REPLIES 2

avatar
Super Mentor
@Ahmad Debbas

The GetHDFS processor is deprecated in favor of using ListHDFs and FetchHDFS processors. The GetHDFS processor does not retain state and therefore will start over from the beginning as you noted when an error occurs. The ListHDFS processor does maintain state, so even through NiFi restarts or processor restarts, the listing picks up where it left off. The zero byte FlowFiles produced are then passed to a FetchHDFS that actually retrieves the content and inserts it into the existing FlowFile.

Another advantage to the list/fetch design model is the ability to distribute those listed zero byte files across a Nifi cluster before fetching the content. This improves performance by reducing resource strain caused by GetHDFS on a single NiFi node.

Thanks,

Matt

avatar
Expert Contributor

i am trying the listhdfs processor, for some reason it is only retrieving around 5000 files