Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

FetchHDFS Process to fetch Nested data in HDFS

Rising Star

Hi All,

I want to fetch data stored in HDFS using the FetchHDFS processor.

The folder structure for our data is /MajorData/Location/Year/Month/Day/file1.txt (e.g. /MajorData/Location/2017/01/01/file1.txt). As the day changes, the path changes accordingly, e.g. to /MajorData/Location/2017/01/02/file2.txt.

How can I write a NiFi expression that will traverse all of these folders and fetch the data into NiFi?

1 ACCEPTED SOLUTION

Master Mentor

@Akash S

The ListHDFS processor records state so that only new files are listed. The processor also has a configuration option for recursing subdirectories. You could set the directory to only /MajorData/Location/ and let it list all files from the subdirectories. As new subdirectories are created, the files within those new directories will get listed.
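As a sketch, the recursive-listing approach described above would be configured roughly like this (property names as they appear in the ListHDFS configuration tab; verify against your NiFi release):

```
ListHDFS processor properties (sketch):
  Directory              : /MajorData/Location
  Recurse Subdirectories : true
```

ListHDFS then emits zero-byte FlowFiles for each newly discovered file, which are typically routed to FetchHDFS to pull the actual content.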

If that does not work for you, the NiFi expression language (EL) statement that you are looking for would look something like this for the directory:

/MajorData/Location/${now():format('yyyy/MM/dd')}
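For illustration only, the path that EL statement produces can be mimicked in Python, where strftime plays the role of NiFi's format() function (the base path is the example from this thread):

```python
from datetime import datetime

# Sketch: mirrors the NiFi EL ${now():format('yyyy/MM/dd')} by appending
# the current date, formatted as yyyy/MM/dd, to the base directory.
base_dir = "/MajorData/Location"
directory = f"{base_dir}/{datetime.now().strftime('%Y/%m/%d')}"
print(directory)  # e.g. /MajorData/Location/2017/01/01
```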

That EL statement would cause NiFi to look only in the current day's target directory until the day changed. I am not sure of the rate at which files are written into these target directories, but be mindful that if a file is added between runs of the ListHDFS processor and the day changes between those runs, that file will not get listed using the above EL statement.
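One way to mitigate that midnight edge case (my suggestion, not something tested here) is a second ListHDFS instance pointed at yesterday's directory, using EL date arithmetic on epoch milliseconds (86400000 ms = 24 hours):

/MajorData/Location/${now():toNumber():minus(86400000):format('yyyy/MM/dd')}

Since ListHDFS keeps state, files already listed from that directory would not be emitted twice.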

Thanks,

Matt


2 REPLIES


Rising Star

Thank you Matt, ListHDFS was a good hint. I was able to accomplish my task with your inputs.