FetchHDFS Process to fetch Nested data in HDFS

Rising Star

Hi All,

I want to fetch data that is stored in HDFS using the FetchHDFS processor.

Our data is stored in a folder structure like /MajorData/Location/Year/Month/Day/file1.txt (for example, /MajorData/Location/2017/01/01/file1.txt). As the day changes, the path changes to /MajorData/Location/2017/01/02/file2.txt.

How can I write a NiFi expression that traverses all of these folders and fetches the data into NiFi?

1 ACCEPTED SOLUTION

Master Mentor

@Akash S

The ListHDFS processor records state so that only new files are listed. The processor also has a configuration option for recursing subdirectories. You could set the directory to only /MajorData/Location/ and let it list all files from the subdirectories. As new subdirectories are created, the files within those new directories will get listed.
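To make the idea of a recursive listing that remembers what it has already listed more concrete, here is a rough Python sketch that runs against a local directory tree. It is only an analogy: the function name list_new_files and the in-memory "seen" set are made up for illustration, and this is not how ListHDFS actually stores or evaluates its state.

```python
import os


def list_new_files(root, seen):
    """Recursively list files under root that an earlier call has not returned.

    `seen` is a plain set standing in for the processor's recorded state
    (an assumption made for this sketch only).
    """
    new_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                new_files.append(path)
    return sorted(new_files)
```

Calling it repeatedly with the same set returns only files that appeared since the previous call, including files in subdirectories that did not exist earlier, which is the behavior you want when a new Year/Month/Day folder shows up.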

If that does not work for you, the NiFi expression language (EL) statement that you are looking for would look something like this for the directory:

/MajorData/Location/${now():format('yyyy/MM/dd')}

The above would cause NiFi to look only in the current day's directory for files until the day changes. I am not sure of the rate at which files are written into these target directories, but be mindful that if a file is added between runs of the ListHDFS processor and the day changes between those runs, that file will not get listed using the above EL statement.
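The day-rollover caveat can be illustrated with a small Python sketch. It assumes a file lands in the directory for the date on which it was written, and that each listing run only looks in the directory for the clock time at which it runs. The helper names directory_for and picked_up, and the run times used below, are made up for illustration and are not part of NiFi.

```python
from datetime import datetime

BASE_DIR = "/MajorData/Location"  # base path from the question


def directory_for(moment):
    """Approximate /MajorData/Location/${now():format('yyyy/MM/dd')} for a given time."""
    return "{}/{}".format(BASE_DIR, moment.strftime("%Y/%m/%d"))


def picked_up(written_at, run_times):
    """Return True if some run at or after the write looks in the file's directory.

    Assumes the file is stored under the directory for its write date.
    """
    file_dir = directory_for(written_at)
    return any(
        run >= written_at and directory_for(run) == file_dir
        for run in run_times
    )


# Hypothetical schedule: one run shortly before midnight, the next shortly after.
runs = [datetime(2017, 1, 1, 23, 55), datetime(2017, 1, 2, 0, 5)]
print(directory_for(runs[0]))
print(picked_up(datetime(2017, 1, 1, 23, 50), runs))  # written before the 23:55 run
print(picked_up(datetime(2017, 1, 1, 23, 58), runs))  # written after it, before midnight
```

With this schedule, the file written at 23:58 is never seen: the only later run happens on the next day and looks in the 2017/01/02 directory. Listing /MajorData/Location/ recursively avoids this gap.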

Thanks,

Matt

2 REPLIES

Rising Star

Thank you Matt, ListHDFS was a good hint. I was able to accomplish my task with your input.