Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Nifi:How does ListHdfs processor work?

avatar
Contributor

I want to read some file (which are put in hdfs directory) and i want to use ListHdfs processor for it , there are several questions i am interested in:

  1. when i start listHdfs procesoor it will capture all files from directory and if i change it's state then( i mean i stop the processor) and then start it it again it willl take only those files whcih were put in dircetory recentrly or all files which are in directory?
1 ACCEPTED SOLUTION

avatar
Master Guru

Hi @sally sally, List Hdfs processor are developed as store the last state..
i.e when you configure ListHDFS processor you are going to specify directory name in properties. once the processor lists all the files existed in that directory at the time it will stores the state as maximum file time when it got stored into HDFS. you can view the state info by clicking on view state button.

40382-state-listhdfs.png

if you want to clear the state then you need to get into view state and click on clear the state.

40383-clear-state.png

2. so once it saves the state in listhdfs processor, if you are running the processor by scheduling as cron(or)timer driven it will only checks for the new files after the state timestamp.

Note:- as we are running ListHDFS on primary node only, but this state value will be stored across all the nodes of NiFi cluster as primary node got changed, there won't be any issues regarding duplicates.

Example:-

hadoop fs -ls /user/yashu/test/ Found 1 items
-rw-r--r--   3 yash hdfs          3 2017-09-15 16:16 /user/yashu/test/part1.txt

when i configure ListHDFS processor to list all the files in the above directory

40384-listhdfs-config.png

if you see the state of ListHDFS processor that should be same as when part1.txt got stored in HDFS in our case that should be

 2017-09-15 16:16

40385-state.png

it would be unix time in milliseconds when we convert the state time to date time format
that should be

Unixtime in milliseconds:- 1505506613479
Timestamp :- 2017-09-15 16:16:53

so the processor has stored the state, when it will run again it will lists only the new files that got stored after the state timestamp in to the directory and updates the state with new state time (i.e maximum file created in hadoop directory).

View solution in original post

3 REPLIES 3

avatar
@sally sally

The processor will only list the files which were not included in the first listing it created.

avatar
Master Mentor

In order to have listing start over again, you would need to perform the following:

1. Open "Component State" UI by right clicking on the listHDFS processor and select "view state".

2. Within that UI you will see a blue link "Clear state" which will clear the currentlr retained state.

avatar
Master Guru

Hi @sally sally, List Hdfs processor are developed as store the last state..
i.e when you configure ListHDFS processor you are going to specify directory name in properties. once the processor lists all the files existed in that directory at the time it will stores the state as maximum file time when it got stored into HDFS. you can view the state info by clicking on view state button.

40382-state-listhdfs.png

if you want to clear the state then you need to get into view state and click on clear the state.

40383-clear-state.png

2. so once it saves the state in listhdfs processor, if you are running the processor by scheduling as cron(or)timer driven it will only checks for the new files after the state timestamp.

Note:- as we are running ListHDFS on primary node only, but this state value will be stored across all the nodes of NiFi cluster as primary node got changed, there won't be any issues regarding duplicates.

Example:-

hadoop fs -ls /user/yashu/test/ Found 1 items
-rw-r--r--   3 yash hdfs          3 2017-09-15 16:16 /user/yashu/test/part1.txt

when i configure ListHDFS processor to list all the files in the above directory

40384-listhdfs-config.png

if you see the state of ListHDFS processor that should be same as when part1.txt got stored in HDFS in our case that should be

 2017-09-15 16:16

40385-state.png

it would be unix time in milliseconds when we convert the state time to date time format
that should be

Unixtime in milliseconds:- 1505506613479
Timestamp :- 2017-09-15 16:16:53

so the processor has stored the state, when it will run again it will lists only the new files that got stored after the state timestamp in to the directory and updates the state with new state time (i.e maximum file created in hadoop directory).