Created 09-15-2017 06:47 PM
I want to read some files (which are placed in an HDFS directory) and I want to use the ListHDFS processor for it. There are several questions I am interested in:
Created on 09-15-2017 08:46 PM - edited 08-17-2019 11:37 PM
Hi @sally sally, the ListHDFS processor is designed to store its last state.
i.e. when you configure the ListHDFS processor, you specify the directory name in its properties. Once the processor has listed all the files that exist in that directory, it stores, as state, the maximum timestamp at which a file was written into HDFS. You can view the state info by clicking the "View state" button.
If you want to clear the state, open "View state" and click "Clear state".
2. Once the state is saved by the ListHDFS processor, subsequent runs (whether scheduled as cron or timer driven) only check for new files written after the state timestamp.
Note: ListHDFS runs on the primary node only, but the state value is stored across all the nodes of the NiFi cluster, so even if the primary node changes there won't be any issues with duplicates.
Example:-
hadoop fs -ls /user/yashu/test/
Found 1 items
-rw-r--r--   3 yash hdfs          3 2017-09-15 16:16 /user/yashu/test/part1.txt
When I configure the ListHDFS processor to list all the files in the above directory, the state of the ListHDFS processor should match the time when part1.txt was stored in HDFS, in our case
2017-09-15 16:16
The state itself is stored as Unix time in milliseconds; converting that state value to date/time format gives:
Unix time in milliseconds: 1505506613479
Timestamp: 2017-09-15 16:16:53
So the processor has stored its state. When it runs again, it lists only the new files stored into the directory after the state timestamp, and updates the state with the new state time (i.e. the maximum file creation time in the Hadoop directory).
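The epoch-to-timestamp conversion above can be reproduced from the command line. A minimal sketch using GNU `date` (the rendered time depends on your machine's timezone; US Eastern is assumed here to match the example):

```shell
# State value stored by ListHDFS: Unix time in milliseconds
STATE_MS=1505506613479

# Drop the milliseconds, then render the epoch seconds as a
# human-readable timestamp. With TZ=America/New_York this prints
# 2017-09-15 16:16:53, matching the HDFS listing above.
TZ=America/New_York date -d "@$((STATE_MS / 1000))" '+%Y-%m-%d %H:%M:%S'
```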
Created 09-15-2017 08:19 PM
The processor will only list the files which were not included in the first listing it created.
Created 09-15-2017 08:39 PM
In order to have listing start over again, you would need to perform the following:
1. Open the "Component State" UI by right-clicking on the ListHDFS processor and selecting "View state".
2. Within that UI you will see a blue "Clear state" link, which will clear the currently retained state.
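The same clear-state action can also be scripted against NiFi's REST API. This is only a sketch: the host, port, and processor UUID below are placeholder assumptions, and you should verify the endpoint against the REST API docs for your NiFi version (the processor must be stopped before its state can be cleared):

```shell
# Placeholders - substitute your own NiFi base URL and processor UUID.
NIFI=http://localhost:8080/nifi-api
PROC_ID=01234567-89ab-cdef-0123-456789abcdef

# POST to the processor's state/clear-requests sub-resource to wipe
# its stored state (equivalent to clicking "Clear state" in the UI).
curl -X POST "$NIFI/processors/$PROC_ID/state/clear-requests"
```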