Support Questions

salome_tkhilais · ‎09-15-2017

I want to read some files that are stored in an HDFS directory, and I want to use the ListHDFS processor for that. I have a question about how it behaves:

When I start the ListHDFS processor, it picks up all the files in the directory. If I then stop the processor and start it again, will it pick up only the files that were recently added to the directory, or all the files in the directory?

Shu_ashu · ‎09-15-2017

Hi @sally sally, the ListHDFS processor is designed to store the state of its last listing.
When you configure the ListHDFS processor, you specify the directory name in its properties. Once the processor has listed all the files that exist in that directory at that time, it stores as its state the latest file timestamp, i.e. the time the newest file was stored in HDFS. You can see the state info by clicking View state.

If you want to clear the state, open View state and click Clear state.

Once the state is saved, each scheduled run of the processor (cron or timer driven) checks only for new files with a timestamp later than the one in the state.

Note:- ListHDFS runs on the primary node only, but its state value is stored across all the nodes of the NiFi cluster. If the primary node changes, the new primary node continues from the same state, so there won't be any issues with duplicates.

Example:-

hadoop fs -ls /user/yashu/test/
Found 1 items
-rw-r--r--   3 yash hdfs          3 2017-09-15 16:16 /user/yashu/test/part1.txt

When I configure the ListHDFS processor to list all the files in the above directory, the state of the processor should match the time part1.txt was stored in HDFS. In our case that is:

 2017-09-15 16:16

The state holds this value as Unix time in milliseconds. Converted to a date-time format, it reads:

Unixtime in milliseconds:- 1505506613479
Timestamp               :- 2017-09-15 16:16:53
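
If you want to check the conversion yourself, here is a minimal Python sketch. The UTC-4 offset is an assumption inferred from the two values shown above (the thread does not state the cluster's time zone); in UTC the same instant is 20:16:53.

```python
from datetime import datetime, timedelta, timezone

# State value shown in the ListHDFS "View state" dialog (Unix time in milliseconds).
state_ms = 1505506613479

# Interpret the value as an instant in UTC.
utc_time = datetime.fromtimestamp(state_ms / 1000, tz=timezone.utc)

# Assumed offset of the cluster's local time (UTC-4); adjust for your environment.
local_time = utc_time.astimezone(timezone(timedelta(hours=-4)))

print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))    # 2017-09-15 20:16:53
print(local_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2017-09-15 16:16:53
```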

So the processor has stored the state. When it runs again, it lists only the new files that were stored in the directory after the state timestamp, and then updates the state with the new timestamp (i.e. the latest file time in the Hadoop directory).
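
The behaviour described above can be modelled in a few lines of Python. This is only a simplified sketch of the idea, not NiFi's actual implementation: the names `HdfsFile` and `list_new_files` are hypothetical, the second file and its timestamp are made up for illustration, and the real processor handles edge cases (such as files sharing the same timestamp) that are ignored here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HdfsFile:
    """Hypothetical stand-in for one entry of an HDFS directory listing."""
    path: str
    modified_ms: int  # file timestamp as Unix time in milliseconds


def list_new_files(
    files: List[HdfsFile], state_ms: Optional[int]
) -> Tuple[List[HdfsFile], Optional[int]]:
    """Return the files newer than the stored state, plus the updated state.

    A state of None models a processor that has never run or whose
    state has been cleared, so every file is listed.
    """
    new_files = [f for f in files if state_ms is None or f.modified_ms > state_ms]
    if not new_files:
        return [], state_ms  # nothing new: the state stays unchanged
    return new_files, max(f.modified_ms for f in new_files)


directory = [HdfsFile("/user/yashu/test/part1.txt", 1505506613479)]

# First run: no state yet, so part1.txt is listed and its timestamp is stored.
first, state = list_new_files(directory, None)

# Stop and start again without clearing state: nothing is listed again.
second, state = list_new_files(directory, state)

# A new (hypothetical) file arrives later: only that file is listed.
directory.append(HdfsFile("/user/yashu/test/part2.txt", 1505506700000))
third, state = list_new_files(directory, state)

# Clearing the state makes the next run list everything again.
after_clear, _ = list_new_files(directory, None)

print([f.path for f in first], [f.path for f in second], [f.path for f in third])
```

In this model, stopping and restarting the processor changes nothing, because the stored timestamp survives the restart. Only clearing the state causes a full re-listing.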


Wynner · ‎09-15-2017

@sally sally

The processor will only list the files which were not included in the first listing it created.

MattWho · ‎09-15-2017

In order to have listing start over again, you would need to perform the following:

1. Open the "Component State" UI by right-clicking the ListHDFS processor and selecting "View state".

2. Within that UI you will see a blue "Clear state" link, which clears the currently retained state.
