Clarification needed on transferring data from a Linux box to HDFS using NiFi without pulling the same data repeatedly.

I am very new to NiFi. We are trying to load data received from an external system on a daily basis. Once data ingestion is complete, the data is processed using Hive tables through an Oozie workflow. Below are the steps we follow for data ingestion. Kindly let us know whether this is the right approach to keep the flow stable and to prevent NiFi from duplicating any data.

1. We receive data from an external system on a daily basis in four different feeds. This is a daily batch process.

e.g. table1yyyymmdd, table2yyyymmdd, table3yyyymmdd and table4yyyymmdd.

2. The four feeds above are dropped into a folder named source on the Linux box where NiFi is also installed.

3. We created a process group named Data_Load in NiFi.

4. We move data from the Linux folder to an HDFS folder named unprocessed using the NiFi flow below, inside the process group Data_Load:

ListFile -> FetchFile -> PutHDFS

5. We then move the data from the HDFS folder unprocessed to another HDFS folder named processed using the NiFi flow below, also inside the process group Data_Load:

ListHDFS -> FetchHDFS -> PutHDFS

Note: The HDFS folder unprocessed holds the complete data set. The HDFS folder processed holds only one day of data; the previous day's data is deleted by the Oozie workflow once data processing is complete.

Kindly let us know whether this NiFi ingestion design has any flaw that would affect stability or cause NiFi to duplicate data.



Both the ListFile and ListHDFS processors use NiFi's state management to prevent duplication of data. From the ListHDFS documentation (the same applies, in essence, to ListFile):

after performing a listing of HDFS files, the timestamp of the newest file is stored, along with the filenames of all files that share that same timestamp. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
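To make the quoted behavior concrete, here is a minimal Python sketch of the timestamp-plus-filenames bookkeeping it describes. This is only an illustration, not NiFi's actual code: `ListingState`, `list_new_files`, the file names and the integer timestamps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ListingState:
    """Hypothetical stand-in for the state a List* processor stores."""
    latest_timestamp: int = -1                            # newest timestamp seen so far
    names_at_latest: set = field(default_factory=set)    # files sharing that timestamp


def list_new_files(entries, state):
    """Return (sorted names not listed before, updated state).

    entries: iterable of (name, modified_timestamp) pairs for the whole folder.
    """
    new_files = []
    for name, ts in entries:
        if ts > state.latest_timestamp:
            # Strictly newer than anything listed before.
            new_files.append((name, ts))
        elif ts == state.latest_timestamp and name not in state.names_at_latest:
            # Same timestamp as the newest known file, but not yet listed.
            new_files.append((name, ts))
    if not new_files:
        return [], state

    newest = max(ts for _, ts in new_files)
    names = {name for name, ts in new_files if ts == newest}
    if newest == state.latest_timestamp:
        # Timestamp unchanged: keep remembering the names already listed.
        names |= state.names_at_latest
    return sorted(name for name, _ in new_files), ListingState(newest, names)


# Made-up file names and timestamps, for illustration only.
state = ListingState()
day1 = [("table1_20240101", 100), ("table2_20240101", 100)]
first_run, state = list_new_files(day1, state)

# Listing the unchanged folder again yields nothing new.
second_run, state = list_new_files(day1, state)

# Next day a newer file arrives; only that file is listed.
day2 = day1 + [("table1_20240102", 200)]
third_run, state = list_new_files(day2, state)

# A file that shares the newest timestamp but was not listed yet is still picked up.
day2_late = day2 + [("table2_20240102", 200)]
fourth_run, state = list_new_files(day2_late, state)
```

The point of remembering the filenames at the newest timestamp is the last case: a file that lands with exactly the same timestamp as the previous newest file is still listed once, and never twice.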

Please note it's important that these List* processors run on the Primary Node only. With this design we can ensure that the Fetch* processors that run on each cluster node will each process a distinct partition of the file listing.
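As a rough illustration of what "a distinct partition of the file listing" means, the sketch below assigns each listed file to exactly one node in round-robin fashion. It is a simplification under my own assumptions: `partition_listing`, the node count and the file names are hypothetical. In a real cluster the distribution is something you configure in the flow between the List* and Fetch* processors (as far as I know, typically with a load-balanced connection or a Remote Process Group), not something you write yourself.

```python
def partition_listing(filenames, node_count):
    """Assign every listed file to exactly one node, round-robin."""
    partitions = [[] for _ in range(node_count)]
    for index, name in enumerate(filenames):
        partitions[index % node_count].append(name)
    return partitions


# The single listing produced on the Primary Node (made-up names).
listing = ["table1", "table2", "table3", "table4"]

# Each of three hypothetical nodes fetches only its own share.
parts = partition_listing(listing, 3)
for node_id, files in enumerate(parts):
    print(f"node {node_id} fetches {files}")
```

Because the listing is produced once, on one node, no file can end up in two partitions. If every node ran its own List* processor instead, each node would list (and fetch) the same files.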
