
Nifi cluster, list files from one of the nodes in the cluster and use fetch files

Expert Contributor

Hi All,

Thanks a lot to this community.

I have a NiFi cluster. On one of the nodes in the cluster, there is a PHP cron script which creates files in a directory. To ingest these files I researched on this community, and to my understanding I should use the ListFile processor on the node on which the files are generated and then use a remote process group to feed the FetchFile processor.

I read on this community that the ListFile processor should run on the primary NiFi node. Is that necessary? If yes, the primary changes whenever the current primary goes down, and the newly elected primary will not be able to access the directory.

Or can I run it on one specific node, regardless of which node is primary?

I read this link

https://community.hortonworks.com/questions/109247/how-to-distribute-files-on-nifi-cluster-and-proce...

Thanks

Dheeru

1 ACCEPTED SOLUTION


Hi @dhieru singh

When you add a processor to a NiFi cluster, it is deployed on every node, but where it actually runs depends on its scheduling:

  • If you set scheduling to primary node, the processor is active only on the primary node. If the primary node goes down, NiFi elects a new primary node and the processor is activated on that node.
  • If you set scheduling to all nodes, the processor is enabled on every node in the cluster.

The ListFile processor lists files local to a NiFi node. If you use it with primary-only scheduling, only the primary node lists the directory and goes on to process the listed files. If you use it with all-nodes scheduling, each NiFi node lists its own local files and processes them locally. If you need to distribute files between nodes, you need to use S2S (site-to-site) with a remote process group.

You need to understand this together with your use case, and plan accordingly to avoid data duplication and data loss.
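To make the two scheduling modes concrete, here is a minimal Python sketch of the behaviour described above. It is only a toy model, not NiFi code: the `Node` class, the function names and the file names are all hypothetical, and it assumes the files exist only on node1's local disk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one NiFi node and the files on its local disk."""
    name: str
    local_files: list = field(default_factory=list)


def nodes_that_list(nodes, primary, execution):
    """Return the names of the nodes on which a ListFile-like processor runs."""
    if execution == "primary":
        return [primary]
    return [node.name for node in nodes]


def listed_files(nodes, primary, execution):
    """Map each active node to the local files it would list."""
    active = set(nodes_that_list(nodes, primary, execution))
    return {node.name: list(node.local_files) for node in nodes if node.name in active}


nodes = [
    Node("node1", ["a.csv", "b.csv"]),  # assumption: the cron job writes only here
    Node("node2"),
    Node("node3"),
]

# Primary-only scheduling while node1 is primary: the files are found.
print(listed_files(nodes, primary="node1", execution="primary"))
# Primary-only scheduling after node2 takes over: nothing to list locally.
print(listed_files(nodes, primary="node2", execution="primary"))
# All-nodes scheduling: every node lists its own local directory.
print(listed_files(nodes, primary="node1", execution="all"))
```

The second call is the case to watch for: with files on a single node's local disk, a change of primary leaves the new primary with nothing it can list.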

I hope this is helpful.

Thanks


5 REPLIES 5

Expert Contributor

So according to this link,

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

I need to mount a shared network drive on all nodes so that each node has access, then use the ListFile processor on the primary node followed by a remote process group. After that, use an input port fed by the remote process group, followed by FetchFile.

In this case, even if the primary changes, all the nodes in the cluster still have access to the network drive.

Can someone please correct me if I am wrong here?

Thanks

Dheiru

Expert Contributor

Hi @Abdelkrim Hadjidj Thanks for the response. Appreciate it.

So in order to plan for failover and HA, I need to mount a network drive which is visible and accessible to all the nodes in the NiFi cluster, but schedule the ListFile processor to run on the primary node only. If a failure happens, a new primary will be elected and it will start listing files, since it has access to the network location.

And then basically use an RPG to distribute the files for further processing and saving to HDFS.

Is my understanding correct?

Thanks

Dhieru

Master Guru

@dhieru singh

The ListFile processor only lists the files: each flowfile it produces is just a set of attributes (absolute.path, filename, etc.) with no file content. The FetchFile processor then uses these attributes to do the actual fetch of the data.

So when both NiFi nodes have access to your network location and the primary node changes, the new primary node will pick up where the previous node left off without duplicating all of the data.

For example:

Suppose you have given directory access to only one node (node 1), which is currently the primary node.

The primary node (node 1) has listed files in the directory until 10/23/2017 16:30, and then the primary role moves to another node (node 2).

Now node 2 is the primary node, and since our ListFile processor is configured to run only on the primary node, node 2 tries to access the directory to list only the files created after 10/23/2017 16:30. Node 2 does not have access, because we gave directory access only to node 1, so the processor throws an error.

So both nodes (1 and 2) need access to the directory. Then, if the primary node changes, the new primary node picks up where the old primary node left off and continues listing files from the directory.
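As a rough illustration of this list-then-fetch split, here is a small Python sketch. It is not NiFi code: `list_files` and `fetch_file` are hypothetical helpers that only mimic the idea, reusing the `absolute.path` and `filename` attribute names mentioned above, and the temporary directory and `report.txt` are made up for the demo.

```python
import os
import tempfile


def list_files(directory):
    """Emit one metadata record per file; no file content is read here."""
    records = []
    for name in sorted(os.listdir(directory)):
        if os.path.isfile(os.path.join(directory, name)):
            records.append({"absolute.path": directory, "filename": name})
    return records


def fetch_file(record):
    """Read the content that a listing record points to."""
    path = os.path.join(record["absolute.path"], record["filename"])
    with open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as directory:
    # Stand-in for a file produced by the cron job.
    with open(os.path.join(directory, "report.txt"), "wb") as handle:
        handle.write(b"hello")

    listing = list_files(directory)                         # cheap: metadata only
    contents = [fetch_file(record) for record in listing]   # the actual data read
```

Because the listing records are small, they are cheap to hand to other nodes, and whichever node receives a record does the expensive read, provided that node can reach the path.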
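The failover example can be sketched in Python as below. This is a simplified model under stated assumptions, not how NiFi is implemented: `SharedListingState` is a hypothetical stand-in for listing state that every node can read, timestamps are plain integers, and the real processor's state tracking is more involved.

```python
class SharedListingState:
    """Hypothetical cluster-wide record of the newest timestamp already listed."""

    def __init__(self):
        self.last_listed = 0


def list_new_files(node, nodes_with_access, directory, state):
    """List files newer than the stored timestamp, as seen from `node`.

    `directory` maps filename -> modification time (an integer in this sketch).
    """
    if node not in nodes_with_access:
        raise PermissionError(f"{node} cannot read the directory")
    new = sorted(name for name, mtime in directory.items() if mtime > state.last_listed)
    if new:
        state.last_listed = max(directory[name] for name in new)
    return new


directory = {"a.csv": 100, "b.csv": 200}
state = SharedListingState()

# node1 is primary and lists everything present so far.
first = list_new_files("node1", {"node1", "node2"}, directory, state)

# A new file arrives, then node2 becomes primary. It has access, so it
# resumes from the stored timestamp and lists only the new file.
directory["c.csv"] = 300
after_failover = list_new_files("node2", {"node1", "node2"}, directory, state)

# If only node1 had been given access, the new primary would fail instead.
try:
    list_new_files("node2", {"node1"}, directory, state)
    failed = False
except PermissionError:
    failed = True
```

The two conditions in the example show up as the two arguments: the new primary needs access to the directory, and it needs to see the timestamp the old primary left behind.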

Expert Contributor

@Shu

Thanks for the explanation. It is very helpful, appreciate it.

Dhieru