Created 10-23-2017 05:25 PM
Hi All,
Thanks a lot to this community.
I have a NiFi cluster, and on one of the nodes there is a PHP cron script which creates files in a directory. To ingest these files, my research on this community suggests I should use the ListFile processor on the node where the files are generated, and then use a Remote Process Group feeding an Input Port followed by a FetchFile processor.
I read on this community that the ListFile processor should be used on the primary NiFi node. Is that necessary? If yes, the primary keeps changing: if the primary goes down, the newly elected primary will not be able to access the directory.
Or can I use it on one of the nodes regardless of which node is the primary?
I read this link
Thanks
Dheeru
Created 10-23-2017 06:36 PM
So according to this link,
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
I need to mount a shared network drive on all nodes, so that each of the nodes has access, and then use the ListFile processor on the primary node followed by a Remote Process Group. After that, use an Input Port (the target of the Remote Process Group) followed by a FetchFile processor.
In this case, even if the primary changes, all the nodes in the cluster still have access to the network drive.
Can someone please correct me if I am wrong here?
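One way to double-check which scheduling a given ListFile processor actually has is the NiFi REST API. A rough sketch in Python, assuming the standard /nifi-api endpoint; the host, port, and processor ID below are placeholders, and a secured cluster would also need authentication:

```python
# Sketch: read a processor's execution-node setting from the NiFi REST API.
# Host/port and processor ID are placeholders for illustration only.
import json
import urllib.request

NIFI_API = "http://nifi-node1:8080/nifi-api"              # hypothetical endpoint
PROCESSOR_ID = "0158abcd-0123-1000-ffff-ffffffffffff"     # hypothetical ListFile id

with urllib.request.urlopen(f"{NIFI_API}/processors/{PROCESSOR_ID}") as resp:
    processor = json.load(resp)

# "PRIMARY" means primary node only; "ALL" means the processor runs on every node.
print(processor["component"]["config"]["executionNode"])
```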
Thanks
Dheeru
Created 10-23-2017 07:05 PM
When you add a processor to a NiFi cluster, it is deployed on every node, but how it runs depends on which of two scheduling strategies you choose: primary node only, or all nodes.
The ListFile processor lists files local to a NiFi node. If you schedule it on the primary node only, then only the primary node lists the directory and continues to work on the resulting FlowFiles. If you schedule it on all nodes, each NiFi node lists its own local files and continues to work on them locally. If you need to distribute files between nodes, then you need to use Site-to-Site (S2S) with a Remote Process Group.
You need to understand this against your use case and plan accordingly to avoid data duplication and data loss.
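To make the duplication risk concrete, here is a minimal plain-Python sketch (not NiFi code; the directory and node names are placeholders). With "all nodes" scheduling against a directory every node can see, each node produces the same listing, so every file would be processed once per node; with "primary node only", a single node lists:

```python
# Sketch: why the scheduling strategy matters when all nodes see the same directory.
import os

SHARED_DIR = "/mnt/shared/incoming"   # hypothetical mount visible to all nodes
NODES = ["node1", "node2", "node3"]   # hypothetical cluster members

def list_local(directory):
    """Roughly what a listing does: return the file names currently present."""
    return sorted(f for f in os.listdir(directory)
                  if os.path.isfile(os.path.join(directory, f)))

# "All nodes" scheduling: every node lists the same files -> triple processing.
all_nodes_listings = {node: list_local(SHARED_DIR) for node in NODES}

# "Primary node only" scheduling: one node lists, so each file is listed once.
primary_only_listing = {"node1 (primary)": list_local(SHARED_DIR)}

print(all_nodes_listings)
print(primary_only_listing)
```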
I hope this is helpful.
Thanks
Created 10-23-2017 07:46 PM
Hi @Abdelkrim Hadjidj, thanks for the response. Appreciate it.
So in order to plan for failover and HA, I need to mount a network drive which is visible/accessible to all the nodes in the NiFi cluster, but schedule the ListFile processor to run on the primary node only. If a failure happens, a new primary will be elected and it will resume listing files, since it also has access to the network location.
Then basically use an RPG to distribute the files for further processing and saving them to HDFS.
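As a sanity check before relying on this for failover, something like the following sketch run on every node would confirm the shared location is reachable (plain Python; the mount path is a placeholder):

```python
# Sketch: confirm the shared mount is visible and readable from this node.
import os
import socket

SHARED_DIR = "/mnt/shared/incoming"   # hypothetical network mount point

node = socket.gethostname()
if os.path.isdir(SHARED_DIR) and os.access(SHARED_DIR, os.R_OK):
    print(f"{node}: can read {SHARED_DIR} ({len(os.listdir(SHARED_DIR))} entries)")
else:
    print(f"{node}: cannot read {SHARED_DIR} -- fix the mount before failover depends on it")
```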
Is my understanding correct?
Thanks
Dheeru
Created 10-23-2017 08:40 PM
The ListFile processor only lists the files; it emits FlowFiles that carry attributes (absolute.path, filename, etc.) but no content. We then make use of these attributes in the FetchFile processor to do the actual fetch of the data.
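As a rough plain-Python illustration of that list/fetch split (not NiFi internals; the attribute values below are made up for the example):

```python
# Sketch: a ListFile-style FlowFile carries only attributes; the fetch step
# reads the content using those attributes, much like FetchFile's default
# "File to Fetch" of ${absolute.path}/${filename}.
import os

flowfile_attributes = {
    "absolute.path": "/mnt/shared/incoming/",   # hypothetical listed directory
    "filename": "report_20171023.csv",          # hypothetical listed file
}

full_path = os.path.join(flowfile_attributes["absolute.path"],
                         flowfile_attributes["filename"])

with open(full_path, "rb") as f:
    content = f.read()   # this is the point where the data is actually read

print(f"Fetched {len(content)} bytes from {full_path}")
```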
So when both NiFi nodes have access to your network location, then if the primary node changes, the new primary node will pick up where the previous node left off without duplicating any of the data.
For example:
Consider that you have given directory access to only node 1, which is the primary node right now.
Node 1 (the primary) lists files in the directory until 10/23/2017 16:30, and then the primary role moves to another node (node 2).
Now node 2 is the primary, and since the ListFile processor is configured to run on the primary node only, node 2 tries to list only the files created after 10/23/2017 16:30. But node 2 won't have access, because the directory was shared only with node 1, so the processor throws an error (the current primary node cannot reach the directory).
So we need access to the directories from both nodes (1 and 2); then, if the primary node changes, the new primary node picks up where the old primary left off and keeps listing files from the directories.
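A minimal sketch of that handoff in plain Python (conceptual only; NiFi's actual cluster state handling is different, and the directory path is a placeholder):

```python
# Sketch: a timestamp watermark stored in shared state lets the new primary
# list only files created after the old primary's last run, avoiding duplicates.
import os

SHARED_DIR = "/mnt/shared/incoming"   # hypothetical mount visible to all nodes

def list_new_files(directory, last_listed_ts):
    """Return files modified after last_listed_ts, plus the updated watermark."""
    new_files, watermark = [], last_listed_ts
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            mtime = os.path.getmtime(path)
            if mtime > last_listed_ts:
                new_files.append(name)
                watermark = max(watermark, mtime)
    return new_files, watermark

# Node 1 (old primary) lists up to some point; the watermark lives in shared state.
files_node1, state = list_new_files(SHARED_DIR, 0.0)

# Node 2 becomes primary, reuses the shared watermark, and only sees newer files,
# provided it can read the same directory.
files_node2, state = list_new_files(SHARED_DIR, state)
print(files_node1, files_node2, sep="\n")
```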