Created 09-29-2016 07:57 AM
Hi all,
In a cluster configuration, where can I see which server ingests the data with a processor such as GetHDFS, for example?
thanks
Created 09-29-2016 08:10 AM
The workflow displayed on the canvas is executed on each node of your cluster. Consequently, unless you have configured your GetHDFS processor to run on the primary node only (in the Scheduling tab of the processor configuration), every node of your cluster will try to get the files from HDFS. This can create race conditions, so you should set the processor to run on the primary node only. In that case, the cluster page shows which node has been elected as the primary node.
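If you prefer doing it programmatically, here is a minimal sketch assuming an unsecured NiFi 1.x cluster and its REST API; the host and processor id are placeholders to replace with your own:

```python
# Minimal sketch: set a processor to run on the primary node only through the
# NiFi REST API (assumes an unsecured NiFi 1.x cluster; host and processor id
# below are placeholders).
import requests

NIFI = "http://localhost:8080/nifi-api"
PROCESSOR_ID = "REPLACE-WITH-YOUR-GETHDFS-PROCESSOR-ID"

# Fetch the current entity to obtain the revision required for any update.
entity = requests.get(f"{NIFI}/processors/{PROCESSOR_ID}").json()

update = {
    "revision": entity["revision"],
    "component": {
        "id": PROCESSOR_ID,
        # Same effect as choosing "Primary node" in the Scheduling tab.
        "config": {"executionNode": "PRIMARY"},
    },
}
resp = requests.put(f"{NIFI}/processors/{PROCESSOR_ID}", json=update)
resp.raise_for_status()
print("Execution node:", resp.json()["component"]["config"]["executionNode"])
```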
In order to balance the load when getting files from HDFS, you may want to use the combination of the ListHDFS/FetchHDFS processors. ListHDFS creates one flow file per listed file, carrying the path of that file, and FetchHDFS actually retrieves the file from HDFS. By using a Remote Process Group on your canvas you can evenly spread the flow files across your nodes, so each node is assigned different files to fetch from HDFS.
You can find an example of what I mean at the end of this HCC article:
https://community.hortonworks.com/content/kbentry/55349/nifi-100-unsecured-cluster-setup.html
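As a complement, here is a minimal sketch of the FetchHDFS side of that pattern, again assuming an unsecured NiFi 1.x cluster and its REST API; the processor id is a placeholder, and the property/attribute names are the ones the standard ListHDFS/FetchHDFS processors typically use, so verify them against your version:

```python
# Minimal sketch: point FetchHDFS's "HDFS Filename" property at the
# path/filename attributes that ListHDFS writes on each flow file.
# (Unsecured NiFi 1.x assumed; host and processor id are placeholders.)
import requests

NIFI = "http://localhost:8080/nifi-api"
FETCH_HDFS_ID = "REPLACE-WITH-YOUR-FETCHHDFS-PROCESSOR-ID"

# Fetch the current entity to obtain the revision required for the update.
entity = requests.get(f"{NIFI}/processors/{FETCH_HDFS_ID}").json()

update = {
    "revision": entity["revision"],
    "component": {
        "id": FETCH_HDFS_ID,
        "config": {
            # Expression Language resolves per flow file, so each node only
            # fetches the files routed to it through the Remote Process Group.
            "properties": {"HDFS Filename": "${path}/${filename}"},
        },
    },
}
requests.put(f"{NIFI}/processors/{FETCH_HDFS_ID}", json=update).raise_for_status()
```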
Hope this helps.
Created 09-29-2016 08:48 AM
Thanks @Pierre Villard: I followed your tutorial to set up my cluster, but in my case the primary node and the cluster coordinator are always on the same node. Do you know if that is wrong?
Created 09-29-2016 08:50 AM
There is absolutely nothing wrong with having one node act as both cluster coordinator and primary node. These are two different roles that can be held by the same node.
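If you want to confirm the roles outside of the UI, here is a minimal sketch, assuming an unsecured NiFi 1.x cluster and its /controller/cluster REST endpoint; the field and role names are assumptions to double-check against the Cluster page of your version:

```python
# Minimal sketch: list cluster nodes and their roles through the NiFi REST API
# (unsecured NiFi 1.x assumed; endpoint and field names to verify).
import requests

NIFI = "http://localhost:8080/nifi-api"

cluster = requests.get(f"{NIFI}/controller/cluster").json()["cluster"]
for node in cluster["nodes"]:
    roles = ", ".join(node.get("roles", [])) or "worker only"
    print(node["address"], "->", roles)
```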