NiFi: Cluster Node
Labels: Apache NiFi
Created 09-29-2016 07:57 AM
Hi all,
In a cluster configuration, where can I see which server ingests the data, for example with the GetHDFS processor?
Thanks
Created 09-29-2016 08:10 AM
The workflow displayed on the canvas is executed on each node of your cluster. Consequently, unless you have configured your GetHDFS processor to run on the primary node only (in the Scheduling tab of the processor configuration), every node of your cluster will get files from HDFS. This can create race conditions, so you should set this processor to run on the primary node only. The cluster page then shows which node has been elected primary node.
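If you prefer to check programmatically rather than through the cluster page, the same information is exposed over the REST API. Here is a minimal sketch, assuming an unsecured NiFi 1.x cluster reachable on localhost:8080; the endpoint and field names are from the 1.x API and may differ in other versions:

```python
import json
from urllib.request import urlopen

# Hypothetical address of one cluster node; adjust to your setup.
NIFI_URL = "http://localhost:8080/nifi-api/controller/cluster"

with urlopen(NIFI_URL) as resp:
    cluster = json.load(resp)["cluster"]

for node in cluster["nodes"]:
    # Each entry carries the node's address, connection status and elected
    # roles (e.g. "Primary Node", "Cluster Coordinator").
    roles = ", ".join(node.get("roles", [])) or "-"
    print(f"{node['address']}:{node['apiPort']}  status={node['status']}  roles={roles}")
```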
In order to balance the load when getting files from HDFS, you may want to use the combination of List/FetchHDFS processors. The ListHDFS processor creates one flow file per listed file, containing the path of the file, and the FetchHDFS processor actually gets the file from HDFS. By using a Remote Process Group on your canvas you can spread the flow files evenly across your nodes, and each node will be assigned different files to fetch from HDFS.
You can find an example of what I mean at the end of this HCC article:
https://community.hortonworks.com/content/kbentry/55349/nifi-100-unsecured-cluster-setup.html
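To make the division of labour concrete, below is a small, self-contained Python sketch of the pattern. This is an illustration of the listing/fetching split, not NiFi code; the node names and paths are made up:

```python
# One "primary" process lists the files (cheap), the work items are dealt
# out round-robin to the nodes (the Remote Process Group's job in NiFi),
# and each node fetches only its own share (FetchHDFS's job).
from itertools import cycle

def list_hdfs(directory):
    # Stand-in for ListHDFS: emit one record per file, carrying only the
    # path, not the content.
    return [f"{directory}/file-{i}.csv" for i in range(10)]

nodes = ["node1", "node2", "node3"]        # hypothetical cluster nodes
assignments = {n: [] for n in nodes}

# Stand-in for the Remote Process Group: spread flow files evenly.
for path, node in zip(list_hdfs("/data/in"), cycle(nodes)):
    assignments[node].append(path)

for node, paths in assignments.items():
    # Stand-in for FetchHDFS: each node retrieves its assigned files.
    print(node, "fetches", paths)
```

The point is that the listing happens once, while the expensive fetches are spread across the whole cluster.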
Hope this helps.
Created 09-29-2016 08:48 AM
Thanks @Pierre Villard: I followed your tutorial to set up my cluster, but in my case the primary node and the cluster coordinator are always on the same node. Do you know what is wrong?
Created 09-29-2016 08:50 AM
There is absolutely nothing wrong with having one node act as both cluster coordinator and primary node. These are two different roles that can be held by the same node.
