
NiFi: Cluster Node


Hi all,

In a cluster configuration, where can I see which server ingests the data with a processor such as GetHDFS?

Thanks

1 ACCEPTED SOLUTION


@mayki wogno

The workflow displayed on the canvas is executed on every node of your cluster. Consequently, unless you have configured your GetHDFS processor to run on the primary node only (on the Scheduling tab of the processor configuration), every node of your cluster will get the files from HDFS. This can create a race condition, so you should set the processor to run on the primary node only. The cluster page then shows which node has been elected as the primary node.
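
If you want to check this outside the UI, the NiFi REST API exposes the same information through its cluster endpoint. Here is a minimal sketch in Python, assuming an unsecured NiFi 1.x cluster like the one in the article linked below; the host name and port are placeholders:

```python
import requests

# Placeholder base URL for an unsecured NiFi 1.x cluster.
NIFI_API = "http://nifi-node1:8080/nifi-api"

# The cluster endpoint lists every node together with its roles,
# e.g. "Primary Node" and "Cluster Coordinator".
cluster = requests.get(f"{NIFI_API}/controller/cluster").json()

for node in cluster["cluster"]["nodes"]:
    print(node["address"], node["apiPort"], node.get("roles", []))
```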

To balance the load when getting files from HDFS, you may want to use the List/FetchHDFS pair of processors. The ListHDFS processor creates one flow file per listed file, carrying the path of that file, and the FetchHDFS processor actually retrieves the file from HDFS. By using a Remote Process Group on your canvas, you can evenly spread the flow files across your nodes, so that each node is assigned different files to fetch from HDFS.
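
Note that in this pattern the ListHDFS processor itself should still run on the primary node only, so the listing is not duplicated. You can set that through the UI as described above, or script it against the NiFi 1.x REST API. A minimal sketch, assuming an unsecured cluster; the host and processor id are placeholders:

```python
import requests

NIFI_API = "http://nifi-node1:8080/nifi-api"      # placeholder host
PROC_ID = "0000-placeholder-listhdfs-id"          # placeholder: id of your ListHDFS processor

# Fetch the processor entity to get its current revision (required for updates).
proc = requests.get(f"{NIFI_API}/processors/{PROC_ID}").json()

# Schedule the processor on the primary node only
# (executionNode accepts "ALL" or "PRIMARY" in NiFi 1.x).
update = {
    "revision": proc["revision"],
    "component": {
        "id": PROC_ID,
        "config": {"executionNode": "PRIMARY"},
    },
}
requests.put(f"{NIFI_API}/processors/{PROC_ID}", json=update).raise_for_status()
```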

You can find an example of what I mean at the end of this HCC article:

https://community.hortonworks.com/content/kbentry/55349/nifi-100-unsecured-cluster-setup.html

Hope this helps.


3 REPLIES



Thanks @Pierre Villard: I followed your tutorial to set up my cluster, but in my case the primary node and the cluster coordinator are always the same node. Do you know if something is wrong?


There is absolutely nothing wrong with having one node act as both cluster coordinator and primary node. These are two different roles that can be held by the same node.