NiFi: Cluster Node
Labels: Apache NiFi
Created 09-29-2016 07:57 AM
Hi all,
In a cluster configuration, where can I see which server ingests the data, for example with the GetHDFS processor?
Thanks
Created 09-29-2016 08:10 AM
The workflow displayed on the canvas is executed on each node of your cluster. Consequently, unless you have configured your GetHDFS processor to run on the primary node only (in the Scheduling tab of the processor configuration), every node of your cluster will get files from HDFS. This can create race conditions, so you should set this processor to run on the primary node only. The cluster page then shows which node has been elected primary node.
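If you prefer to check programmatically rather than through the cluster page, the same information is exposed over the REST API. Here is a minimal sketch, assuming an unsecured NiFi 1.x cluster reachable on localhost:8080; the endpoint and field names are from the 1.x API and may differ in other versions:

```python
import json
from urllib.request import urlopen

# Hypothetical address of one cluster node; adjust to your setup.
NIFI_URL = "http://localhost:8080/nifi-api/controller/cluster"

with urlopen(NIFI_URL) as resp:
    cluster = json.load(resp)["cluster"]

for node in cluster["nodes"]:
    # Each entry carries the node's address, connection status and elected
    # roles (e.g. "Primary Node", "Cluster Coordinator").
    roles = ", ".join(node.get("roles", [])) or "-"
    print(f"{node['address']}:{node['apiPort']}  status={node['status']}  roles={roles}")
```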
In order to balance the load when getting files from HDFS, you may want to use the combination of List/FetchHDFS processors. The ListHDFS processor creates one flow file per listed file, containing the path of the file, and the FetchHDFS processor actually gets the file from HDFS. By using a Remote Process Group on your canvas you can spread the flow files evenly across your nodes, and each node will be assigned different files to fetch from HDFS.
You can find an example of what I mean at the end of this HCC article:
https://community.hortonworks.com/content/kbentry/55349/nifi-100-unsecured-cluster-setup.html
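To make the division of labour concrete, below is a small, self-contained Python sketch of the pattern. This is an illustration of the listing/fetching split, not NiFi code; the node names and paths are made up:

```python
# One "primary" process lists the files (cheap), the work items are dealt
# out round-robin to the nodes (the Remote Process Group's job in NiFi),
# and each node fetches only its own share (FetchHDFS's job).
from itertools import cycle

def list_hdfs(directory):
    # Stand-in for ListHDFS: emit one record per file, carrying only the
    # path, not the content.
    return [f"{directory}/file-{i}.csv" for i in range(10)]

nodes = ["node1", "node2", "node3"]        # hypothetical cluster nodes
assignments = {n: [] for n in nodes}

# Stand-in for the Remote Process Group: spread flow files evenly.
for path, node in zip(list_hdfs("/data/in"), cycle(nodes)):
    assignments[node].append(path)

for node, paths in assignments.items():
    # Stand-in for FetchHDFS: each node retrieves its assigned files.
    print(node, "fetches", paths)
```

The point is that the listing happens once, while the expensive fetches are spread across the whole cluster.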
Hope this helps.
Created 09-29-2016 08:48 AM
Thanks @Pierre Villard: I followed your tutorial to set up my cluster, but in my case the primary node and the cluster coordinator are always on the same node. Do you know what is wrong?
Created 09-29-2016 08:50 AM
There is absolutely nothing wrong with having one node act as both cluster coordinator and primary node. These are two different roles that can be held by the same node.
