How to fetch rows from a table in parallel on a NiFi cluster
- Labels: Apache NiFi
Created ‎10-16-2016 02:57 AM
Hi:
From Google:
If you are using NiFi 1.0 you can use the GenerateTableFetch processor. It allows you to choose the "page" (aka partition) size, and will generate SQL statements, each of which will grab one "page" of data. You can route those into ExecuteSQL and it will retrieve smaller sets of results at a time. If you have a NiFi cluster, you can route GenerateTableFetch into a Remote Process Group that points at an Input Port on the same cluster (which will distribute the SQL statements across the cluster); the Input Port can then be connected to ExecuteSQL. This allows you to fetch rows from a table in parallel.
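To make the paging idea in the quoted passage concrete, here is a minimal Python sketch that splits a table into pages and emits one SELECT per page. It is not NiFi code and does not claim to reproduce the exact SQL that GenerateTableFetch emits; the `LIMIT`/`OFFSET` syntax, the function name `generate_page_queries`, and the table name, column name, and row count are all assumptions made for illustration.

```python
import math


def generate_page_queries(table, order_by, row_count, partition_size):
    """Return one SELECT statement per page of `partition_size` rows.

    Hypothetical helper: it only illustrates the idea of paging a table.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    # The last page may be smaller than partition_size, hence the ceiling.
    pages = math.ceil(row_count / partition_size)
    return [
        f"SELECT * FROM {table} ORDER BY {order_by} "
        f"LIMIT {partition_size} OFFSET {page * partition_size}"
        for page in range(pages)
    ]


# Assumed example values: a 25,000-row table read in pages of 10,000 rows.
queries = generate_page_queries("orders", "id", row_count=25000, partition_size=10000)
for query in queries:
    print(query)
```

Each generated statement is independent of the others, which is what makes it possible to hand them to different nodes and run them at the same time.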
I have three nodes in my NiFi cluster. Following the post, I run GenerateTableFetch on the primary node (test01) and send its output to a Remote Process Group on the same cluster, then from an output port to ExecuteSQL.
But the actual behavior is that ExecuteSQL runs on only one node (test02, or test03, or test01).
My question is how to fetch rows from a table in parallel on all nodes (test01, test02, and test03).
Thanks
Created on ‎11-02-2016 07:54 PM - edited ‎08-19-2019 02:10 AM
1. There is an existing open Jira for being able to adjust the batch size of Site-to-Site. (https://issues.apache.org/jira/browse/NIFI-1202)
2. NiFi does not restrict how many RPGs can be added to the canvas. What is important to understand is that NiFi nodes do not know about one another; each node runs the same dataflow. When RPGs are used to pull data from an output port, every node runs that RPG and every node requests FlowFiles. When one of those nodes connects, the cluster informs the connecting instance that x FlowFiles are currently queued at that output port, and that node will pull them all. You get much better load-balancing behavior from a push to an input port (although that is still done in batches of 100).
3. Two suggestions come to mind:
a. Reduce the configured "partition size" value in your GenerateTableFetch processor so that more FlowFiles are generated, which should then be load balanced better across your nodes.
b. Instead of using S2S, build a load-balanced dataflow that is hard-coded to deliver data to each node as follows:
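Why suggestion (a) helps can be shown with a rough simulation. The Python sketch below covers suggestion (a) only, not the dataflow in (b). It assumes a simplified model in which nodes take turns and each turn transfers up to 100 FlowFiles (the batch size mentioned above); real Site-to-Site scheduling is more involved. The function name, the 1,000,000-row table, and the partition sizes are hypothetical; the node names are taken from the question.

```python
import math


def simulate_batched_distribution(flowfile_count, node_names, batch_size=100):
    """Hand out queued FlowFiles to nodes in turn, one batch per turn.

    Simplified, assumed model; it is not how NiFi schedules transfers.
    """
    received = {name: 0 for name in node_names}
    remaining = flowfile_count
    turn = 0
    while remaining > 0:
        node = node_names[turn % len(node_names)]
        taken = min(batch_size, remaining)  # a batch never exceeds the queue
        received[node] += taken
        remaining -= taken
        turn += 1
    return received


nodes = ["test01", "test02", "test03"]
total_rows = 1_000_000  # hypothetical table size

for partition_size in (100_000, 10_000, 1_000):
    # One FlowFile (one SQL statement) is generated per partition.
    flowfiles = math.ceil(total_rows / partition_size)
    print(partition_size, flowfiles, simulate_batched_distribution(flowfiles, nodes))
```

Under this model, any run that produces 100 FlowFiles or fewer lands entirely on one node, which matches the behavior described in the question. Only when the partition size is small enough to produce several batches does the work spread across all three nodes.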
Created ‎01-15-2024 02:11 AM
Update:
In the latest NiFi, we can now connect "GenerateTableFetch" and "ExecuteSQL" directly with a connection.
NiFi is evolving 🙂