Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to fetch rows from a table in parallel when nifi cluster

avatar
Expert Contributor

Hi:

From google :

If you are using NiFi 1.0 you can use the GenerateTableFetch processor. It allows you to choose the "page" (aka partition) size, and will generate SQL statements, each of which will grab one "page" of data. You can route those into ExecuteSQL and it will retrieve smaller sets of results at a time. If you have a NiFi cluster, you can route GenerateTableFetch into a Remote Process Group that points at an Input Port on the same cluster, (which will distribute the SQL statements across the cluster), then the Input Port can be connected to the ExecuteSQL. This allows you to fetch rows from a table in parallel

I have three nodes on my nifi cluster, I follow the post, put GenerateTableFetch on primary node(test01) to execute, and then send to remote process group on the same cluster. then output port to ExecuteSQL.

But the actual behavior is ExecuteSQL was execute just in one node (test02, or test 03, or tes01).

My question is how to fetch rows from a table in parallel(test02, and test 03, and tes01) .

Thanks

1 ACCEPTED SOLUTION

avatar
Super Mentor

@Paul Yang

1. There is an existing open Jira for being able to adjust the batch size of Site-to-Site. (https://issues.apache.org/jira/browse/NIFI-1202)

2. NiFi does not restrict how many RPGs can be added to the canvas. What is important to understand is that NiFi Nodes do not know about one another. Each runs the dataflow. When using RPGs to pull data from an output port, every node is running that RPG and every node is requesting FlowFiles. When one of those nodes connects the cluster informs that connecting instances that x number of FlowFile are currently queued to that output port and that Node will pull them all. so you get much better load-balance behavior forma push to an input port (yet still done in batches of 100).

3. Two suggestions come to mind:

a. Reduce the configured "partition size" value in your GenerateTableFetch processor so more FlowFiles are generated which should then get better load balanced across you nodes.

b. Instead of using S2S, build a load-balanced dataflow that is hard-coded to deliver data to each node as follows:

9065-screen-shot-2016-11-02-at-35115-pm.png

View solution in original post

10 REPLIES 10

avatar
Expert Contributor

Update:

In the latest NiFi, now we can directly connect "GenerateTableFetch" and  and "ExecuteSQL" with a connection.

NiFi is evolving 🙂

zzeng_1-1705313463593.png

 

zzeng_0-1705313415379.png