Support Questions

Paul Yang · 10-16-2016

Hi:

I found this via Google:

If you are using NiFi 1.0 you can use the GenerateTableFetch processor. It allows you to choose the "page" (aka partition) size, and will generate SQL statements, each of which will grab one "page" of data. You can route those into ExecuteSQL and it will retrieve smaller sets of results at a time. If you have a NiFi cluster, you can route GenerateTableFetch into a Remote Process Group that points at an Input Port on the same cluster (which will distribute the SQL statements across the cluster); the Input Port can then be connected to ExecuteSQL. This allows you to fetch rows from a table in parallel.
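To illustrate the paging idea in the quote, here is a minimal Python sketch that produces one SQL statement per "page". It is not NiFi code: the function name `generate_page_queries`, the table name `orders`, the `id` ordering column, and the `LIMIT ... OFFSET` form are all assumptions made for illustration, and the exact SQL that GenerateTableFetch emits depends on the database in use.

```python
import math


def generate_page_queries(table, total_rows, partition_size, order_by="id"):
    """Return one SELECT per page so that all pages together cover total_rows rows.

    Hypothetical helper: mimics the idea of paging, not NiFi's actual output.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    pages = math.ceil(total_rows / partition_size)
    return [
        f"SELECT * FROM {table} ORDER BY {order_by} "
        f"LIMIT {partition_size} OFFSET {page * partition_size}"
        for page in range(pages)
    ]


# Hypothetical table with 25 rows, fetched 10 rows at a time: three statements.
queries = generate_page_queries("orders", total_rows=25, partition_size=10)
for query in queries:
    print(query)
```

Each generated statement is independent of the others, which is what makes it possible to hand them to different nodes.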

I have three nodes in my NiFi cluster. Following the post, I set GenerateTableFetch to execute on the primary node (test01) and send its output to a Remote Process Group on the same cluster, then from the output port to ExecuteSQL.

But the actual behavior is that ExecuteSQL executes on just one node (test01, test02, or test03).

My question is how to fetch rows from a table in parallel on all three nodes (test01, test02, and test03).

Thanks

MattWho · 11-02-2016

@Paul Yang

1. There is an existing open Jira for being able to adjust the batch size of Site-to-Site. (https://issues.apache.org/jira/browse/NIFI-1202)

2. NiFi does not restrict how many RPGs can be added to the canvas. What is important to understand is that NiFi nodes do not know about one another. Each runs the dataflow. When using RPGs to pull data from an output port, every node is running that RPG and every node is requesting FlowFiles. When one of those nodes connects, the cluster informs that connecting instance that x number of FlowFiles are currently queued to that output port, and that node will pull them all. So you get much better load-balance behavior from a push to an input port (yet still done in batches of 100).

3. Two suggestions come to mind:

a. Reduce the configured "partition size" value in your GenerateTableFetch processor so that more FlowFiles are generated, which should then get better load balanced across your nodes.

b. Instead of using S2S, build a load-balanced dataflow that is hard-coded to deliver data to each node as follows:
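Returning to suggestion (a), the effect of the partition size can be sketched with some simple arithmetic. The Python below is an illustration only, not NiFi code: the row count, the partition sizes, and the round-robin assignment of batches to nodes are simplifying assumptions. Only the batch size of 100 and the node names come from this thread.

```python
import math

BATCH_SIZE = 100  # Site-to-Site batch size mentioned in point 2 above


def flowfile_count(total_rows, partition_size):
    """One FlowFile (one SQL statement) is generated per partition."""
    return math.ceil(total_rows / partition_size)


def distribute(flowfiles, nodes, batch_size=BATCH_SIZE):
    """Hand out whole batches to nodes in round-robin order.

    Round-robin is an assumption for illustration, not NiFi's documented behavior.
    """
    counts = {node: 0 for node in nodes}
    remaining = flowfiles
    turn = 0
    while remaining > 0:
        batch = min(batch_size, remaining)
        counts[nodes[turn % len(nodes)]] += batch
        remaining -= batch
        turn += 1
    return counts


nodes = ["test01", "test02", "test03"]

# Hypothetical 1,000,000-row table with a large partition size:
# only 10 FlowFiles, which fit into a single batch, so one node gets them all.
coarse = distribute(flowfile_count(1_000_000, 100_000), nodes)

# Same table with a smaller partition size: 1,000 FlowFiles, i.e. 10 batches.
fine = distribute(flowfile_count(1_000_000, 1_000), nodes)

print(coarse)
print(fine)
```

Under these assumptions, the FlowFiles only start to spread across nodes once there are more of them than fit in one batch, which is why a smaller partition size helps.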

zzeng · 01-15-2024

Update:

In the latest NiFi, we can now connect "GenerateTableFetch" and "ExecuteSQL" directly with a connection.

NiFi is evolving 🙂
