
NiFi data flow: handling data spikes in ListenTCP

Expert Contributor

Hi All,

Thanks a lot to this awesome community.

My data flow looks like ListenTCP -> MergeContent -> UpdateAttribute -> PutHDFS

I have reviewed the number of concurrent tasks and increased them; however, the whole flow still gets overwhelmed by data spikes (see attached image). Any suggestions on how to handle this?

[Attachment: capture2.png]

Thanks

Dhieru

1 ACCEPTED SOLUTION

Super Mentor

@dhieru singh

When you say "all processors" are being overwhelmed, are you saying that the connections between all of the processors are filling and triggering back pressure in your dataflow?

Have you looked at the resources of the hardware running your NiFi instance?

Is CPU, memory, and/or disk I/O becoming saturated during these spikes? If so, there is not much within the configuration of NiFi that can help here. In a case like this you would need to expand your NiFi into a cluster.

You then have two options for your ListenTCP feed.

1. Run the ListenTCP processor on all nodes and place an external load balancer in front of the cluster to distribute the TCP traffic to every node (see the sketch after this list).

2. Have the ListenTCP processor receive data on only one node, but immediately feed the success relationship from that ListenTCP processor to a Remote Process Group (RPG) that redistributes the received FlowFiles to all nodes in your cluster, spreading out the work done by the rest of the processors in your dataflow(s).
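
For option 1, a minimal sketch of the external load balancer, assuming HAProxy with hypothetical hostnames and a ListenTCP port of 9000 (substitute your own nodes and port):

    # HAProxy TCP load balancing across all NiFi nodes (names/ports are illustrative)
    frontend nifi_tcp_in
        bind *:9000
        mode tcp
        default_backend nifi_listen_tcp

    backend nifi_listen_tcp
        mode tcp
        balance roundrobin
        server nifi1 nifi-node1:9000 check
        server nifi2 nifi-node2:9000 check
        server nifi3 nifi-node3:9000 check

Any TCP-capable load balancer will do; the point is simply that every node's ListenTCP receives a share of the incoming connections.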

If your resources are not saturated, make sure you have allocated enough "Max Timer Driven Threads" to your NiFi instance so that all processors can fully utilize those server CPU resources. The default for NiFi is only 10. The Max Timer Driven Thread Count can be adjusted in the "Controller Settings" UI found within the hamburger menu in the upper-right corner.
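
As a rough starting point (a common community guideline rather than anything specific to this flow), the Max Timer Driven Thread Count is often set to around 2 to 4 times the number of CPU cores; for example, a 16-core server might start at 32 to 64 threads and be tuned from there based on observed CPU utilization.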

Note: do not adjust the default for the Event Driven Thread Count. This just increases a thread pool that is not used by default.

If disk I/O is high, it would help to follow best practices and make sure the NiFi logs, provenance repository(s), content repository(s), and FlowFile repository are each located on their own physical disk.
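
As a minimal sketch, the repository locations are set in nifi.properties; the property names below are the standard ones, while the /diskN mount points are hypothetical:

    # nifi.properties - each repository on its own physical disk (paths are illustrative)
    nifi.flowfile.repository.directory=/disk1/flowfile_repository
    nifi.content.repository.directory.default=/disk2/content_repository
    nifi.provenance.repository.directory.default=/disk3/provenance_repository

The log directory is configured separately (for example, via NIFI_LOG_DIR in nifi-env.sh) and can be pointed at its own disk as well.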

Thank you,

Matt


3 REPLIES


Super Mentor

@dhieru singh

You may also consider a two-phase approach with the MergeContent processor. By using two MergeContent processors in series, you will reduce your NiFi's heap usage and the number of FlowFiles needing to be merged in each iteration.

For example, you might have the first MergeContent merge based on the number of FlowFiles (min 10,000 and max 15,000), then have the second merge based on size (assuming a 64 MB to 128 MB block size for your HDFS, you would set the minimum group size to 64 MB and the maximum to 128 MB); see the sketch below.
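
As a minimal sketch of the key properties on those two MergeContent processors (the property names are MergeContent's own; the thresholds are just the example values above):

    First MergeContent (bin by FlowFile count):
        Merge Strategy            = Bin-Packing Algorithm
        Minimum Number of Entries = 10000
        Maximum Number of Entries = 15000

    Second MergeContent (bin by size):
        Merge Strategy     = Bin-Packing Algorithm
        Minimum Group Size = 64 MB
        Maximum Group Size = 128 MB

Setting a Max Bin Age on both processors is also worth considering so that bins still flush during quiet periods.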

Thanks,

Matt

Expert Contributor

@Matt Clarke Thanks for the response. I have already done all of the above; the RPG is the one thing I have not tried, but it looks like a good suggestion, because the data spike was on one node (that node's status indicator was green).