
Optimize NiFi Flow for Log-Based FlowFile Status Tracking

New Member

Hi Community,

I'm working on a NiFi setup where I use a dedicated template to track the status of FlowFiles from various other templates. The status of each FlowFile is logged in a specific pattern, and I'm using this pattern to extract and persist status information.


Here's a brief overview of the current approach:

  1. TailFile Processor reads log entries from a specific log file.

  2. SplitText Processor splits the log content line by line.

  3. ExtractGrok Processor extracts relevant fields using a defined Grok pattern.

  4. ReplaceText Processor restructures the data to a desired format (e.g., JSON).

  5. PutDatabaseRecord Processor stores the structured data into a database.
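
For context, the status pattern looks roughly like the following (the log format and field names below are simplified placeholders, not the exact production pattern):

  Example status line written by the other templates:
    2025-04-22 10:15:30,123 INFO FlowFileStatus flowfile_uuid=3f2a7c1e-9b4d-4a8e-8c2f-1d5e6a7b8c9d template=IngestOrders status=SUCCESS

  Grok pattern used in ExtractGrok (step 3) to pull out the tracking fields:
    %{TIMESTAMP_ISO8601:event_time} %{LOGLEVEL:level} FlowFileStatus flowfile_uuid=%{UUID:flowfile_uuid} template=%{WORD:template_name} status=%{WORD:status}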

Problems Faced:

  • Queue Build-Up & Performance Bottleneck:

    • TailFile often brings in large chunks of data, especially under high log volume.

    • The SplitText processor cannot keep up with the rate of incoming data.

    • This leads to large unprocessed FlowFiles piling up in the queue.

  • FlowFile Explosion & Choking:

    • Once a large FlowFile is split, it results in a burst of many smaller FlowFiles.

    • This sudden expansion causes congestion and chokes downstream processors.

  • Repository Storage Issues:

    • The above behavior leads to excessive usage of the FlowFile Repository, Content Repository, and Provenance Repository.

    • Over time, this is causing storage concerns and performance degradation.

 

My Question: 

Is there a way to optimize this flow to:

  • Reduce the memory and storage pressure on NiFi repositories?

  • Handle incoming log data more efficiently without overwhelming the system?

  • Or, is there a better architectural pattern to achieve log-based FlowFile tracking across templates?

Any guidance or best practices would be greatly appreciated.

Thanks!

1 REPLY

Master Mentor

@ajaykumardev32 

I would try to redesign your dataflow to avoid splitting the FlowFiles produced by the TailFile processor. NiFi FlowFile content is immutable (it cannot be modified once created). Any time the content of a FlowFile is modified, the modified content is written to a new NiFi content claim. If the processor has an "original" relationship, an entirely new FlowFile is created (both metadata and content); processors without an "original" relationship that modify content simply update the existing FlowFile's metadata to point to the new content claim. So your SplitText processor is producing a lot of new FlowFiles, and downstream you then have inefficient thread usage, where processors execute against many small FlowFiles.

As far as the Provenance repository goes, you can configure the maximum amount of storage it can use before purging older provenance events. The Content and FlowFile repositories should not be on the same disk, since it is possible for the content repository to fill the disk to 100%. You want to protect your FlowFile repository from filling to 100% by having it on a different physical or logical drive.
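
For reference, the relevant nifi.properties entries look like the following (the sizes and paths shown are only example values; adjust them for your environment):

  # cap provenance repository growth before old events are purged (example values)
  nifi.provenance.repository.max.storage.time=24 hours
  nifi.provenance.repository.max.storage.size=10 GB

  # keep each repository on its own physical/logical drive (example paths)
  nifi.flowfile.repository.directory=/data1/nifi/flowfile_repository
  nifi.content.repository.directory.default=/data2/nifi/content_repository
  nifi.provenance.repository.directory.default=/data3/nifi/provenance_repository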

Try utilizing the available record-based processors instead, so the record conversion/transform/modification happens without splitting FlowFiles. In your case, take a look at these record-oriented components to see if they fit your use case (a rough sketch of the resulting flow is below):

  • GrokReader – a record reader controller service that applies your Grok pattern as each record is read, replacing SplitText and ExtractGrok

  • ConvertRecord – converts records from one format to another using a record reader and a record writer

  • UpdateRecord – modifies individual record fields, replacing ReplaceText

  • QueryRecord – filters or reshapes records using SQL

You are already using a "record" based processor to write to your destination DB.
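
As a rough sketch only (these are real NiFi components, but the exact chain and reader/writer choices are assumptions based on your description), the record-based version of the flow could look like:

  TailFile -> PutDatabaseRecord (Record Reader: GrokReader configured with your existing Grok pattern)

  or, if fields still need reshaping before the insert:

  TailFile -> UpdateRecord (Record Reader: GrokReader, Record Writer: JsonRecordSetWriter) -> PutDatabaseRecord (Record Reader: JsonTreeReader)

Each batch produced by TailFile then stays as a single FlowFile, and the per-line parsing happens inside the record reader rather than through SplitText, ExtractGrok, and ReplaceText.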

Other strategies involve adjusting the "Max Timer Driven Thread Count" and per-processor "Concurrent Tasks" settings. You'll need to carefully monitor CPU load average as you make incremental adjustments; once the CPU is maxed out, there is no gain from raising these values further. Setting "Concurrent Tasks" too high on any one processor can actually lead to worse overall performance in your dataflow, so small increments followed by monitoring are the proper path to optimization in this area.
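
For example, on Linux you can compare the load average against the core count available to NiFi after each incremental change (standard OS commands, nothing NiFi specific):

  nproc     <- number of CPU cores
  uptime    <- 1-, 5- and 15-minute load averages; sustained values at or above the core count mean more threads will not help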

Please help our community grow. If any of the suggestions/solutions provided helped you solve your issue or answer your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt