
Improving NiFi Performance


Hi,

 

I'm new to NiFi and am trying to understand how to improve its performance.  In this use case I am fetching large zip files (30 - 700 GB) from a network share with the FetchFile processor and then unpacking them with UnpackContent.

 

Files in the 500GB+ range are taking several hours to move on to the unpack stage.  They do eventually make their way through, but I'm not sure of the best way to improve performance.

 

The server is beefy, with 1TB RAM, plenty of disk space, and a 40-core CPU.  CPU, memory, disk, and network I/O utilisation all seem low while the processor is running.

 

I've upped the max timer/event thread counts to 80/5 respectively, but that doesn't seem to have made any difference.  Is tackling the JVM min/max limits a good option to go for next?

 

Thanks for any guidance.

1 REPLY


@PurpleK 

It is not clear what you mean when you say "Files that are in the 500GB+ range are taking several hours to move onto the unpack stage".

FlowFiles are not released to a downstream connection until processing of the source FlowFile is complete.  The source FlowFile will still be counted in the queue of the connection feeding a processor even while that processor is executing on it.
When you say "moving on to the unpack stage", are you referring to an upstream processor taking a while to queue a FlowFile on the connection feeding the UnpackContent processor, or to UnpackContent itself taking a while, once the file is queued, to finish executing on it, create the unpacked FlowFiles, and remove the original zip from the upstream connection queue?
Step 1 is to identify the exact place(s) where it is slow.

Adding concurrent tasks to a processor does not speed up execution on a specific source FlowFile.  One thread gets assigned to each execution of the processor, and in the case of UnpackContent, each thread executes against one FlowFile from the upstream connection.  Adding multiple concurrent tasks allows multiple upstream FlowFiles to be processed concurrently.  IMPORTANT: Increase concurrent tasks slowly while monitoring CPU load averages.  Adding too many concurrent tasks on any one processor can impact other processors in your dataflow.
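To make that distinction concrete, here is a rough sketch in Python.  It is only a toy scheduling model, not NiFi code: the function name simulated_makespan and the 6-hour duration are made up for illustration, and it assumes each FlowFile is processed start to finish by exactly one thread.

```python
import heapq

def simulated_makespan(durations_hours, concurrent_tasks):
    """Return the wall-clock hours needed to process every FlowFile.

    Toy model: each FlowFile is handled start to finish by one thread,
    so a single FlowFile can never finish faster than its own duration.
    """
    if concurrent_tasks < 1:
        raise ValueError("concurrent_tasks must be at least 1")
    # Each heap entry is the time at which one thread becomes free.
    free_at = [0.0] * concurrent_tasks
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations_hours:
        start = heapq.heappop(free_at)  # earliest available thread
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

# One hypothetical 6-hour zip: extra concurrent tasks change nothing.
print(simulated_makespan([6.0], 1))      # 6.0
print(simulated_makespan([6.0], 8))      # 6.0
# Four such zips: concurrent tasks let them run side by side.
print(simulated_makespan([6.0] * 4, 1))  # 24.0
print(simulated_makespan([6.0] * 4, 4))  # 6.0
```

In this model the only thing concurrent tasks change is how many source FlowFiles are in flight at once, which is why overall throughput can improve while the time for any single large zip stays the same.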

The Event Driven processor scheduling strategy is deprecated, will eventually go away (most likely in the next major release), and should not be used.  Increasing the Max Event Driven Thread Count under controller settings will therefore have no impact unless you are using that strategy in your flow.  It does create event threads, but they would not consume CPU if you are not using event driven scheduling anywhere in your dataflow(s).

NiFi is a data agnostic service, meaning it can handle any data type in its raw binary format.  NiFi can do this because it wraps that binary content in a NiFi FlowFile.  A NiFi FlowFile is what you see moving from processor to processor in your dataflows, and it becomes the responsibility of the processor to understand the FlowFile's content should it need to read it.  I bring this up because each FlowFile adds a small bit of overhead, as NiFi has to generate FlowFile metadata for every FlowFile created.

When it comes to your 500GB+ zip files...
1. Do they consist of many small and/or large files? NiFi must create a FlowFile for each file that results from unpacking the original zip.
2. Do you see a lot of Java Garbage Collection (GC) pauses happening?  All GC is stop-the-world.  GC is a normal operation of any JVM, but if GC is happening very often it can impact flow performance with constant pauses due to the stop-the-world nature of GC.  The larger the JVM memory, the longer the stop-the-world event will be.
3. Any exceptions in your nifi-app.log?
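For question 1, one way to get a feel for how many FlowFiles a zip would produce is to read its directory listing outside of NiFi.  The sketch below is only an illustration in Python using the standard zipfile module.  The helper name summarize_zip is hypothetical, and it assumes that directory entries do not become FlowFiles.

```python
import zipfile

def summarize_zip(source):
    """Count the file entries in a zip and total their uncompressed size.

    `source` may be a path or a seekable file-like object.  Only the
    zip's central directory is read; no content is decompressed.
    """
    with zipfile.ZipFile(source) as archive:
        # Skip directory entries (assumed not to become FlowFiles).
        files = [info for info in archive.infolist() if not info.is_dir()]
    return {
        "file_count": len(files),
        "total_bytes": sum(info.file_size for info in files),
        "largest_bytes": max((info.file_size for info in files), default=0),
    }
```

Calling summarize_zip on a copy of one of the slow archives tells you whether you are dealing with millions of tiny entries (lots of FlowFile metadata to create) or a handful of very large ones.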

You may also find this article helpful.  It is old, but the majority of its guidance is still very valid.  The latest NiFi versions support Java 8 and Java 11, so you can ignore the G1GC recommendations if you are using Java 11.
https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-hi...

 

Hopefully adding concurrent tasks on the processor(s) executing against the content of large FlowFiles will help you better utilize your hardware and achieve better overall throughput.  Keep in mind that this only allows concurrent execution on multiple source FlowFiles, so it will not improve the speed at which a single FlowFile is processed by a given processor.



Thank you,

Matt