Performance diff between single big file vs multiple smaller files

Expert Contributor

I have a NiFi flow. When the input file is split into smaller files that are fed into the flow one by one, the overall time taken (the sum of the times for the individual files) is considerably lower than when I feed in a single big file.

What could cause this performance difference?

Note:

The flow has many processors that use Avro readers/writers.

I calculate the time using the following expression in a LogMessage processor:

${now():toNumber():minus(${lineageStartDate}):format("HH:mm:ss", "GMT")}
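
For reference, here is a rough Python sketch of what that expression computes, assuming lineageStartDate is an epoch-millisecond timestamp. The function name lineage_elapsed is made up for illustration. Two things worth noting: formatting the difference as HH:mm:ss in GMT wraps around after 24 hours, and the value covers all wall-clock time since the lineage started, including time spent waiting in queues.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def lineage_elapsed(now_ms: int, lineage_start_ms: int) -> str:
    """Mimic now():toNumber():minus(lineageStartDate):format("HH:mm:ss", "GMT").

    Both arguments are epoch milliseconds. The difference is treated as a
    timestamp in GMT and only its time-of-day part is printed, so anything
    longer than 24 hours wraps around.
    """
    elapsed_ms = now_ms - lineage_start_ms
    return (EPOCH + timedelta(milliseconds=elapsed_ms)).strftime("%H:%M:%S")


# 95 seconds between lineage start and "now"
print(lineage_elapsed(1_700_000_095_000, 1_700_000_000_000))
# 25 hours elapsed: the day part is dropped by the HH:mm:ss format
print(lineage_elapsed(25 * 3600 * 1000, 0))
```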

1 ACCEPTED SOLUTION

Master Mentor

@manishg 

The first thing that comes to mind is JVM heap. You may want to collect garbage collection data and compare it for runs with large files versus small files.

Second would be identifying which processor(s) the largest FlowFiles spend the most time at. For this I would suggest looking at the provenance lineage for the large FlowFiles. There is a slider at the bottom of the lineage graph that you can drag to see the progression of the FlowFile through the lineage tree. Which processor did the FlowFile spend the longest at, and how is it configured?
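
One way to turn that lineage inspection into numbers is to total the event durations per processor. This is only a minimal Python sketch, assuming the provenance events for one FlowFile have already been pulled out as a list of dicts. The field names componentName and eventDuration (in milliseconds), the helper name time_per_component, and the sample values are all assumptions for illustration, not measurements from this flow.

```python
from collections import defaultdict


def time_per_component(events):
    """Sum event durations (ms) per component, slowest component first.

    `events` is a list of dicts with the assumed keys "componentName"
    and "eventDuration". Events with a missing or negative duration
    are skipped.
    """
    totals = defaultdict(int)
    for event in events:
        duration = event.get("eventDuration")
        if duration is None or duration < 0:
            continue
        totals[event["componentName"]] += duration
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical events for one large FlowFile (made-up numbers).
sample_events = [
    {"componentName": "ConvertRecord", "eventDuration": 42000},
    {"componentName": "QueryRecord", "eventDuration": 8000},
    {"componentName": "QueryRecord", "eventDuration": 3000},
    {"componentName": "LogMessage", "eventDuration": 5},
    {"componentName": "LogMessage", "eventDuration": None},
]

for name, total_ms in time_per_component(sample_events):
    print(f"{name}: {total_ms} ms")
```

Running the same summary for a large FlowFile and for one of the small ones should show whether a single processor accounts for most of the difference.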

Hope this helps,

Matt


3 REPLIES

Expert Contributor

Where is this graph available in the UI? And does it get updated after every run?

Expert Contributor

Got it. It's in the Data Provenance dialog box.