Created 09-27-2023 02:36 AM
I have a NiFi flow. I observe that if the input file is split into smaller files that are fed into the flow one by one, the overall time taken (the sum of the times for the individual files) is considerably lower than when I feed in one single big file.
What could be the cause of this performance difference?
Note:
The flow has many processors that use Avro readers/writers.
I calculate the elapsed time with the following expression in a LogMessage processor:
${now():toNumber():minus(${lineageStartDate}):format("HH:mm:ss", "GMT")}
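The expression subtracts the FlowFile's lineage start date (epoch milliseconds) from the current time and formats the difference as HH:mm:ss in GMT. A minimal Python sketch of the same arithmetic (the function name is my own, not part of NiFi; note the pattern only reads correctly for durations under 24 hours, since the format treats the millisecond difference as a time of day):

```python
import datetime

def elapsed_hms(lineage_start_ms: int, now_ms: int) -> str:
    """Mirror the expression-language pattern: subtract epoch millis,
    then render the difference as HH:mm:ss in GMT."""
    delta_ms = now_ms - lineage_start_ms
    # Interpreting the delta as an epoch timestamp in UTC reproduces
    # what format("HH:mm:ss", "GMT") does to the subtraction result.
    as_time = datetime.datetime.fromtimestamp(
        delta_ms / 1000, tz=datetime.timezone.utc
    )
    return as_time.strftime("%H:%M:%S")

# Example: a FlowFile that entered the flow 95 minutes ago
print(elapsed_hms(0, 95 * 60 * 1000))  # 01:35:00
```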
Created 09-27-2023 05:54 AM
@manishg
The first thing that comes to mind is JVM heap. You may want to collect and look at garbage collection data with large files versus small files.
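One way to collect that garbage-collection data is to enable GC logging in NiFi's conf/bootstrap.conf. A sketch, assuming a Java 11+ JVM with unified logging; the argument index 20 is an assumption, pick any index not already used in your bootstrap.conf:

```
# conf/bootstrap.conf
# Enable rotated GC logs so runs with large vs. small files can be compared.
java.arg.20=-Xlog:gc*:file=./logs/gc.log:time,uptime:filecount=5,filesize=10m
```

Comparing pause frequency and heap occupancy in these logs between a large-file run and a small-file run should show whether heap pressure explains the slowdown.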
Second would be identifying which processor(s) the largest FlowFiles spend the most time at. For this I would suggest looking at the provenance lineage for the large FlowFiles. There is a slider bar at the bottom of that lineage graph that you can drag to see the progression of the FlowFile through the lineage tree. Which processors did the FlowFile spend the longest at, and how are they configured?
Hope this helps,
Matt
Created 09-27-2023 07:52 PM
Where is this graph available in the UI? And does it get updated after every run?
Created 09-27-2023 07:57 PM
Got it. It's in the Data Provenance dialog box.