Created 09-27-2023 02:36 AM
I have a NiFi flow. I observe that if the input file is split into smaller files that are fed into the flow one by one, the overall time taken (the sum of the times for the individual files) is considerably lower than when I feed in one single big file.
What could be the cause of this performance difference?
Note:
The flow has many processors that use Avro readers/writers.
I calculate the elapsed time with the following expression in a LogMessage processor:
${now():toNumber():minus(${lineageStartDate}):format("HH:mm:ss", "GMT")}
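The expression subtracts the FlowFile's lineage start date (epoch milliseconds) from the current time and formats the difference as HH:mm:ss in GMT. A minimal Python sketch of the same arithmetic (the function name is my own, not part of NiFi; note the pattern only reads correctly for durations under 24 hours, since the format treats the millisecond difference as a time of day):

```python
import datetime

def elapsed_hms(lineage_start_ms: int, now_ms: int) -> str:
    """Mirror the expression-language pattern: subtract epoch millis,
    then render the difference as HH:mm:ss in GMT."""
    delta_ms = now_ms - lineage_start_ms
    # Interpreting the delta as an epoch timestamp in UTC reproduces
    # what format("HH:mm:ss", "GMT") does to the subtraction result.
    as_time = datetime.datetime.fromtimestamp(
        delta_ms / 1000, tz=datetime.timezone.utc
    )
    return as_time.strftime("%H:%M:%S")

# Example: a FlowFile that entered the flow 95 minutes ago
print(elapsed_hms(0, 95 * 60 * 1000))  # 01:35:00
```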
Created 09-27-2023 05:54 AM
@manishg
The first thing that comes to mind is JVM heap. You may want to collect and look at garbage collection data with large files versus small files.
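One way to collect that garbage-collection data is to enable GC logging in NiFi's conf/bootstrap.conf. A sketch, assuming a Java 11+ JVM with unified logging; the argument index 20 is an assumption, pick any index not already used in your bootstrap.conf:

```
# conf/bootstrap.conf
# Enable rotated GC logs so runs with large vs. small files can be compared.
java.arg.20=-Xlog:gc*:file=./logs/gc.log:time,uptime:filecount=5,filesize=10m
```

Comparing pause frequency and heap occupancy in these logs between a large-file run and a small-file run should show whether heap pressure explains the slowdown.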
Second would be identifying which processor(s) the largest FlowFiles spend the most time at. For this I would suggest looking at the provenance lineage for the large FlowFiles. There is a slider bar at the bottom of that lineage graph that you can drag to see the progression of the FlowFile through the lineage tree. Which processors did the FlowFile spend the longest at, and how are they configured?
Hope this helps,
Matt
Created 09-27-2023 07:52 PM
Where is this graph available in the UI? And does it get updated after every run?
Created 09-27-2023 07:57 PM
Got it. It's in the Data Provenance dialog box.