Our application is currently running load through SplitContent, doing some enrichment on the two split pieces in parallel, then merging the results back together again. At a glance there were no problems with the MergeContent processor, but when we actually started measuring the time it was taking to merge flowfiles we could see some results under 100ms, but also a good chunk spread thinly from 100 to 2000ms. A little research led to the reminder that flowfile content is kept on disk until it's merged, which means we have to deal with a large number of random disk reads, and that seems to explain the slowdown.
Is there something, anything we can do to keep the reads towards the lower end?
How about a way to pre-fetch or cache the flowfile content?
An alternative solution I was knocking around in my head: is there a way to merge just the attributes of two flowfiles and skip the disk read entirely? The payloads are small enough to fit in an attribute.
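To illustrate that idea (this is a plain-Python sketch of the logic, not NiFi API code, and the `enriched.payload` attribute name is hypothetical): if each branch promotes its small JSON payload into an attribute, the merge becomes a pure in-memory operation with no content-repository read at all.

```python
import json

# Sketch of the attribute-based merge idea: each branch stores its small
# JSON payload in a flowfile attribute (attribute name is hypothetical),
# so "merging" is just combining two in-memory strings -- no disk read.
def merge_attribute_payloads(attrs_a, attrs_b, key="enriched.payload"):
    a = json.loads(attrs_a[key])
    b = json.loads(attrs_b[key])
    merged = {**a, **b}  # second flowfile wins on key collisions
    return json.dumps(merged, sort_keys=True)

# Example: two flowfiles whose payloads were promoted to attributes
ff1 = {"enriched.payload": '{"id": 1, "geo": "US"}'}
ff2 = {"enriched.payload": '{"id": 1, "score": 0.9}'}
print(merge_attribute_payloads(ff1, ff2))
```

In NiFi terms, ExtractText can promote content into an attribute via a regex capture; how you then combine the two flowfiles' attributes (e.g. a scripted processor) is left open here.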
What format is your original data in?
If it is a well-known format like JSON, CSV, Avro, or logs, then you may want to look at the record-based processors in Apache NiFi 1.2.0 and 1.3.0. With those you would be able to avoid having to split and merge in the first place, and instead treat the flowfile as a set of records and update/enrich the records in place (UpdateRecord).
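To make the record-based suggestion concrete, here is a plain-Python sketch of the pattern UpdateRecord enables (the enrichment function here is a made-up example): the whole flowfile is treated as an array of records and each record is enriched in one pass, so there is no per-record split/merge and none of the associated content-repository reads per fragment.

```python
import json

# Sketch of record-oriented enrichment: process every record in a single
# flowfile in one pass, instead of splitting into per-record flowfiles,
# enriching them separately, and merging them back together.
def enrich_records(content, enrich):
    records = json.loads(content)  # flowfile content: a JSON array of records
    return json.dumps([enrich(r) for r in records])

# Hypothetical enrichment: tag each record with a derived field
flowfile_content = '[{"id": 1}, {"id": 2}]'
enriched = enrich_records(
    flowfile_content,
    lambda r: {**r, "even": r["id"] % 2 == 0},
)
print(enriched)
```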
The payloads that we are merging are JSON. Unfortunately we are currently stuck on NiFi 1.1 until we can get an upgrade, so a change there is not an immediate fix.