
Do more processors in Apache NiFi lead to lower throughput?

New Contributor

In Apache NiFi, there are connections between processors, which act as queues of FlowFiles, and NiFi by default persists the content of FlowFiles on disk. Does this mean that each such connection persists FlowFiles to disk? If so, every delivery of a FlowFile from one processor to another would mean one disk read and one disk write, so more processors would lead to more disk reads and writes, which in turn would lower overall throughput. Is my understanding correct? And what is the best practice to avoid this: writing everything in one processor? Thanks.

1 ACCEPTED SOLUTION

Master Collaborator

@Threepwood ,

 

What you're proposing here is something called Stateless NiFi execution.

In a normal NiFi flow, data is always passed in the form of FlowFiles, and there's no way to work around that. I/O is the tax NiFi pays for its flexibility. It's known to be I/O-heavy, but it still performs very well and can handle huge volumes of data if you follow best practices when building your flows.

 

Stateless NiFi execution can be used for the subset of NiFi flows that don't need to store state. In Stateless execution, nothing is written to disk: data is passed from one processor to the next through function calls. It can achieve much faster performance this way, but it's limited to stateless flows.
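To make the distinction concrete, here is a toy sketch (this is not NiFi's actual API, just an illustration): in a normal flow each connection behaves like a disk-backed queue the next processor reads back from, while a stateless flow chains the same steps as plain function calls with no disk round trip.

```python
# Toy illustration of disk-backed connections vs stateless chaining.
# The "processors" and repository here are made up for the example.
import json
import tempfile
from pathlib import Path

def extract(record: dict) -> dict:
    return {"key": record["key"].upper()}

def enrich(record: dict) -> dict:
    return {**record, "source": "kafka"}

# Normal flow: every hop serializes the FlowFile into a disk-backed
# "connection", and the next processor reads it back from disk.
def run_with_disk_queue(record: dict, repo: Path) -> dict:
    for step in (extract, enrich):
        record = step(record)
        claim = repo / f"{step.__name__}.json"
        claim.write_text(json.dumps(record))    # write to the connection
        record = json.loads(claim.read_text())  # next processor reads it back
    return record

# Stateless flow: the same steps chained as function calls, no disk I/O.
def run_stateless(record: dict) -> dict:
    return enrich(extract(record))

with tempfile.TemporaryDirectory() as d:
    a = run_with_disk_queue({"key": "abc"}, Path(d))
b = run_stateless({"key": "abc"})
assert a == b == {"key": "ABC", "source": "kafka"}
```

Both paths produce the same result; the stateless version simply skips the per-hop persistence, which is why it is faster but also why it only suits flows that can afford to lose in-flight data on failure.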

 

Mark Payne has another video where he talks about this and how to use it to achieve Exactly-Once delivery from Kafka to Kafka: https://www.youtube.com/watch?v=VyzoD8eh-t0

 

Cheers,

André

 

--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs up button.


3 REPLIES 3

Master Collaborator

@Threepwood ,

 

FlowFiles in NiFi are always persisted on disk. That doesn't mean, though, that they are rewritten by every processor. Content only needs to be written when it changes; if it doesn't, NiFi reuses the same FlowFile content on disk across processors.
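The reuse can be sketched roughly like this (an assumed model for illustration, not NiFi's internals): content is written once to a repository "claim", and a processor that only touches attributes produces a new FlowFile pointing at the same claim, so no extra disk write happens for it.

```python
# Toy model of content-claim reuse: attribute-only processors share the
# same on-disk content; only content changes trigger a new write.
import hashlib

class ContentRepo:
    def __init__(self):
        self.claims = {}
        self.writes = 0

    def put(self, data: bytes) -> str:
        claim_id = hashlib.sha256(data).hexdigest()[:8]
        if claim_id not in self.claims:
            self.claims[claim_id] = data
            self.writes += 1        # a disk write happens only here
        return claim_id

class FlowFile:
    def __init__(self, claim_id: str, attributes: dict):
        self.claim_id = claim_id
        self.attributes = attributes

repo = ContentRepo()
ff = FlowFile(repo.put(b"payload"), {"filename": "a.txt"})

# An attribute-only processor: new FlowFile, same content claim.
ff2 = FlowFile(ff.claim_id, {**ff.attributes, "route": "elastic"})

# A content-modifying processor: only now is a second claim written.
ff3 = FlowFile(repo.put(b"PAYLOAD"), ff2.attributes)

assert ff2.claim_id == ff.claim_id  # no new disk write for attributes
assert repo.writes == 2             # one write per distinct content
```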

 

An excessive number of FlowFiles, though, does affect performance, and having too many very small FlowFiles is an anti-pattern. The best practice is to use record-based processors to avoid this. Please check out Mark Payne's YouTube series on NiFi Anti-Patterns for more on this: https://www.youtube.com/watch?v=RjWstt7nRVY&t=302s
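The point of the record-based approach can be shown with a small comparison (illustrative numbers only): a thousand individual messages as a thousand FlowFiles means a thousand sets of writes and per-FlowFile bookkeeping, whereas one record-oriented FlowFile carries the whole batch as a single piece of content.

```python
# Toy comparison: per-message FlowFiles vs one record-oriented FlowFile.
import json

messages = [{"id": i, "value": i * i} for i in range(1000)]

# Anti-pattern: one tiny FlowFile per message, i.e. one content write
# (plus attribute/provenance bookkeeping) for every single message.
per_message_flowfiles = len(messages)

# Record-based: one FlowFile whose content is the whole batch, e.g. as
# JSON lines, the way record-oriented processors bundle data.
batch_content = "\n".join(json.dumps(m) for m in messages)
record_based_flowfiles = 1

assert per_message_flowfiles == 1000
assert record_based_flowfiles == 1
assert len(batch_content.splitlines()) == 1000  # same data, one FlowFile
```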

 

Cheers,

André

 


New Contributor

Many thanks.

 

I want to confirm one more thing: it seems each content-accessing processor needs to read content from disk, even when two such processors are directly connected to each other. E.g., if a ConsumeKafkaRecord_2_0 feeds a PutElasticsearchHttpRecord, the former writes to disk while the latter reads from disk. However, if the content could be cached in memory (while still being synced to disk), one disk I/O would be saved. Is there any configuration property to make content cached in memory?

 

If such an option existed, it should improve overall throughput; otherwise, it seems better to merge all content-accessing processors into a single processor to save disk I/O. Correct?
