Because data is often "sent into NiFi" from other processes and "sent out of NiFi" to other processes, data safety is a legitimate concern. When data is exchanged between processes, there are three possible delivery semantics:
1. at-most-once delivery
2. at-least-once delivery
3. exactly-once delivery
First, for communication between NiFi/MiNiFi instances via the site-to-site (S2S) protocol, the two-phase commit provided at the framework level guarantees at-least-once delivery. Beyond that, it maximizes the chance of exactly-once delivery: exactly-once fails only if the system breaks between the two commits (literally between those two lines of code), or if an error occurs in NiFi while committing state. For example, suppose the network connection is lost after the data has been delivered to the receiving side but before the sending side commits. The sending side will resend the data once connectivity is restored, and you end up with duplicates on the receiving side. The way to handle that is to place a DetectDuplicate processor right after the S2S port.
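The failure window described above can be modeled with a minimal sketch. This is plain Python, not NiFi code; the `Receiver` class, `send_with_retry` helper, and the `crash_between_commits` flag are all hypothetical stand-ins for the S2S two-phase commit, and the dedupe loop at the end plays the role of DetectDuplicate keyed on a stable flowfile identifier.

```python
import uuid

class Receiver:
    """Hypothetical receiving side of an S2S-style two-phase commit."""
    def __init__(self):
        self.committed = []   # flowfiles the receiver has durably accepted
        self.pending = None

    def receive(self, flowfile):
        self.pending = flowfile              # phase 1: data transferred

    def commit(self):
        self.committed.append(self.pending)  # phase 2a: receiver commits first
        self.pending = None

def send_with_retry(receiver, flowfile, crash_between_commits=False):
    """Sender side: retry the whole transfer until its own commit succeeds."""
    while True:
        receiver.receive(flowfile)
        receiver.commit()                    # receiver has committed...
        if crash_between_commits:
            # ...but the connection drops before the SENDER commits, so the
            # sender cannot know the data arrived and must send it again.
            crash_between_commits = False
            continue
        return                               # phase 2b: sender commits

receiver = Receiver()
flowfile = {"id": str(uuid.uuid4()), "body": "hello"}
send_with_retry(receiver, flowfile, crash_between_commits=True)
print(len(receiver.committed))   # 2: the retry created a duplicate

# DetectDuplicate analogue: drop repeats of an already-seen identifier.
seen, unique = set(), []
for ff in receiver.committed:
    if ff["id"] not in seen:
        seen.add(ff["id"])
        unique.append(ff)
print(len(unique))               # 1: at-least-once plus dedupe
```

The ordering is the important part: because the receiver commits before the sender, a crash in between can only ever produce a duplicate, never a loss, which is exactly the at-least-once guarantee.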
In a very similar manner, when the NiFi S2S libraries are embedded in other systems such as Storm or Spark, the nature of the S2S protocol guarantees at-least-once delivery.
At the NiFi processor level, it is a case-by-case scenario, depending on how that processor handles data transport. For external systems that support two-phase commit, such as Kafka, a properly written processor can, like S2S, guarantee at-least-once delivery while maximizing the chance of exactly-once delivery. If the external system does not support two-phase commit (the syslog protocol, for example), there is nothing we can do at the NiFi framework level, since the external system is out of our control; NiFi can then only guarantee at-most-once delivery.
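To make that contrast concrete, here is a minimal Python sketch. It is not NiFi code: `FlakyChannel` is a made-up transport that deterministically loses every other send attempt, standing in for an unreliable network. The fire-and-forget loop behaves like syslog over UDP (at most once); the retry-until-acked loop behaves like a processor talking to an acknowledgment-capable system such as Kafka (at least once).

```python
class FlakyChannel:
    """A made-up transport that drops every other send attempt."""
    def __init__(self):
        self.attempts = 0
        self.inbox = []    # what the receiving system actually got

    def send(self, message):
        self.attempts += 1
        if self.attempts % 2 == 1:     # odd-numbered attempts are lost
            return False               # no acknowledgment comes back
        self.inbox.append(message)
        return True                    # acknowledged

messages = ["a", "b", "c", "d"]

# At most once: fire and forget, as with syslog over UDP.
fire_and_forget = FlakyChannel()
for msg in messages:
    fire_and_forget.send(msg)          # ack (if any) is ignored

# At least once: retry until acknowledged, as a processor can do when
# the external system confirms receipt.
retry_until_acked = FlakyChannel()
for msg in messages:
    while not retry_until_acked.send(msg):
        pass                           # resend after each lost attempt

print(fire_and_forget.inbox)    # ['b', 'd']: half the messages silently lost
print(retry_until_acked.inbox)  # ['a', 'b', 'c', 'd']: all delivered
```

Without an acknowledgment there is nothing to retry on, which is why the framework cannot upgrade a fire-and-forget protocol beyond at-most-once delivery.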