I have two sites (FE and BE), and I need to install a NiFi cluster on each. The FE site has to transfer data to the BE site using the site-to-site protocol.
I know that from time to time I have connectivity issues between the sites. Since I don't want to lose data, is it possible to configure the FE NiFi cluster to keep data for 2-3 days in case of a network disconnection between the sites? I know it will require more disk space.
If it is possible, which repository do I need to extend, the Content or the FlowFile repository? And how can it be done?
@dzbeda You should extend the Content repo as it will store the content of your flowfiles.
Please go through the best practices for setup https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-hi...
Also note that NiFi should not be used as a backup or failover mechanism, as it is meant for processing and not for storage. I would recommend storing the data in some centralised location from which you can store and retrieve it as your use case requires, so that in case of failure you also have a backup there. NiFi can hold data for a day or two, but data that is not processed stays in the content repository and can fill the content repo storage.
Can you share a little more about your use case?
NiFi does not expire data that is actively queued within connections between components added to the NiFi canvas, so I am a bit curious about the "I don't want to lose data" statement you made.
It is true that during "connectivity issues between the sites" NiFi FlowFiles may accumulate within the connection queues, so more storage is needed to hold the queued data while you wait for connectivity to be restored. That is still not a "data loss" concern unless your ingest uses some unconfirmed transfer protocol like UDP. NiFi's Site-to-Site protocol, used by Remote Process Groups, uses a two-phase commit to avoid data loss.
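To get a rough feel for how much extra storage an outage needs, here is a minimal sizing sketch. It assumes a steady ingest rate and simply multiplies it by the outage duration, plus some headroom. The function name, the 5 MB/s rate and the 30% headroom are made-up illustration values, not NiFi settings, and real usage also depends on things like archiving and content that processors rewrite.

```python
def required_content_repo_bytes(ingest_bytes_per_sec: int,
                                outage_hours: int,
                                headroom_percent: int = 30) -> int:
    """Estimate the disk space needed to queue content during an outage.

    Assumes a constant ingest rate (hypothetical) and adds a safety
    margin on top. Integer arithmetic keeps the result exact.
    """
    queued_bytes = ingest_bytes_per_sec * outage_hours * 3600
    return queued_bytes * (100 + headroom_percent) // 100


# Hypothetical example: 5 MB/s of ingest, 3 days (72 hours) of disconnection.
needed = required_content_repo_bytes(5_000_000, 72)
```

With these example numbers the estimate is 1,684,800,000,000 bytes, i.e. roughly 1.7 TB for the content repository on the FE side.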
Backpressure settings on each connection control how many FlowFiles can queue before the component feeding FlowFiles into the connection is no longer allowed to execute. So in an extended outage or at high volume, backpressure could end up being applied to every connection from the last component in your dataflow back to the first. The default thresholds are 10,000 FlowFiles or 1 GB of content size. Keep in mind these are soft limits. It is not advisable to simply set backpressure to some much larger value. I recommend reading the following article:
As for what happens when the content repo(s) are full (NiFi allows you to configure multiple content repos per NiFi instance): NiFi simply cannot generate any new content. Any component that tries to create new content (at ingest, or in a processor that modifies the content of an existing FlowFile) will fail with an out-of-disk-space exception when it tries to do so. This does not mean data loss (unless, as I mentioned, your ingest or egress uses an unconfirmed protocol). The component will simply try again until it succeeds once disk space becomes available (for example, when connectivity returns and data can be pushed out).
Using a confirmed protocol at ingest means the data simply remains on the source once backpressure is applied all the way back to your ingest components.
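The "soft limit" behaviour of backpressure mentioned above can be illustrated with a toy model. This is not NiFi code, just a sketch: the class and function names are made up, and treating 1 GB as 1024**3 bytes is an assumption. The idea it shows is that the threshold is checked before the upstream component runs, so a single execution can still push the queue past the threshold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Toy model of a connection queue with backpressure thresholds."""
    object_threshold: int = 10_000            # default FlowFile count from the thread
    size_threshold_bytes: int = 1024 ** 3     # "1 GB"; exact byte count is an assumption
    queued_sizes: List[int] = field(default_factory=list)

    def backpressure_applied(self) -> bool:
        # Backpressure applies as soon as either threshold is reached.
        return (len(self.queued_sizes) >= self.object_threshold
                or sum(self.queued_sizes) >= self.size_threshold_bytes)


def run_upstream(conn: Connection, batch_sizes: List[int]) -> bool:
    """Run the upstream component once, if backpressure allows it.

    The check happens before execution; the whole batch is then queued,
    which is why the queue can end up above the threshold.
    """
    if conn.backpressure_applied():
        return False
    conn.queued_sizes.extend(batch_sizes)
    return True
```

For example, with 9,999 FlowFiles already queued, one more run that emits 500 FlowFiles is still allowed and leaves 10,499 in the queue. Only the next run is held back, and so on upstream until the ingest component itself stops pulling data.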
NiFi archiving has nothing to do with how long FlowFiles are kept in NiFi's dataflow connections. Archiving holds FlowFiles after they have successfully been removed (reached a point of auto-termination in a dataflow). Archiving allows you to view old FlowFiles that are no longer queued, or to replay a FlowFile from any point in your dataflow. However, there is no bulk replay capability, so it is not useful for that.
Hope this helps,
I recommend not using NiFi and working with the console instead.
Using NiFi is not recommended because additional logs are generated.
It is recommended to divide the files you want to move into appropriately sized pieces, compress them, and send them from the console.
For HDFS, you can use the distcp command.
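If the manual route of dividing and compressing files is taken, the step could look roughly like the sketch below. It is only an illustration: the function name, the part-file naming scheme and the chunk size are all assumptions, and each chunk is read fully into memory, so the chunk size should be kept moderate.

```python
import gzip
from pathlib import Path
from typing import List


def split_and_compress(src: Path, out_dir: Path, chunk_bytes: int) -> List[Path]:
    """Split src into chunks of at most chunk_bytes and gzip each chunk.

    Returns the part files in order; decompressing and concatenating
    them in that order restores the original file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    with src.open("rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            # Hypothetical naming scheme: <name>.part0000.gz, <name>.part0001.gz, ...
            part = out_dir / f"{src.name}.part{index:04d}.gz"
            with gzip.open(part, "wb") as gz:
                gz.write(chunk)
            parts.append(part)
            index += 1
    return parts
```

The zero-padded index keeps the parts in the right order when they are listed alphabetically on the receiving side.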