Support Questions

How to replicate a file in NiFi

Explorer

I have an issue whereby I receive a file into NiFi on a single node but wish to copy it onto multiple nodes.

Has anyone done this before and what would be the process to do so?


Master Collaborator

Why do you want to do that? A NiFi cluster is not like HDFS, where data is replicated across nodes; NiFi is not a storage system.

What NiFi provides is this: let's say you have a 3-node cluster and you receive 100 files on one node (assuming the processor receiving the files runs on the primary node). If you now want to process all 100 files in a distributed manner using all 3 nodes, you need to enable load balancing on the queue connection to distribute the 100 files across the 3 nodes (33/33/34 files per node); the rest of the downstream processing is then done on all 3 nodes. An example flow design would be: ListSFTP (Primary) --> connection with load balancing set on the queue --> FetchSFTP. In this use case, ListSFTP lists the file details on the primary node, then FetchSFTP fetches the files from the SFTP server on all 3 nodes in a distributed manner.
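As a side note, if you'd rather script the load-balancing change than set it in the UI, a rough sketch against the NiFi REST API could look like this. The host, token, and connection id below are placeholders, and the loadBalanceStrategy field exists from NiFi 1.8 onward, so verify the details against your version's REST API docs.

```python
import requests

# All values below are placeholders - adjust to your environment.
NIFI_API = "https://nifi-host:8443/nifi-api"
CONNECTION_ID = "uuid-of-the-queue-between-ListSFTP-and-FetchSFTP"
HEADERS = {"Authorization": "Bearer <access-token>"}

# Fetch the connection first so we send back the current revision.
conn = requests.get(f"{NIFI_API}/connections/{CONNECTION_ID}",
                    headers=HEADERS, verify=False).json()

# Distribute queued flowfiles across the cluster round-robin.
update = {
    "revision": conn["revision"],
    "component": {
        "id": CONNECTION_ID,
        "loadBalanceStrategy": "ROUND_ROBIN",  # default is DO_NOT_LOAD_BALANCE
    },
}

resp = requests.put(f"{NIFI_API}/connections/{CONNECTION_ID}",
                    json=update, headers=HEADERS, verify=False)
resp.raise_for_status()
print("Strategy now:", resp.json()["component"]["loadBalanceStrategy"])
```

The same PUT also accepts PARTITION_BY_ATTRIBUTE or SINGLE_NODE if round-robin isn't what you need.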

 

Thanks 

 

Explorer

The file is placed by a third party onto an SFTP location on one node, but we want the file to be available on all nodes so that the receiving team can access it.  We'll then run a job to remove it on a daily basis prior to the next file being deposited.

Master Collaborator

Sorry, some confusion here. So you're saying the file is placed by a third party onto an SFTP location on one node? So at this stage the file is on the SFTP server, but only on one node? And then you fetch the file into NiFi using List/FetchSFTP processors running on the primary node only? Then what? What is the end goal? When you say "we want the file to be available on all nodes so that the receiving team can access it", do you mean all NiFi nodes or all SFTP nodes?

Explorer

The team logs into the system but cannot guarantee which data node in the cluster they connect to. Currently, the flow we have places the file (with the updated attributes - user name and file location) on the same data node it arrived on. What we want to do is take the original file, duplicate it, and place a copy on all nodes so that the end user can access the file when they log in, no matter what node they end up on.

I have managed to edit the flow using the DuplicateFlowFile processor mentioned below, but haven't managed to get the PutFile processor to direct the copies, one to each data node.

Explorer

[screenshot: Tryfan_0-1654858616827.png]

 


Have you set the load balancing strategy on the queue upstream of the PutFile to round robin? If that doesn't work for some reason, then your only option is to create a shared folder on all nodes and have a workflow that picks up the file, with as many PutFile processors as you have nodes, where each PutFile saves the file to a given node's shared folder.
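Just to make the shared-folder fallback concrete: outside of NiFi, the fan-out amounts to copying the same file into one folder per node. A minimal sketch of that idea (the mount paths below are made up for the example) would be:

```python
import shutil
from pathlib import Path

# Hypothetical mount points, one per cluster node - adjust to your environment.
NODE_FOLDERS = [
    Path("/mnt/node1/incoming"),
    Path("/mnt/node2/incoming"),
    Path("/mnt/node3/incoming"),
]

def fan_out(source_file: str) -> None:
    """Copy one source file into every node's shared folder."""
    src = Path(source_file)
    for folder in NODE_FOLDERS:
        folder.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, folder / src.name)
        print(f"copied {src.name} -> {folder}")

if __name__ == "__main__":
    fan_out("/data/landing/daily_extract.csv")  # example path only
```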


Hi,

There is a processor called DuplicateFlowFile, which you configure with the Number of Copies that you wish to have. I assume in your case, if you want to process it on multiple nodes, the number of copies should be n-1 (where n is the number of nodes, and -1 because you still have the original flowfile). On the downstream queue for this processor's success relationship, make sure to configure the Load Balance Strategy to "Round Robin" so that each flowfile is sent to a different node. Hope that helps.
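For example, with a 3-node cluster the DuplicateFlowFile processor would be set to 2 copies. If you want to set that property by script rather than in the UI, a rough sketch against the NiFi REST API could be the following. The processor id and token are placeholders, and the "Number of Copies" property key is taken from the UI label, so confirm it for your NiFi version.

```python
import requests

# Placeholders - a sketch against the NiFi REST API, not a drop-in script.
NIFI_API = "https://nifi-host:8443/nifi-api"
PROCESSOR_ID = "uuid-of-the-DuplicateFlowFile-processor"
HEADERS = {"Authorization": "Bearer <access-token>"}
CLUSTER_NODES = 3  # n

# Fetch the processor first so we send back the current revision.
proc = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}",
                    headers=HEADERS, verify=False).json()

update = {
    "revision": proc["revision"],
    "component": {
        "id": PROCESSOR_ID,
        "config": {
            # n - 1 copies, because the original flowfile is kept as well.
            # Property key taken from the UI label; confirm it on your version.
            "properties": {"Number of Copies": str(CLUSTER_NODES - 1)},
        },
    },
}

requests.put(f"{NIFI_API}/processors/{PROCESSOR_ID}",
             json=update, headers=HEADERS, verify=False).raise_for_status()
```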

Explorer

I've put in the DuplicateFlowFile processor as shown in the screenshot above. However, it is still only sending to a single data node. Is there a step or configuration I'm missing?


I'm not seeing load balancing set on the success relationship of the DuplicateFlowFile. Setting that queue's Load Balance Strategy to Round Robin is what spreads the copies across the cluster, so that the PutFile processor executes on each node (once for each flowfile coming out of the duplicate processor).