
How to replicate a file in NiFi

Explorer

I have an issue where I receive a file into NiFi on a single node but wish to copy it onto multiple nodes.

Has anyone done this before and what would be the process to do so?

16 REPLIES

Expert Contributor

Why do you want to do that? A NiFi cluster is not like HDFS, where data is replicated across nodes; NiFi is not a storage system.

What NiFi provides is this: let's say you have a 3-node cluster and you receive 100 files on one node, assuming the processor receiving the files runs on the primary node. If you want to process all 100 files in a distributed manner across all 3 nodes, you need to enable load balancing on the queue connection so the 100 files are spread across the 3 nodes (33/33/34 files per node); the rest of the downstream processing then runs on all 3 nodes. An example flow design would be ListSFTP (Primary) --> connection with load balancing set on the queue --> FetchSFTP. In this use case ListSFTP lists the file details on the primary node, then FetchSFTP fetches the files from the SFTP server on all 3 nodes in a distributed manner.
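For reference, the same queue setting can also be applied through NiFi's REST API rather than the UI. The sketch below is a minimal example and not part of the flow discussed in this thread: the host, connection ID, and auth handling are placeholders, and the field names (loadBalanceStrategy, loadBalanceCompression) should be verified against your NiFi version's REST API documentation.

```python
# Minimal sketch (not from this thread): set a queue's Load Balance Strategy to
# Round Robin through NiFi's REST API. Host and connection ID are placeholders;
# a secured cluster would also need an Authorization: Bearer <token> header.
import requests

NIFI = "https://nifi-node1.example.com:8443/nifi-api"   # hypothetical node
CONNECTION_ID = "replace-with-connection-uuid"          # queue between ListSFTP and FetchSFTP

# Fetch the current connection so we have its revision (required for updates).
conn = requests.get(f"{NIFI}/connections/{CONNECTION_ID}", verify=False).json()

update = {
    "revision": conn["revision"],
    "component": {
        "id": CONNECTION_ID,
        "loadBalanceStrategy": "ROUND_ROBIN",        # distribute FlowFiles across all nodes
        "loadBalanceCompression": "DO_NOT_COMPRESS",
    },
}

resp = requests.put(f"{NIFI}/connections/{CONNECTION_ID}", json=update, verify=False)
resp.raise_for_status()
print("Load balance strategy:", resp.json()["component"]["loadBalanceStrategy"])
```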

 

Thanks 

 

Explorer

The file is placed by a third party onto an SFTP location on one node, but we want the file to be available on all nodes so that the receiving team can access it.  We'll then run a job to remove it on a daily basis prior to the next file being deposited.

Expert Contributor

Sorry, some confusion here. So you're saying the file is placed by a third party onto an SFTP location on one node? So at this stage the file is on the SFTP server, but only on one node? And then you fetch the file into NiFi using List/FetchSFTP processors running on the primary node only? Then what? What is the end goal? When you say "we want the file to be available on all nodes so that the receiving team can access it", what do you mean by all nodes: the NiFi nodes or the SFTP nodes?

Explorer

The team logs into the system but cannot guarantee which data node in the cluster they connect to. Currently the flow we have places the file (with the updated attributes - user name and file location) on the same data node it arrived on. What we want to do is take the original file, duplicate it, and place a copy on all nodes so that when the end user logs in they can access the file, no matter what node they end up on.

I have managed to edit the flow using the DuplicateFlowFile processor mentioned below, but haven't managed to get the PutFile processor to direct one copy to each data node.

Explorer

(screenshot: Tryfan_0-1654858616827.png)

 

Super Collaborator

Have you set the load-balancing strategy on the queue upstream of the PutFile to round robin? If that doesn't work for some reason, then your only option is to create a shared folder on all nodes; you would then have a workflow that picks up the file and adds as many PutFile processors as you have nodes, where each PutFile saves the file to a given node's shared folder.

Super Collaborator

Hi,

There is a processor called DuplicateFlowFile which you configure with the Number of Copies that you wish to have. I assume in your case, if you want to process the file on multiple nodes, the number of copies should be n-1 (where n is the number of nodes, and -1 because you still have the original FlowFile). On the downstream queue for this processor's success relationship, make sure to configure the queue's Load Balance Strategy to "Round Robin" so that each FlowFile is sent to a different node. Hope that helps.
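To make the n-1 arithmetic concrete, here is a small sketch (again not from this thread) that derives the value for Number of Copies from the live cluster node count via the REST API. The host is a placeholder and the /controller/cluster response shape should be checked against your NiFi version's documentation.

```python
# Minimal sketch, assuming a reachable NiFi REST API: compute the "Number of Copies"
# value (n - 1) described above from the number of connected cluster nodes.
import requests

NIFI = "https://nifi-node1.example.com:8443/nifi-api"   # hypothetical node

cluster = requests.get(f"{NIFI}/controller/cluster", verify=False).json()
connected = [n for n in cluster["cluster"]["nodes"] if n.get("status") == "CONNECTED"]

n = len(connected)
copies = n - 1   # DuplicateFlowFile keeps the original, so only n - 1 extra copies are needed
print(f"{n} connected nodes -> set DuplicateFlowFile 'Number of Copies' to {copies}")
```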

Explorer

I've put in the DuplicateFlowFile processor as shown in the screenshot above. However, it is still only sending to a single data node. Is there a step or configuration I'm missing?

Super Collaborator

I'm not seeing load balancing set on the success relationship of the DuplicateFlowFile. Setting that is what will make the PutFile processor execute on each node (one execution for each FlowFile coming out of the duplicate processor).

Explorer

Current setup is as below.  

(screenshot: Tryfan_0-1655111441630.png)

 

Basically, to reiterate the concept: we wish to receive a single file uploaded to a single data node (1 of 7) and process it through NiFi so that an individual copy of that file is presented on every data node. Currently I am getting multiple copies created by the DuplicateFlowFile processor, but these are not being placed on the individual nodes.

Rising Star

I have done something similar here when I need to deliver jar files to all nodes. It's really a "this is not how things are done" situation, but in this case I did not have access to the nodes' file systems without doing it in a flow. That said, it works great! The first processor creates a FlowFile on all nodes (even when I don't know the number), then it checks whether the file exists and, if not found, proceeds to get the file and write it to the file system.

 

(screenshot: Screen Shot 2022-06-13 at 9.35.46 AM.png)

 

 

Master Guru

@Tryfan 

I think the concept of sending a file to one node is what needs to change here. By sending to a single node in the NiFi cluster you create a single point of failure. What happens if that one node in your 7-node cluster goes down? You end up with none of the nodes getting that file and an outage in your dataflow.

A better design is to place this file somewhere that all nodes can pull it from.

Maybe it is a file system commonly mounted on all 7 nodes (GetFile processor)?
Maybe an external SFTP server (GetSFTP processor)?
etc...

Then you construct a dataflow where all nodes are retrieving a file independently as needed.

Thanks,

Matt

Explorer

@MattWho - so we have the file coming into the system via a load balancer which, due to other intricacies, is configured to only ingest on one data node. We don't have the option of an SFTP server, so I have to figure this out on the canvas.

Expert Contributor

OK, so you get the copy of the file into NiFi using List/FetchFile, then you change the directory location where it has to be written using UpdateAttribute, then you use DuplicateFlowFile to create the duplicates. Up to this point, since ListFile runs on the primary node (and it should run on primary), all downstream flow is processed on the primary node, which is why you see DuplicateFlowFile creating duplicates only on whichever node is primary in the NiFi cluster. I do not see any issue here. If you wish to have the FlowFiles redistributed across all NiFi nodes after DuplicateFlowFile, then you need to set load balancing to "Round Robin" on the connection between DuplicateFlowFile and PutFile. By doing this, the duplicate FlowFiles will be distributed among the NiFi nodes one by one in round-robin fashion, and PutFile will write them out on each node the files were distributed to. Now one question: you mention 7 HDFS data nodes. Does that mean you have the NiFi service running on those same 7 data nodes, i.e. a 7-node NiFi cluster? Only in that case will all 7 nodes receive a copy of each file.

Explorer

So I have a single file coming into a single data node (one of 7 in a cluster). I need to fetch this file into NiFi and process it through the flow so that a copy is placed on all available data nodes.

I update the attributes to change the file's location on the destination and have a PutFile to place it there. I have since included a DuplicateFlowFile processor to copy the file 6 times (a total of 7 including the original), but with round robin on the connection, this isn't distributing the copies across the data nodes correctly.

I am now looking at adding a DistributeLoad processor after the DuplicateFlowFile and configuring it to direct flow to 7 individual PutFile processors. However, I'm unsure how to configure these to place the file on a specific data node.

Can I include the hostname in the directory field?

Master Guru

@Tryfan 

You mention this file comes in daily. You also mention that this file arrives through a load balancer, so you don't know which node will receive it. This means you can't configure your source processor for "Primary Node" only execution as you have done in your shared sample flow with the ListFile. With Primary Node only execution, the elected primary node will be the only node that executes that processor, so if the source file lands on any other node, it would not get listed.

You could handle this flow in the following manner:

GetFile ---> (7 success relationships) PostHTTP or InvokeHTTP (7 of these, with one configured for each node in your cluster)
ListenHTTP --> UpdateAttribute --> PutFile

(screenshot: MattWho_0-1656508418879.png)


So in this flow, no matter which node receives your source file, the GetFile will consume it. The FlowFile will then get cloned 6 times (so 7 copies exist), with one copy routed to each of 7 unique PostHTTP processors. Each of these targets the ListenHTTP processor listening on one node in your cluster, so the ListenHTTP processors collectively receive all 7 copies (one copy per node) of the original source file. Then use UpdateAttribute to set your username and location info before the PutFile, which places each copy in the desired location on that node.
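To illustrate the fan-out step, here is a rough sketch, outside NiFi and not part of Matt's flow itself, of the HTTP push each of the 7 InvokeHTTP/PostHTTP processors performs: one POST per node to that node's ListenHTTP endpoint. The node names, port, base path, and file name below are assumptions and must match whatever you configure on the ListenHTTP processor.

```python
# Minimal sketch of the fan-out the 7 PostHTTP/InvokeHTTP processors perform:
# push one copy of the file to a ListenHTTP endpoint on every node.
# Node names, the port (8081), base path ("contentListener"), and file name are
# placeholders and must match your ListenHTTP configuration.
import requests

NODES = [f"nifi-node{i}.example.com" for i in range(1, 8)]   # hypothetical 7-node cluster
LISTEN_PORT = 8081                                           # ListenHTTP "Listening Port"
BASE_PATH = "contentListener"                                # ListenHTTP "Base Path"

with open("daily_extract.csv", "rb") as f:                   # placeholder source file
    payload = f.read()

for node in NODES:
    url = f"http://{node}:{LISTEN_PORT}/{BASE_PATH}"
    # The custom header only becomes a FlowFile attribute if ListenHTTP's
    # "HTTP Headers to receive as Attributes (Regex)" property matches it.
    resp = requests.post(url, data=payload, headers={"filename": "daily_extract.csv"})
    resp.raise_for_status()
    print(f"delivered copy to {node}")
```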

If you add or remove nodes from your cluster, you would need to modify this flow accordingly, which is a major downside to such a design. Thus the best solution is still one where the source file is placed somewhere all nodes can retrieve it from, so it scales automatically.

If you found this response helped with your query, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt

 
