Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Load balancing while the fetching of file from a standalone NIFI instance doing the processing it in a NIFI cluster.

avatar
Rising Star

We have a NIFI setup where in we have a NIFI cluster installed in a hadoop cluster and a standalone NIFI instance running on another server.The input file will be generated in the file system of the server of the standalone instance.We are fetching the file using ListFile/FetchFile processor in the standalone instance.Then in the main cluster we are connecting to the standalone instance using RPG group and then send the output to the NIFI cluster(RPG) using site-site.As per my understanding the part of the processing done inside the cluster will be distributed.Is this understanding correct?I also would like to know if there is a way to distribute the fetching of the source file that we are doing in the standalone NIFI instance?

The flow we are using

In the standalone instance

ListFile->FetchFile->Outputport

In the NCM of the cluster

RPG(Standalone NIFI Instance)-->RPG(NIFI cluster)

Inputport->Routetext->PutHDFS(This is the processing done in the main cluster)

Let me know if you need any other information.Any inputs will be appretiated.

Regards,

Indranil Roy

1 ACCEPTED SOLUTION

avatar
Super Mentor
@INDRANIL ROY

The output from the SplitText and RouteText processors is a bunch of FlowFiles all with the same filename (filename of the original FlowFile they were derived from.) NiFi differentiates these FlowFiles by assigning each a Unique Identifier (UUID). The problem you have is then writing to HDFS only the first FlowFile written with a particular filename is successful. All others result in the error you are seeing. The MergeContent processor you added reduces the impact but does not solve your problem. Remember that nodes do not talk to one another or share files with one another. So each MergeContent is working on its own set of files all derived from the same original source file and each node is producing its own merged file with the same filename. The first node to successfully write its file HDFS wins and the other nodes throw the error you are seeing. What is typically done here is to add an UpdateAttribute processor after each of your MergeContent processor to force a unique name on each of the FlowFiles before writing to HDFS.

The uuid that NiFi assigns to each of these FlowFiles is often prepended or appended to the filename to solve this problem:

7401-screen-shot-2016-09-06-at-124041-pm.png

If you do not want to merge the FlowFiles, you can simply just add the UpdateAttribute processor in its place. YOu will just end up with a larger number of files written to HDFS.

Thanks,

Matt

View solution in original post

21 REPLIES 21

avatar
Rising Star

@mclark

Thanks for your inputs.The above solution worked perfectly fine in my case both in terms of the error and performance.But as you already mentioned above in this situation we have a large number of files in the HDFS. Even if I use a MergeContent processor in the flow I am getting more than I files.For what I can understand by looking at the provenence the MergeContent processor is merging files in block.Say we have 100 flow files coming to the MergeContent processor batches of 30,30,20,20.If will not wait for 100 files and generate 4 output files by merging in groups.Is there a way by which we can control this behavior and enforce it to produce only 1 output files for each output path.

mergecontent.png

This is the configuration of MergeContent processor.Any inputs will be very helpful.

Regards,

Indranil Roy

avatar
Super Mentor

The mergeContent Processor simply bins and merges the FlowFiles it sees on an incoming connection at run time. In you case you want each bin to have a min 100 FlowFiles before merging. So you will need to specify that in the "Minimum number of entries" property. I never recommend setting any minimum value without also setting the "Max Bin Age" property as well. Let say you only ever get 99 FlowFiles or the amount of time it takes to get to 100 exceeds the useful age of the data being held. Those Files will sit in a bin indefinitely or for excessive amount of time unless that exit age has been set.

Also keep in mind that if you have more then one connection feeding your mergeContent processor, on each run it looks at the FlowFiles on only one connection. It moves in round robin fashion from connection to connection. NiFi provides a "funnel" which allows you to merge FlowFiles from many connections to a single connection.

Matt