Thanks for your input. The solution above worked perfectly in my case, both in terms of the error and performance. But as you already mentioned, in this situation we end up with a large number of files in HDFS. Even if I use a MergeContent processor in the flow, I am getting more than 1 file. From what I can understand by looking at the provenance, the MergeContent processor is merging files in blocks. Say we have 100 FlowFiles coming to the MergeContent processor in batches of 30, 30, 20, and 20. It will not wait for all 100 files; instead it generates 4 output files by merging each group separately. Is there a way to control this behavior and force it to produce only 1 output file for each output path?
This is the configuration of the MergeContent processor. Any input would be very helpful.
The MergeContent processor simply bins and merges the FlowFiles it sees on an incoming connection at run time. In your case you want each bin to hold a minimum of 100 FlowFiles before merging, so you will need to specify that in the "Minimum Number of Entries" property. I never recommend setting any minimum value without also setting the "Max Bin Age" property. Let's say you only ever get 99 FlowFiles, or the time it takes to reach 100 exceeds the useful age of the data being held. Those FlowFiles will sit in a bin indefinitely, or for an excessive amount of time, unless Max Bin Age has been set.
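To make the interaction between the two properties concrete, here is a rough Python sketch of the binning idea. It is only a toy model, not NiFi code: the names `Bin` and `simulate` are made up for illustration, it uses a single bin, and it ignores everything else MergeContent does (correlation attributes, maximum entries, size limits).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Bin:
    """Hypothetical stand-in for one MergeContent bin (not a NiFi class)."""
    min_entries: int
    max_bin_age: Optional[float]  # seconds; None means no age limit is set
    created_at: Optional[float] = None
    entries: List[str] = field(default_factory=list)

    def add(self, flowfile: str, now: float) -> None:
        # The bin's age is counted from the arrival of its first entry.
        if not self.entries:
            self.created_at = now
        self.entries.append(flowfile)

    def ready(self, now: float) -> bool:
        if not self.entries:
            return False
        if len(self.entries) >= self.min_entries:
            return True
        # Without a max age, an under-filled bin is never released.
        return (self.max_bin_age is not None
                and now - self.created_at >= self.max_bin_age)


def simulate(batches: List[Tuple[float, int]],
             min_entries: int,
             max_bin_age: Optional[float]) -> Tuple[List[int], int]:
    """batches is a list of (arrival_time, flowfile_count) pairs.

    Returns (sizes of merged bundles, FlowFiles still waiting in the bin).
    """
    current = Bin(min_entries, max_bin_age)
    merged: List[int] = []
    for now, count in batches:
        # First release a bin that aged out before this run.
        if current.ready(now):
            merged.append(len(current.entries))
            current = Bin(min_entries, max_bin_age)
        for i in range(count):
            current.add(f"ff-{now}-{i}", now)
        # Then release the bin if this batch filled it.
        if current.ready(now):
            merged.append(len(current.entries))
            current = Bin(min_entries, max_bin_age)
    return merged, len(current.entries)


arrivals = [(0, 30), (1, 30), (2, 20), (3, 20)]
print(simulate(arrivals, min_entries=1, max_bin_age=None))    # four small bundles
print(simulate(arrivals, min_entries=100, max_bin_age=None))  # one bundle of 100
```

In this model, a minimum of 1 reproduces the four small output files from the question, while a minimum of 100 yields a single bundle. If only 99 FlowFiles ever arrive and no age limit is set, nothing is ever merged, which is the reason for pairing the minimum with Max Bin Age.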
Also keep in mind that if you have more than one connection feeding your MergeContent processor, on each run it looks at the FlowFiles on only one connection. It moves in round-robin fashion from connection to connection. NiFi provides a "funnel", which lets you combine FlowFiles from many connections into a single connection before they reach the processor.
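The difference a funnel makes can be pictured with another small Python sketch. Again this is a simplified model under stated assumptions, not NiFi internals: `run_round_robin` and `funnel` are hypothetical helpers, and each run is assumed to merge everything queued on the one connection it visits (as with a minimum of 1 entry).

```python
from collections import deque
from itertools import chain
from typing import List


def run_round_robin(connections: List[List[str]], runs: int) -> List[List[str]]:
    """Each run drains exactly one connection, cycling through them in order."""
    queues = [deque(c) for c in connections]
    bundles: List[List[str]] = []
    for run in range(runs):
        queue = queues[run % len(queues)]
        if queue:
            bundles.append(list(queue))
            queue.clear()
    return bundles


def funnel(connections: List[List[str]]) -> List[List[str]]:
    """Combine several connections into one, as a funnel would."""
    return [list(chain.from_iterable(connections))]


incoming = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]
print(run_round_robin(incoming, runs=3))          # one bundle per connection
print(run_round_robin(funnel(incoming), runs=1))  # a single combined bundle
```

With three separate connections the model produces three bundles; once the connections pass through a funnel first, a single run sees all six FlowFiles together.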