
Issue with NiFi MergeContent: files stay in the queue indefinitely!

I have a flow where I am using the MergeContent processor. I noticed recently that some FlowFiles stay indefinitely in the queue just before MergeContent. I can't figure out the issue, so I am asking for your help!

This is the part of the flow that I am talking about:

[Screenshot: 13492-1.png]

The MergeContent processor configuration is shown below (the correlation attribute is called "cle", and its value is the same for the two FlowFiles in the queue, yet they still don't merge):

[Screenshot: 13493-2.png]
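For readers who can't see the attachment: based on the details given later in this thread, the relevant MergeContent properties would look roughly like this. The merge strategy is an assumption (a correlation attribute is only honored by the Bin-Packing Algorithm strategy), and the entry counts are as stated further down in the thread:

    Merge Strategy               = Bin-Packing Algorithm
    Correlation Attribute Name   = cle
    Minimum Number of Entries    = 2
    Maximum Number of Entries    = 2
    Max Bin Age                  = (unset by default; with no value,
                                    an incomplete bin can wait forever)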

Finally, here is the content of the queue:

[Screenshot: 13494-3.png]

Is this due to the size of the first FlowFile (710 MB)? Is there a maximum size for a bin? If so, why isn't the bin merged once it reaches that size?

Thank you for your help!


13 REPLIES

@Mohammed El Moumni

By default, a queue applies back pressure at 1 GB of data or 10,000 FlowFiles, whichever is reached first.

To change these settings, right-click the queue's connection, choose "Configure", and go to the Settings tab. See the screenshot attached.

If this helps, please vote for or accept the response.

[Screenshot: 13505-screen-shot-2017-03-10-at-100448-am.png]
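For reference, the two connection settings involved (as they appear on the connection's Settings tab) and their default values:

    Back Pressure Object Threshold      = 10000   (FlowFile count)
    Back Pressure Data Size Threshold   = 1 GB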

It is also possible that another queue or processor downstream is stuck because of these same default limits. You may need to raise the thresholds there as well, and let those processors work the backlog down, before the queue you are watching starts to drain. Picture the whole flow as a river fed by many streams, each with its own obstructions...

Hi @Constantin Stanca, I changed the back pressure data size to 2 GB, but the two FlowFiles still don't merge.

[Screenshot: 13554-4.png]

Mentor
@Constantin Stanca

@Mohammed El Moumni

Queue thresholds are applied per node, and once a threshold is exceeded the queue simply stops accepting additional FlowFiles. It does not prevent the downstream processor from processing FlowFiles that are already in that queue.

Had he received two 700 MB CSV files on one node, the 1 GB threshold would have been exceeded, preventing any additional FlowFiles from entering that queue (including the corresponding 70-byte header files). In that case he really would be stuck, since MergeContent would never have all of the files needed to complete a bin on a single node.

Thanks,

Matt

Rising Star

@Mohammed El Moumni

If you take a look at the details of the FlowFiles in MergeContent's input queue, do you see the correlation attribute present on both FlowFiles? Is it possible that, elsewhere in the flow, a FlowFile with the same correlation ID as one of the two queued FlowFiles was sent to a failure relationship and dropped from the flow? In the past, I processed the output of one of the Split* processors and hit errors on one of the fragments. Because of how I had designed the flow, the failing fragment was routed to a failure relationship and terminated, so not all of the split's fragments reached MergeContent. That left all the other fragments sitting in MergeContent's incoming queue indefinitely.
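For context on the Split*/Merge scenario described above: reassembling split fragments is usually done with MergeContent's Defragment strategy, which relies on attributes the Split* processors write, so a bin can never complete once a fragment has been dropped. A rough sketch of that mode (not the configuration used in this question, which merges on a correlation attribute instead):

    Merge Strategy = Defragment
    # relies on FlowFile attributes written by the Split* processors:
    #   fragment.identifier - shared by all fragments of one split
    #   fragment.count      - total fragments the bin must collect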

Hi @Jeff Storck, the correlation attribute is present on both FlowFiles and has the same value. I am also sure that exactly two FlowFiles carry any given correlation value, so with Minimum Number of Entries = 2 and Maximum Number of Entries = 2, those two FlowFiles are the only candidates for the merge. Still, the two FlowFiles in the screenshot stay in the queue indefinitely. I suspect a size problem, but I can't pin it down.

Rising Star

@Mohammed El Moumni Are other, smaller files merging? I notice in both of your screenshots that the MergeContent processor is stopped, which will prevent files from being merged. Was the processor stopped just to take the screenshots?

@Jeff Storck Yes, the processor was stopped just to take the screenshots (I left it running for a day and the two files didn't merge). And yes, smaller files do merge (15 MB files, for example).

Mentor (ACCEPTED SOLUTION)

@Mohammed El Moumni

Each Node in a NiFi cluster runs its own copy of the dataflow and works on its own set of FlowFiles.

[Screenshot: 13628-screen-shot-2017-03-14-at-14329-pm.png]

Looking at the screenshot of your queue listing above, you can see that the two FlowFiles are not on the same node. Each node is therefore running its own MergeContent processor, and each is waiting for another FlowFile to complete its bin. You will need to look earlier in your dataflow at how the data is ingested by your nodes, and make sure that matching sets of files end up on the same node for merging.

Thanks,

Matt

Rising Star

good eyes @Matt Clarke 🙂

Mentor

@Raj B

Thank you... Sometimes the most important piece of information is in the fine details. Another giveaway that this was a clustered setup was that both FlowFiles in the queue had the same position, "1". Two FlowFiles in the same queue on the same node cannot occupy the same position.

@Matt Clarke This is an excellent answer, thank you very much. I am indeed using a cluster of NiFi nodes, and my dataflow starts with a list/fetch pattern as described in @Pierre Villard's answer to this question: https://community.hortonworks.com/questions/52112/nifi-load-distribution-in-getfile-processor.html

So the beginning of my dataflow looks like this:

[Screenshot: 13647-5.png]

I am using the list/fetch pattern to take advantage of the cluster and improve the performance of the ingestion.
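For readers unfamiliar with the pattern, the start of the flow would look roughly like this (the exact processors are an assumption based on the linked answer; the key point is that listing runs once while fetching is distributed across the cluster):

    ListFile (scheduled on the primary node only; emits attribute-only FlowFiles)
        -> Remote Process Group (pointing back at this same cluster)
            -> Input Port
                -> FetchFile (runs on every node and pulls the actual content)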

This leads me to a follow-up question that is probably beyond the scope of the original one and should perhaps be its own post, but I am putting it here so that everyone in the same situation can benefit from your excellent answers: does this mean I can't use the MergeContent processor in this kind of dataflow (one that runs on all nodes), since I have no way to control which node will ingest a pair of matching FlowFiles (FlowFiles that share the same "cle" attribute)? Or can you think of a trick to handle this?

Thanks again for your help!

Mentor

@Mohammed El Moumni Here is one possible dataflow design that can be used to make sure both FlowFiles in a pair end up on the same node after being distributed via the Remote Process Group (RPG):

[Screenshot: 13715-screen-shot-2017-03-17-at-105928-am.png]

While it requires adding five more processors to your flow, the overhead is relatively light, since you are dealing with very small FlowFiles all the way up to the FetchFile processor. You still fetch the ~700 MB content only after the cluster distribution.
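As an aside for anyone reading this thread on a newer release (the thread predates the feature): NiFi 1.8.0 added load balancing on connections, which solves the same problem without any extra processors. On the connection feeding MergeContent you would set, roughly:

    Load Balance Strategy   = Partition by attribute
    Attribute Name          = cle

FlowFiles sharing the same "cle" value are then guaranteed to be sent to the same node.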

Thanks,

Matt

Great answer, as usual! I just tested your suggestion and it works perfectly. Thank you so much!
