MergeContent defrag errors when handling multiple Fragments at Once

Solved

Expert Contributor

Hi

I have the processors UnpackContent -> MergeContent. I use this to untar a file and then zip the resulting files back up. I am using the Defragment merge strategy, and I have noticed that when MergeContent has to handle many FlowFiles from many different fragments at once (the FlowFile queue builds up before MergeContent), I get "Expected number of fragments is X but only getting Y" errors.
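For context, a minimal sketch (in Python, with illustrative identifier and values) of the fragment attributes UnpackContent writes on each unpacked FlowFile, which the Defragment strategy depends on:

```python
# Sketch of the fragment attributes UnpackContent sets on each unpacked
# FlowFile (attribute names per the NiFi docs; values here are illustrative).
# MergeContent's Defragment strategy groups FlowFiles on fragment.identifier
# and can only merge a group once all fragment.count pieces are in one bin.
unpacked = [
    {"fragment.identifier": "tar-1",   # same for every file from one tar
     "fragment.index": i,              # position within the original tar
     "fragment.count": 3}              # total files unpacked from the tar
    for i in range(3)
]

# Defragment can merge only when every expected fragment has arrived:
assert len(unpacked) == unpacked[0]["fragment.count"]
```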

Simply routing failures back to MergeContent, or adding a run-schedule delay, worked around the problem, but I am wondering why it happens in the first place.

Thanks,

1 ACCEPTED SOLUTION


Re: MergeContent defrag errors when handling multiple Fragments at Once

Master Guru

@mliem

Would you mind sharing your MergeContent processor's configuration?

How large is the volume of tar files coming into your flow?

How many concurrent tasks do you have on your UnpackContent?

I ask because all of these may play a part in the behavior you are seeing.

My first thought is that you have too few bins configured in your MergeContent processor. MergeContent places FlowFiles from the incoming queue into bins based on the configured "Correlation Attribute Name" (in your case this should be "fragment.identifier"). If the processor runs out of available bins, the oldest bin is merged to make room. In your case, since that oldest bin is incomplete (it does not contain all of its fragments), it is routed to failure.

For example, say "Maximum number of Bins" is set to 10 and your incoming queue contains FlowFiles produced from more than 10 original tar files. MergeContent may then need to create an 11th bin before all the FlowFiles that correlate to the existing bins have been processed.
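The binning behavior described above can be sketched with a toy model (this is an illustration, not NiFi source code; the function and sample data are hypothetical):

```python
# Toy model of Defragment binning: when a new fragment.identifier arrives
# and all bins are in use, the oldest bin is forced out -- complete bins
# merge, incomplete ones go to failure.
from collections import OrderedDict

def bin_flowfiles(flowfiles, max_bins):
    bins = OrderedDict()              # fragment.identifier -> fragments so far
    merged, failures = [], []
    for ff in flowfiles:
        ident = ff["fragment.identifier"]
        if ident not in bins and len(bins) >= max_bins:
            old_id, old_bin = bins.popitem(last=False)   # evict oldest bin
            if len(old_bin) == old_bin[0]["fragment.count"]:
                merged.append(old_id)
            else:
                failures.append(old_id)                  # incomplete -> failure
        bins.setdefault(ident, []).append(ff)
        if len(bins[ident]) == ff["fragment.count"]:     # bin complete
            merged.append(ident)
            del bins[ident]
    return merged, failures

# Two tar files, each unpacked into 2 fragments, arriving interleaved:
ffs = [{"fragment.identifier": t, "fragment.count": 2}
       for t in ("tarA", "tarB", "tarA", "tarB")]

# With only 1 bin, every eviction hits an incomplete bin:
assert bin_flowfiles(ffs, max_bins=1)[0] == []           # nothing merges
# With enough bins for the interleaving, both tars merge cleanly:
assert bin_flowfiles(ffs, max_bins=2) == (["tarA", "tarB"], [])
```

With interleaved fragments and too few bins, every new identifier forces out a partial bin, which matches the failures you are seeing.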

There are a few things you could try here, listed from most recommended to least:

1. Increase the "Maximum number of Bins" property in MergeContent.

2. Add the "OldestFlowFileFirstPrioritizer" to the "Selected Prioritizers" list on the queue feeding your MergeContent. This will have a small impact on throughput. When UnpackContent splits your tar files, all the split files will have similar FlowFile creation timestamps; with this prioritizer, FlowFiles will be placed in bins in timestamp order. Even with this strategy, you would still need to set the number of bins to the number of concurrent tasks assigned to your UnpackContent processor plus one.

3. Decrease the "Back Pressure Object Threshold" on the incoming queue to the MergeContent processor. This is a soft limit: if you set it to 1000 and a single UnpackContent untar produces 2000 FlowFiles, the queue will jump to 2000, and UnpackContent will then stop running until the queue drops back below 1000. This leaves fewer FlowFiles for your MergeContent processor to bin at any one time (meaning fewer bins are needed).
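The soft-limit behavior in option 3 can be illustrated with a small sketch (a hypothetical toy function, not NiFi code):

```python
# Toy model of NiFi's soft back-pressure threshold: UnpackContent is
# scheduled only while the downstream queue is below the threshold, but a
# single run can push the queue depth well past it.
def run_unpack(queue_depth, threshold, files_per_tar):
    if queue_depth >= threshold:
        return queue_depth, False   # back pressure: processor not scheduled
    return queue_depth + files_per_tar, True

# Threshold 1000, one tar unpacks into 2000 FlowFiles (the numbers above):
depth, ran = run_unpack(0, threshold=1000, files_per_tar=2000)
assert (depth, ran) == (2000, True)     # queue jumps past the soft limit
# UnpackContent is now blocked until the queue drains below the threshold:
assert run_unpack(depth, threshold=1000, files_per_tar=2000) == (2000, False)
```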

Thanks, Matt


Re: MergeContent defrag errors when handling multiple Fragments at Once

Expert Contributor

@Matt Clarke

Thanks Matt, very useful info.

It was about 20 tar files, which turned into almost 1000 individual files that I wanted to zip back into 20 files. The major problem was the bin count: it was set to 1. Once I increased it, MergeContent had no problem with the multiple tar files that were queued up.

I only had 1 concurrent task, so I was surprised that even with 1 bin it would try to create a new bin. The selected prioritizer was the default first-in-first-out, so if it is untarring one tar file at a time, it should finish a whole bin before moving on to the next one.
