Custom MergeContent based on different values for an attribute

We have seven different branches/lines of flow, and we want to merge the output FlowFiles from their terminal processors. These seven branches start from a single GenerateFlowFile, which is scheduled to run 3 times a day (1 AM, 7 AM and 12 PM). However, the execution times of the branches differ, so we need to wait for each one of them to finish. At the merging part (funnel + MergeContent processor), we use the batch-time (trigger timestamp) as an attribute in order to bundle the 7 output FlowFiles. This works well for the most part, but on one occasion a fault in one of the branches caused it to output two FlowFiles for the same cycle instead of one.

For example, these FlowFiles were bundled together while Branch 7 had not yet finished:

FlowFile1

FlowFile2

FlowFile3

FlowFile4

FlowFile5

FlowFile6

FlowFile6

Assuming we use a FlowFile attribute identifying the branch name, how can we bundle the same batch-time FlowFiles while guaranteeing that all branches are covered in the bundle?

Re: Custom MergeContent based on different values for an attribute

@J. D. Bacolod

Both FlowFile6 instances have the same content:-

If both FlowFile6 instances have identical content, one of them is a duplicate. You can use the HashContent processor followed by the DetectDuplicate processor to detect FlowFiles with duplicate content.

From the DetectDuplicate processor, connect the non-duplicate relationship to the MergeContent processor, so that only one FlowFile6 reaches MergeContent instead of two.

See the link below for how to configure the HashContent and DetectDuplicate processors:

https://community.hortonworks.com/questions/107683/files-detected-twice-with-listfile-processor.html
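
To illustrate the idea behind the HashContent + DetectDuplicate pairing, here is a minimal Python sketch. It is not NiFi code and not how the processors are implemented internally; the function names and the choice of SHA-256 are assumptions made for the example. It hashes each FlowFile's content and routes any FlowFile whose hash has already been seen to a "duplicate" list.

```python
import hashlib


def content_hash(content: bytes) -> str:
    """Return a hex digest of the content (SHA-256 is an assumption here)."""
    return hashlib.sha256(content).hexdigest()


def split_duplicates(flowfiles):
    """Split (name, content) pairs into non-duplicate and duplicate names.

    The first FlowFile with a given content hash is kept; any later one
    with the same hash is treated as a duplicate.
    """
    seen = set()
    unique, duplicates = [], []
    for name, content in flowfiles:
        digest = content_hash(content)
        if digest in seen:
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates
```

Only the names in the first list would continue on to the merge step, which mirrors connecting the non-duplicate relationship to MergeContent.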

(or)

Both FlowFile6 instances have different content:-

Use an UpdateAttribute processor to add an attribute whose value is unique to each of your seven branches. Then, once a branch has produced its results, pass them through a DetectDuplicate processor.

Configs:-

[Screenshot: DetectDuplicate processor configuration]

Cache Entry Identifier

${<unique-attribute-name>} //name of the attribute that was created in the UpdateAttribute processor

Age Off Duration

<depends on how soon a second FlowFile may be sent by the same processor> //time window between two FlowFiles produced by the same trigger

Example:-

Let's say the time between the two FlowFile6 instances is 2 minutes. Set Age Off Duration to the same value, so that the DetectDuplicate processor keeps the cache entry for the first FlowFile for 2 minutes.

By using the attribute value together with the age-off duration, we can detect duplicate FlowFiles (independently of their content) and remove them before they reach the MergeContent processor.
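
The behaviour described above can be sketched in Python as follows. This is only a rough model of an identifier cache with an age-off window, not the actual DetectDuplicate implementation; the class name, the method name, and the "branch-6" style identifiers are hypothetical. Timestamps are passed in explicitly (in seconds) so that the logic is easy to follow.

```python
class AgeOffCache:
    """Remember identifiers for a limited time (the age-off duration)."""

    def __init__(self, age_off_seconds: float):
        self.age_off_seconds = age_off_seconds
        self._first_seen = {}  # identifier -> time it was cached

    def is_duplicate(self, identifier: str, now: float) -> bool:
        """Return True if the identifier was cached within the age-off window."""
        seen_at = self._first_seen.get(identifier)
        if seen_at is not None and now - seen_at < self.age_off_seconds:
            return True
        # Never seen, or the old entry has aged off: cache it again.
        self._first_seen[identifier] = now
        return False


# Age off after 2 minutes, as in the example above.
cache = AgeOffCache(age_off_seconds=120)
first = cache.is_duplicate("branch-6", now=0.0)     # first FlowFile6
second = cache.is_duplicate("branch-6", now=60.0)   # second one, 1 minute later
later = cache.is_duplicate("branch-6", now=200.0)   # next cycle, entry aged off
```

In this model the second FlowFile6 is flagged as a duplicate because it arrives inside the 2-minute window, while a FlowFile from a later cycle is accepted because the earlier entry has expired.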