
Order of files in the MergeContent processor

Explorer

Hello, I am new to NiFi.
Suppose I have multiple files to be merged, named 0001_0, 0002_0, 0003_0, and so on,
and the number of files to be merged is determined by the original file, which is not something we can control.
My question is: how do I use the MergeContent processor to do the merge?
Thanks

10 REPLIES

Master Guru
@Yu-An Chen

You can use either of the Merge Strategies:

  1. 'Defragment' combines fragments that are associated by attributes back into a single cohesive FlowFile (see the sketch just below this list).
  2. 'Bin-Packing Algorithm' generates a FlowFile populated by arbitrarily chosen FlowFiles; there are a number of other properties to configure depending on what you are trying to achieve.
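For the 'Defragment' strategy, a minimal sketch of the key MergeContent settings (the Merge Format value is just one common choice):

MergeContent:
    Merge Strategy : Defragment
    Merge Format   : Binary Concatenation

Defragment expects every incoming flowfile to carry the fragment.identifier (same value for all pieces of one original file), fragment.index (position of the piece) and fragment.count (total number of pieces) attributes, and it reassembles the pieces in fragment.index order.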

Please refer to the HCC threads below regarding MergeContent processor usage and configuration:

https://community.hortonworks.com/questions/64337/apache-nifi-merge-content.html

https://community.hortonworks.com/questions/161827/mergeprocessor-nifi-using-the-correlation-attribu...

https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.ht...

Let us know if you are facing any issues!

Master Guru
@Yu-An Chen

Could you please add more details (like flow/config screenshots, sample input data, expected output) regarding what you are trying to achieve, so that we can understand your requirements clearly?

Explorer

Here is the flow (flow.jpg):

flow.jpg
It can be divided into two parts:
1. The first part (above) runs some Hive-like queries to generate some files. Please refer to result.jpg for a look.

result.jpg

2. The second part (bottom) does the merge and sends the final data to SFTP. But the original data is divided into several parts. If I use the MergeContent processor with the Bin-Packing Algorithm, it merges the files in an order like 1→0→2; however, I want them in the order 0→1→2.
How can I achieve this?

Master Guru

@Yu-An Chen

Instead of using the GetHDFS processor, use the ListHDFS/FetchHDFS processors and then use the MergeContent processor for the merge.

The ListHDFS processor stores state and checks only for new files created after the stored state.
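As a rough sketch of this List/Fetch pattern (the Hadoop config and directory paths below are examples, adjust them to your environment):

ListHDFS:
    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory                      : /user/nifi/output

FetchHDFS:
    Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    HDFS Filename                  : ${path}/${filename}

ListHDFS emits one flowfile per HDFS file and writes the path and filename attributes on it, which FetchHDFS then uses to pull the actual content.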

(or)

Create a Hive table on top of this HDFS directory, then use the SelectHiveQL processor with your query:

select * from <db.name>.<tab_name> order by <field-name> asc

Then you don't need to use the MergeContent processor; you can feed the result of the SelectHiveQL processor directly to PutSFTP.
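A minimal sketch of that SelectHiveQL configuration (the table/column names are placeholders, and a Hive connection pool controller service named HiveConnectionPool is assumed to be set up separately):

SelectHiveQL:
    Hive Database Connection Pooling Service : HiveConnectionPool
    HiveQL Select Query                      : select * from <db.name>.<tab_name> order by <field-name> asc
    Output Format                            : CSV

Switching the output format to CSV (instead of the Avro default) keeps the content as plain text for PutSFTP.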

(or)

Once the merge is completed, if you want to order by some field in the flowfile content, you can use the QueryRecord processor and add a new dynamic property with a value like

select * from flowfile order by <field-name> asc

then use that relationship to connect to the PutSFTP processor.

https://community.hortonworks.com/articles/121794/running-sql-on-flowfiles-using-queryrecord-process...
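As a sketch of the QueryRecord setup (the dynamic property name sorted is arbitrary, and the CSV reader/writer controller services are only an assumption about your data format):

QueryRecord:
    Record Reader  : CSVReader
    Record Writer  : CSVRecordSetWriter
    sorted         : select * from flowfile order by <field-name> asc

The dynamic property name (sorted here) becomes a new relationship on the processor, and that is the relationship you route to PutSFTP.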

(or)

Consider an EnforceOrder processor before the MergeContent processor; it enforces the order of the flowfiles reaching MergeContent.

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache...
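A rough sketch of how EnforceOrder could be used here, assuming an integer order attribute (named file.index below, a name made up for this example) is first extracted from the filename with an UpdateAttribute processor:

UpdateAttribute:
    file.index       : ${filename:substringBefore('_'):toNumber()}

EnforceOrder:
    Group Identifier : ${path}
    Order Attribute  : file.index

EnforceOrder then releases the flowfiles in ascending file.index order before they reach MergeContent; the remaining EnforceOrder properties (initial/maximum order, wait timeout) may need to be adjusted to match how your files are numbered.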

-

If the answer addressed your question, click the Accept button below to accept it. That would be a great help to community users looking for a quick solution to these kinds of issues.

Explorer

@Shu

I have tried method 1 as you suggested.
The overall flow is ListHDFS → RouteOnAttribute → FetchHDFS → MergeContent → PutSFTP.

I tried to use RouteOnAttribute as a filter; I added a new property called filetofetch and filled it with ${filename:contains('000')}.
But the question is how to fill in the HDFS Filename property of the FetchHDFS processor?

Master Guru
@Yu-An Chen

The issue is that the relationships feeding from FetchHDFS to the MergeContent processor are comms.failure and failure; please use the success relationship to feed the MergeContent processor.

Flow:

77559-flow.png

Please change the feeding relationships as per the above screenshot.

In addition, save and upload this template to your instance for reference, and configure the MergeContent/PutSFTP processors:

order-files-merge-content-194166.xml

Master Guru

@Yu-An Chen

Keep the FetchHDFS processor configs as below.

77546-fetchhdfs.png

Give the core-site.xml and hdfs-site.xml paths in the Hadoop Configuration Resources property.

Keep the HDFS Filename property value as

${path}/${filename}

${path} and ${filename} are the attributes needed by the FetchHDFS processor to fetch the file from the HDFS directory.

These attributes are associated with each flowfile and are added by the ListHDFS processor.
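Putting this together with your RouteOnAttribute filter, a minimal sketch of the filter-and-fetch steps looks like this (filetofetch is the property name from your own flow; everything else relies on the attributes written by ListHDFS):

RouteOnAttribute:
    filetofetch   : ${filename:contains('000')}

FetchHDFS:
    HDFS Filename : ${path}/${filename}

The filetofetch relationship of RouteOnAttribute feeds FetchHDFS, and FetchHDFS resolves which file to pull from the path and filename attributes of each incoming flowfile.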

Explorer

@Shu

Yes, I added a RouteOnAttribute processor to filter the files I need,
but the question is that I don't know how to fill in the Expression Language for FetchHDFS.
I have tried this one but it doesn't work... filetofetch.jpg

Explorer

@Shu

I have tried the settings as you mentioned,
and the FetchHDFS processor has an input flow,
but there is no output flow... (I have waited for several minutes).
Could you help me take a look at my settings?
flow: flow.jpg
listHDFS: listhdfs.jpg
fetchHDFS: fetchhdfs.jpg

Thanks a lot for your help.