combine two files in nifi

Explorer

I am trying to combine two CSV files into one file using NiFi, but I have not been able to, because NiFi acts on only one file at a time. Is there any way to combine the two files in NiFi?

1 ACCEPTED SOLUTION

@Tinkle Mahendru

Have you tried using the MergeContent processor?

6 REPLIES 6

Super Mentor

@Tinkle Mahendru

By default, the MergeContent processor is configured with "Minimum Number of Entries" set to 1. When the processor runs, it looks at the current incoming queue and bins FlowFiles based on its configuration. If the incoming queue has only 1 FlowFile in it at the time it runs, it will merge because the minimum value above has been met (basically, the file passes through with no merging). So you will need to play around with the configs for this processor to get your desired results based on your dataflow needs/volume.
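To make the effect of the minimum concrete, here is a toy Python model of the behavior described above. It is only a sketch and not how NiFi implements MergeContent: the function name `merge_if_ready` and the list-of-strings queue are made up for illustration, and other settings that influence merging are ignored.

```python
from typing import List, Optional


def merge_if_ready(queue: List[str], min_entries: int) -> Optional[str]:
    """Toy model: merge everything queued once the minimum count is met."""
    if len(queue) < min_entries:
        return None  # not enough FlowFiles yet; leave them queued
    merged = "".join(queue)
    queue.clear()  # merged FlowFiles leave the queue
    return merged


# With a minimum of 1, a lone FlowFile is "merged" with nothing else.
print(merge_if_ready(["id,name\n"], min_entries=1))
# With a minimum of 2, the same queue is left waiting for a second file.
print(merge_if_ready(["id,name\n"], min_entries=2))
```

In this model, raising the minimum to 2 is what makes the two files end up in one output.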

Matt

Super Mentor

@Tinkle Mahendru

How do you identify your files as containing CSV data without looking at each file's content?

Does the filename indicate that it contains CSV data?

Assuming all your CSV files have a .csv filename extension, you could use the RouteOnAttribute processor to route files whose filename ends in .csv to your MergeContent processor. All other FlowFiles, whose filenames do not end in .csv, could then be routed elsewhere in your dataflow.

You would add a new custom property as follows to the routeOnAttribute processor:

[Screenshot: 15780-screen-shot-2017-05-25-at-83251-am.png]

Each added dynamic property becomes a new relationship for this processor.
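The exact property from the screenshot is not reproduced here. As an assumption, a rule of this kind would typically be a NiFi Expression Language check along the lines of `${filename:endsWith('.csv')}`. The Python sketch below only mimics that routing decision outside NiFi; the relationship name `csv` and the dictionary layout are hypothetical.

```python
from typing import Dict, List

FlowFile = Dict[str, str]  # hypothetical stand-in: attribute name -> value


def route_on_filename(flowfiles: List[FlowFile]) -> Dict[str, List[FlowFile]]:
    """Send FlowFiles whose filename ends in .csv to 'csv', the rest to 'unmatched'."""
    routes: Dict[str, List[FlowFile]] = {"csv": [], "unmatched": []}
    for ff in flowfiles:
        name = ff.get("filename", "")
        # Case-sensitive suffix check on the filename attribute only;
        # the content is never read.
        key = "csv" if name.endswith(".csv") else "unmatched"
        routes[key].append(ff)
    return routes


routes = route_on_filename([
    {"filename": "orders.csv"},
    {"filename": "notes.txt"},
    {"filename": "customers.csv"},
])
print([ff["filename"] for ff in routes["csv"]])
print([ff["filename"] for ff in routes["unmatched"]])
```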

Let's say there is no extension. In that case, you may be able to use the RouteOnContent processor to look at the content of each FlowFile for an indicator that it is CSV data and route that way. Of course, reading content is a more expensive operation, in terms of resources, than evaluating attributes.

The MergeContent processor has virtual bins where it groups incoming FlowFiles before merging all the FlowFiles assigned to that bin. The Correlation Attribute Name property provides a way for you to control which FlowFiles are put in which bin. FlowFiles are made up of FlowFile attributes (key/value pairs, basically metadata) and FlowFile content (your actual data). You can use various processors (e.g. UpdateAttribute) to add and manipulate attributes on a FlowFile. If you configure your MergeContent processor to use a correlation attribute, NiFi will look for the attribute key you specify and bin FlowFiles with the exact same value into the same virtual bin. I do not believe this is what you are looking for to solve your use case.
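The binning idea can be sketched in a few lines of Python. This is an illustration of grouping by an attribute value, not NiFi code: the `attributes`/`content` dictionary layout and the attribute name `batch` are assumptions, and placing FlowFiles that lack the attribute into a shared `None` bin is a choice made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def bin_by_attribute(flowfiles: List[dict], attribute_name: str) -> Dict[Optional[str], str]:
    """Group contents by the value of one attribute, then merge each group."""
    bins: Dict[Optional[str], List[str]] = defaultdict(list)
    for ff in flowfiles:
        # FlowFiles with exactly the same attribute value share a bin.
        value = ff["attributes"].get(attribute_name)
        bins[value].append(ff["content"])
    return {value: "".join(contents) for value, contents in bins.items()}


merged = bin_by_attribute(
    [
        {"attributes": {"batch": "A"}, "content": "1,x\n"},
        {"attributes": {"batch": "B"}, "content": "2,y\n"},
        {"attributes": {"batch": "A"}, "content": "3,z\n"},
    ],
    "batch",
)
print(merged)
```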

While there are scripting processors available in NiFi that can be used to execute your own script against a FlowFile, they are designed to operate against one FlowFile at a time. You could maybe use a PutFile processor to write your CSV files to disk and then use one of the scripting processors to merge them.
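As a rough idea of what such a merge script could look like once the files are on disk, here is a standalone Python sketch. It assumes every CSV file starts with the same header row and simply concatenates the data rows; the function name `merge_csv_files` and the paths passed to it are hypothetical, and how the script gets invoked from NiFi is left open.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_csv_files(inputs: Iterable[Path], output: Path) -> int:
    """Concatenate CSV files, writing the shared header once.

    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with output.open("w", newline="") as out:
        writer = csv.writer(out)
        for path in inputs:
            with path.open(newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    # Assumption of this sketch: all files share one header.
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `merge_csv_files([Path("a.csv"), Path("b.csv")], Path("merged.csv"))` would write one combined file with a single header row.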

Another option is to write your own custom NiFi processor that is specifically designed to merge CSV files.

https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html

If you feel we have successfully answered your question, please mark an answer as accepted.

Thank you,

Matt


@Tinkle Mahendru

Take a look at the example NiFi workflow template in the link below (SplitRouteMerge.xml):

https://cwiki.apache.org/confluence/download/attachments/57904847/SplitRouteMerge.xml?version=1&modi...

This flow demonstrates splitting one or more files on line boundaries, routing the splits based on a regex in the content, and then merging the files together for storage somewhere. It will give you a good idea of how to process and merge your files.

Explorer

Thanks a lot for the responses, @Eyad Garelnabi, @Matt and Wynner.

I have one quick question: how would I get only the CSV files out of thousands of other files if I am not using the GetFile file filter? Is there any way to do this in the MergeContent processor, maybe by specifying something in the Correlation Attribute Name? I am not sure what exactly the Correlation Attribute property does. Also, is there any way I can use csvjoin (from csvkit) in NiFi for the same requirement?

Thanks in advance !!!

Any help is much appreciated.


Please take a look at @Matt Clarke's response above on how to extract only the CSV files. It is the most straightforward way.