NiFi - how to remove efficiently a line from a big flow file?

Contributor

The task may seem easy, but in fact it isn't...

I have a big flow file (>1GB) from which I need to remove, let's say, the first line (header) before further processing.

So far I have made 3 attempts, but none of them works as expected:

1) ReplaceText

Works for small files, but the problematic file is too big to load into memory (I get a memory out of bounds exception).

2) SplitText

I was trying to use SplitText, but due to this issue I cannot skip the header line in this processor at the moment.

In other words - this processor fails whenever Header Line Count > 0.

3) ExecuteProcess

I can imagine running a Linux command (e.g. tail or sed) to do this job, but it requires saving the flow file to disk, which might also be costly.

Do you have any ideas if this can be done more efficiently?

Thanks, Michal

5 REPLIES

Master Guru

I suggest you use SplitText a few times to avoid loading all the flow files into memory. Go from 1 million --> 100,000 --> 10,000 --> 1,000 --> 1. You can also cut those steps down, for example 1 million --> 10,000 --> 1,000 --> 1. From there, use routeontext to route the header to one point and the rest of the lines to another.
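
The staged splitting described above can be pictured with a small sketch. This is plain Python rather than NiFi configuration, and the names split_lines and cascade are made up for illustration. Each stage stands in for one SplitText processor with a given Line Split Count, under the assumption that it hands chunks on one at a time.

```python
from itertools import islice


def split_lines(lines, max_lines):
    """Yield lists of at most max_lines lines (one hypothetical split stage)."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, max_lines))
        if not chunk:
            return
        yield chunk


def cascade(lines, sizes):
    """Apply the split stages in order, re-splitting each chunk of the previous stage."""
    if not sizes:
        yield list(lines)
        return
    for chunk in split_lines(lines, sizes[0]):
        yield from cascade(chunk, sizes[1:])


# Tiny stand-in for the big file: one header line plus 25 data rows.
rows = ["header"] + ["row%d" % i for i in range(25)]

# Three stages with small sizes instead of 10,000 -> 1,000 -> 1.
singles = list(cascade(rows, (10, 5, 1)))

# After the last stage every chunk holds one line; the first one is the header.
header = singles[0][0]
body = [chunk[0] for chunk in singles[1:]]
print(header, len(body))
```

In the sketch, only one chunk per stage is being subdivided at any moment, which is the point of going down in steps rather than splitting straight to single lines.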

Super Collaborator

Can you please give us some context? Where are you getting this file from? SFTP?

Whether you explicitly do this or not, the flow file received in NiFi will always be saved to disk. If this can be done easily with ExecuteProcess, it is a good option and it really will not impact your flow's performance. NiFi is very efficient at file I/O.

Contributor

@Karthik Narayanan The file is local on the NiFi box. I know that ExecuteProcess would be an option, but I'd like to avoid saving the file to disk. How about ExecuteStreamCommand with something like sed '1d' simple.tsv > noHeader.tsv, but in the form sed '1d' myFlow > myFlow? How can I reference myFlow in this command?
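
For what it's worth, ExecuteStreamCommand pipes the flow file content to the command's standard input and uses its standard output as the new content, so the command does not need a file name for the flow file. Below is a minimal sketch of a filter that drops the first line while streaming. The function name drop_first_line and the chunk size are assumptions for illustration, and the in-memory streams only stand in for stdin and stdout.

```python
import io
import shutil


def drop_first_line(src, dst, chunk_size=1024 * 1024):
    """Copy the binary stream src to dst, skipping the first line."""
    # Read and discard everything up to and including the first newline.
    src.readline()
    # Copy the remainder in fixed-size chunks, so the whole content
    # is never held in memory at once.
    shutil.copyfileobj(src, dst, chunk_size)


# In-memory demonstration with a tiny tab-separated sample.
demo_in = io.BytesIO(b"id\tname\n1\tfoo\n2\tbar\n")
demo_out = io.BytesIO()
drop_first_line(demo_in, demo_out)
print(demo_out.getvalue())  # b'1\tfoo\n2\tbar\n'
```

Run as a command, the same function would be called as drop_first_line(sys.stdin.buffer, sys.stdout.buffer) after importing sys, which does the same job as sed '1d' reading from standard input.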

Contributor

FWIW, the issue https://issues.apache.org/jira/browse/NIFI-3255 has been resolved, and the fix is available in the current master.

New Contributor

Use the ExecuteStreamCommand processor with a sed command, something like this. This worked for me.

[Screenshot: 107223-1552662219840.png]