
Validate input file row count in NiFi?

New Contributor

I have a data ingestion scenario that I am trying to implement in Apache NiFi. The input is a delimited file with a header and a footer. A trivial example is:

itno|col1|col2|col3
20123456|10|50|10
20434561|20|0|20
20342345|10|10|20
F|3

The header contains column names and the footer contains F followed by the row count.

I want to create a Process Group to do the following with the entire delimited file as input FlowFile:

  • discard the header
  • split the rows into separate FlowFiles
  • validate that the row count is correct and reject the entire file if it is wrong

This means that I don't want to emit rows for further processing until all rows are read and the count validated.

Is there an efficient way to do this?

1 ACCEPTED SOLUTION


Hi @Michael Strasser,

I am not sure the approach you are proposing is the best one: if you have consecutive flow files entering your process group and each file is split into individual rows, it may be difficult to keep things straight, since you will have flow files representing rows from different files. It is certainly doable, though.

However, what I would recommend at first glance is to use the ExecuteScript processor and write something in Groovy (for example). This way you don't need to split the file: you keep the entire file and can easily reject the whole file if the footer value does not match the row count.
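
As a quick illustration, here is a minimal Groovy sketch of what the body of ExecuteScript could look like. It assumes, as in your example, that the first line is the header and that the footer is a line starting with "F|" followed by the expected row count; the parsing and routing would of course need to be adapted to your real format.

import org.apache.nifi.processor.io.InputStreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (flowFile == null) return

long dataRows = 0          // data rows actually found
long expectedRows = -1     // row count declared in the footer
int lineNo = 0

// Read the whole flow file without splitting it
session.read(flowFile, { inputStream ->
    inputStream.eachLine(StandardCharsets.UTF_8.name()) { line ->
        lineNo++
        if (lineNo == 1) {
            // header with the column names: ignore
        } else if (line.startsWith('F|')) {
            // footer "F|<row count>"
            expectedRows = line.tokenize('|')[1].toLong()
        } else if (line.trim()) {
            dataRows++
        }
    }
} as InputStreamCallback)

if (expectedRows == dataRows) {
    session.transfer(flowFile, REL_SUCCESS)
} else {
    log.warn("Row count mismatch: footer says ${expectedRows} but file has ${dataRows} rows")
    session.transfer(flowFile, REL_FAILURE)
}

On the success path you could then hand the whole file to SplitText (with Header Line Count set to 1) to produce the individual rows, and drop the footer line afterwards with something like RouteOnContent.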

You will find a useful post on how to use this processor here: http://funnifi.blogspot.fr/2016/02/executescript-processor-hello-world.html

Let me know if you need additional details.


3 REPLIES


New Contributor

Hi @Pierre Villard,

Thanks for your quick response and the link to Matt Burgess's blog. I think that ExecuteScript will be a good way to go in this case. I have also been told that we might not reject the whole file if the row count is incorrect.

Regarding the linkage between whole files and split flow files, downstream of this process the data will be enriched in some way with information about its source, perhaps via the segment.original.filename attribute written by SplitText.
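
For example, a downstream ExecuteScript step could pick that attribute up along these lines (source.file is just a hypothetical name for whatever attribute our enrichment ends up using), or it could simply be referenced as ${segment.original.filename} in Expression Language on a processor such as UpdateAttribute:

def flowFile = session.get()
if (flowFile == null) return

// segment.original.filename is written by SplitText on each split
def sourceName = flowFile.getAttribute('segment.original.filename')
flowFile = session.putAttribute(flowFile, 'source.file', sourceName ?: 'unknown')
session.transfer(flowFile, REL_SUCCESS)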


Yes, you are absolutely right: you have the original source of a split within the attributes.

In fact, I'd say it will be complex to handle a file that you split into rows if you want to perform file-level logic over those rows. In this case, you want to split the rows, yet you need all of them to confirm the value carried by a single row (the footer). It is possible, depending on your use case, with something like the distributed cache, but it is easier with ExecuteScript (even more so if you want to add logic to decide whether or not to reject the whole flow file).

In short, it greatly depends on all the actions you want to perform along the flow and how you want to optimize I/O to meet performance expectations.