Created 05-17-2016 05:20 AM
I have a data ingestion scenario that I am trying to implement in Apache NiFi. The input is a delimited file with a header and a footer. A trivial example is:
itno|col1|col2|col3
20123456|10|50|10
20434561|20|0|20
20342345|10|10|20
F|3
The header contains column names and the footer contains F followed by the row count.
I want to create a Process Group to do the following with the entire delimited file as input FlowFile:
This means that I don't want to emit rows for further processing until all rows are read and the count validated.
Is there an efficient way to do this?
Created 05-17-2016 06:38 AM
I am not sure the approach you are proposing is the best one: if consecutive flow files enter your process group and each file is split into individual rows, you may have difficulty keeping things straight, since you will have flow files representing rows from different source files. It is certainly doable, though.
However, what I would recommend, at first glance, is to use the ExecuteScript processor and write the logic in Groovy (for example). This way you don't need to split the file: you keep the entire file as one flow file, and you can easily reject the whole file if the value in the footer does not equal the actual row count.
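As a rough illustration of the whole-file check described above, here is a minimal sketch in plain Python (ExecuteScript can also run Python via Jython, but the NiFi session and flow file wiring is deliberately left out). The function name validate_row_count and the choice to raise ValueError on a mismatch are assumptions made for this example, not something prescribed by NiFi.

```python
def validate_row_count(content, delimiter="|"):
    """Return (header, rows) if the footer count matches the data rows.

    Assumes the layout from the question: first line is the header,
    last line is 'F|<count>', everything in between is a data row.
    Raises ValueError when the footer is malformed or the count is off.
    """
    lines = [line for line in content.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a header and a footer")

    header = lines[0].split(delimiter)
    footer = lines[-1].split(delimiter)
    if len(footer) != 2 or footer[0] != "F":
        raise ValueError("malformed footer: %r" % lines[-1])
    try:
        expected = int(footer[1])
    except ValueError:
        raise ValueError("footer count is not a number: %r" % footer[1])

    rows = [line.split(delimiter) for line in lines[1:-1]]
    if expected != len(rows):
        raise ValueError("footer says %d rows, found %d" % (expected, len(rows)))
    return header, rows


sample = (
    "itno|col1|col2|col3\n"
    "20123456|10|50|10\n"
    "20434561|20|0|20\n"
    "20342345|10|10|20\n"
    "F|3\n"
)
header, rows = validate_row_count(sample)
```

Inside a script processor, the success path would typically route the flow file onward and the ValueError path would route it to a failure relationship; that part depends on your flow and is not shown here.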
You will find a useful post on how to use this processor here: http://funnifi.blogspot.fr/2016/02/executescript-processor-hello-world.html
Let me know if you need additional details.
Created 05-17-2016 06:53 AM
Thanks for your quick response and the link to Matt Burgess's blog. I think that ExecuteScript will be a good way to go in this case. I have also been told that we might not reject the whole file if the row count is incorrect.
Regarding the linkage between whole files and split flow files: downstream of this process, the data will be enriched in some way with information about its source, perhaps via the segment.original.filename attribute that SplitText writes on each split.
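To make the enrichment idea concrete, here is a small sketch in plain Python, assuming the rows have already been parsed into lists of fields. The function name tag_rows_with_source, the output key source_file, and the file name items.txt are all hypothetical; in a real flow the file name would come from the flow file's attributes.

```python
def tag_rows_with_source(header, rows, source_filename):
    """Turn each row into a dict keyed by column name and record its source.

    'source_file' is a hypothetical field name chosen for this sketch.
    """
    tagged = []
    for row in rows:
        record = dict(zip(header, row))
        record["source_file"] = source_filename
        tagged.append(record)
    return tagged


header = ["itno", "col1", "col2", "col3"]
rows = [
    ["20123456", "10", "50", "10"],
    ["20434561", "20", "0", "20"],
]
tagged = tag_rows_with_source(header, rows, "items.txt")
```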
Created 05-17-2016 07:09 AM
Yes, you are absolutely right: the original source of each split is available in its attributes.
In fact, I'd say it will be complex to handle a file that you split into rows if you then want to apply file-level logic to those rows. In this case you would split the file into rows, and then need all of the rows back together to confirm a value carried by one single row (the footer). It is possible, depending on your use case, with something like the distributed cache, but it is easier with ExecuteScript (even more so if you want to add logic that decides whether or not to reject the whole flow file).
In short, it greatly depends on all the actions you want to perform along the flow and on how you want to optimize I/O to meet your performance expectations.
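To show the kind of bookkeeping the split-first approach requires, here is a sketch in plain Python that models each single-line split as a dict with 'attributes' and 'content' keys. This is only a stand-in for real flow files, not the NiFi API. It assumes the header line has already been dropped, that the footer arrives as just another split, and that every split carries the segment.original.filename attribute. The function name check_split_rows is hypothetical.

```python
from collections import defaultdict

SOURCE_ATTR = "segment.original.filename"


def check_split_rows(flow_files, delimiter="|"):
    """Group single-line splits by source file and verify each footer count.

    Returns {filename: bool}. False means the footer count did not match
    the number of data rows seen, or no footer was seen for that file.
    """
    data_rows = defaultdict(int)  # data rows counted per source file
    declared = {}                 # row count declared by each footer
    for ff in flow_files:
        source = ff["attributes"][SOURCE_ATTR]
        fields = ff["content"].strip().split(delimiter)
        if len(fields) == 2 and fields[0] == "F":
            declared[source] = int(fields[1])
        else:
            data_rows[source] += 1
    sources = set(data_rows) | set(declared)
    return {s: declared.get(s) == data_rows.get(s, 0) for s in sources}


def split(source, line):
    """Build one modelled split for the example below."""
    return {"attributes": {SOURCE_ATTR: source}, "content": line}


# Rows from two source files arrive interleaved, as they might in a busy flow.
incoming = [
    split("a.txt", "20123456|10|50|10"),
    split("b.txt", "20999999|1|2|3"),
    split("a.txt", "20434561|20|0|20"),
    split("b.txt", "F|3"),
    split("a.txt", "F|2"),
]
result = check_split_rows(incoming)
```

Note that no row can be released downstream until every split of its file has been seen, which is exactly the state that the whole-file approach avoids having to keep.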