Created on 09-01-2021 08:50 AM - last edited on 09-01-2021 10:14 PM by VidyaSargur
Hi everyone, I use ListSFTP and FetchSFTP to collect files that contain lines like the ones below.
I want to filter the files based on the third field.
I want to collect only the files whose lines have the year 1995 in that field.
|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0
|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0
Created 09-01-2021 11:47 AM
@Justee
ListSFTP only generates a FlowFile with attributes/metadata about the file on the SFTP server. It does not look at the content itself, so your filtering options are limited to what is in those generated attributes.
The FetchSFTP processor uses these attributes/metadata to retrieve the actual content and add it to the existing FlowFile produced by the ListSFTP processor.
So unfortunately you would need to fetch all the files and then keep only those that contain the desired value in the third field. You may want to look at the RouteText [1] processor for handling these files after their content is fetched.
[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.14.0/org.apach...
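To make the "fetch everything, then keep what matches" idea concrete, here is a minimal sketch in plain Java. It is not NiFi code: the class and method names (ThirdFieldFilter, thirdFieldEquals, containsYear) are made up for illustration, and it assumes every line starts with a pipe, as in the sample data.

```java
import java.util.List;

class ThirdFieldFilter {
    // Lines start with '|', so split() yields an empty first element
    // and the third field sits at index 3.
    static boolean thirdFieldEquals(String line, String year) {
        String[] fields = line.split("\\|", -1);
        return fields.length > 3 && fields[3].equals(year);
    }

    // True when at least one line of the fetched content has the year in field three.
    static boolean containsYear(List<String> lines, String year) {
        return lines.stream().anyMatch(l -> thirdFieldEquals(l, year));
    }

    public static void main(String[] args) {
        List<String> content = List.of(
            "|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0",
            "|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0");
        System.out.println(containsYear(content, "1995"));
    }
}
```

The check can only run once the content is available, which is why it has to happen after FetchSFTP rather than in ListSFTP.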
If you found this response addressed your query, please take a moment to log in and click on "Accept as Solution" below this post.
Thank you,
Matt
Created 09-02-2021 02:28 AM
Hi @MattWho
What would the regular expression be if I have to put the selection condition on field three of the data?
That is the field I put in bold. I want to select only the lines with 1995.
|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0
|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0
Created 09-02-2021 09:18 AM
The first thing I would do is add a new attribute to my FlowFile that specifies the year I'd be searching for in the lines of that FlowFile's content (this step is optional).
For example, add an attribute "year" with a value of "1995".
In the RouteText processor, I'd then be able to use NiFi Expression Language (NEL) in my Java regular expression, which this processor supports:
^\|(.*?)\|(.*?)\|${year}\|(.*?)$
The above Java regular expression matches lines that begin with a pipe "|", followed by a non-greedy wildcard match of zero or more characters up to the next pipe "|" (field one), then the same again for field two. For field three I used NEL, which resolves to "1995", and finally I match the remainder of the line with a wildcard.
Of course you could simply put "1995" in place of "${year}" in the above regex.
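If you want to try the expression outside NiFi first, a small plain-Java sketch like the one below may help. The class name YearRegexDemo and the "shifted" test line are made up, and the year is simply concatenated into the pattern to stand in for what "${year}" resolves to. One thing it shows: because ".*?" is allowed to run across pipes when the regex engine backtracks, the expression can also match a line where "1995" sits in a later field. The strict() variant is my own assumption of a tighter pattern, not part of the answer above.

```java
import java.util.regex.Pattern;

class YearRegexDemo {
    // The regex from the answer, with the year substituted for ${year}.
    // Assumes the year is plain digits, so no regex escaping is needed.
    static Pattern fromAnswer(String year) {
        return Pattern.compile("^\\|(.*?)\\|(.*?)\\|" + year + "\\|(.*?)$");
    }

    // Hypothetical stricter variant: [^|]* cannot cross a pipe,
    // so the year has to be in field three.
    static Pattern strict(String year) {
        return Pattern.compile("^\\|[^|]*\\|[^|]*\\|" + year + "\\|.*$");
    }

    public static void main(String[] args) {
        String hit = "|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0";
        String miss = "|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0";
        String shifted = "|1|2|3|1995|0"; // made-up line with 1995 in field four

        for (String line : new String[] {hit, miss, shifted}) {
            System.out.println(fromAnswer("1995").matcher(line).matches()
                + " / " + strict("1995").matcher(line).matches() + "  " + line);
        }
    }
}
```

With data like the sample, where no other field holds a four-digit year, both patterns should select the same lines.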
The RouteText processor component configuration would look like this:
The result would be two FlowFiles. One FlowFile would be routed to the relationship "1995" (named after the property used) and its content would contain only the lines with "1995". The second FlowFile would be routed to the "unmatched" relationship and would contain all the non-matching lines (you may choose to auto-terminate this relationship if you don't care about these lines).
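As a rough illustration of that split, the plain-Java sketch below partitions the five sample lines into a matching group and a non-matching group. It only imitates the outcome and does not use any NiFi API; the class name RouteTextSketch and the route method are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RouteTextSketch {
    // Key true holds the lines that would go to the "1995" relationship,
    // key false the lines that would go to "unmatched".
    static Map<Boolean, List<String>> route(List<String> lines, String year) {
        Pattern p = Pattern.compile("^\\|(.*?)\\|(.*?)\\|" + year + "\\|(.*?)$");
        return lines.stream()
            .collect(Collectors.partitioningBy(l -> p.matcher(l).matches()));
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0",
            "|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0",
            "|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0",
            "|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0",
            "|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0");

        Map<Boolean, List<String>> routed = route(sample, "1995");
        System.out.println("1995:");
        routed.get(true).forEach(System.out::println);
        System.out.println("unmatched:");
        routed.get(false).forEach(System.out::println);
    }
}
```

For the sample data, only the 08-Mar-1996 line (third field 1996) ends up in the non-matching group.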
If you found these responses addressed your query, please take a moment to log in and click on "Accept as Solution" below each response that helped you.
Thank you,
Matt