How to use ListSFTP and FetchSFTP to filter lines of files
Labels: Apache NiFi
Created on 09-01-2021 08:50 AM, last edited on 09-01-2021 10:14 PM by VidyaSargur
Hi everyone, I use ListSFTP and FetchSFTP to collect files that contain lines like the ones below.
I want to filter them based on the third field.
I want to keep only the lines that have the year 1995 in that field.
|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0
|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0
Created ‎09-01-2021 11:47 AM
@Justee
ListSFTP only generates a FlowFile with attributes/metadata about each file on the SFTP server. It does not look at the content itself, so your filtering options are limited to what is in those generated attributes.
The FetchSFTP processor uses these attributes/metadata to retrieve the actual content and add it to the existing FlowFile produced by the ListSFTP processor.
So unfortunately you would need to fetch all the files and then keep only those that contain the desired value in the third field. You may want to look at the RouteText [1] processor for handling these files after the content is fetched.
[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.14.0/org.apach...
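To make the list-then-fetch point concrete, here is a minimal Python sketch. It is not NiFi code: the in-memory `REMOTE_FILES` dict, the `FlowFile` class, and all helper names are hypothetical stand-ins, and the file contents are shortened versions of the sample lines from the question. It only illustrates that a listing step sees metadata, so a filter on field three can run only after the content has been fetched.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an SFTP server: filename -> file content.
REMOTE_FILES = {
    "a.txt": "|226789|23-Feb-1996|1995|0|1\n|222507|01-Jan-1995|1995|0|0\n",
    "b.txt": "|226780|08-Mar-1996|1996|0|0\n",
}


@dataclass
class FlowFile:
    attributes: dict = field(default_factory=dict)
    content: str = ""


def list_remote(remote):
    """Listing step: one FlowFile per file, attributes only, no content."""
    return [FlowFile(attributes={"filename": name}) for name in remote]


def fetch(flowfile, remote):
    """Fetch step: use the filename attribute to pull in the content."""
    flowfile.content = remote[flowfile.attributes["filename"]]
    return flowfile


def has_year_in_field_three(content, year):
    """True if any line has `year` as its third pipe-delimited field."""
    for line in content.splitlines():
        fields = line.split("|")  # the leading "|" makes fields[0] empty
        if len(fields) > 3 and fields[3] == year:
            return True
    return False


listed = list_remote(REMOTE_FILES)
fetched = [fetch(ff, REMOTE_FILES) for ff in listed]
kept = [ff.attributes["filename"] for ff in fetched
        if has_year_in_field_three(ff.content, "1995")]
print(kept)
```

Right after the listing step there is nothing to inspect except the filename, which is why the content check sits after `fetch`.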
If you found this response addressed your query, please take a moment to login and click on "Accept as Solution" below this post.
Thank you,
Matt
Created ‎09-02-2021 02:28 AM
Hi @MattWho
What would the regular expression be if I have to put the selection condition on field three of the data, the year that follows the date?
I want to select only the lines with 1995 in that field.
|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0
|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0
|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0
Created ‎09-02-2021 09:18 AM
The first thing I would do (this is optional) is add a new attribute to my FlowFile that specifies the year I'd be searching for in the lines of that FlowFile's content.
For example, add an attribute "year" with a value of "1995".
In the RouteText processor, I'd then be able to use NiFi Expression Language (NEL) in my Java regular expression, which this processor component supports:
^\|(.*?)\|(.*?)\|${year}\|(.*?)$
The above Java regular expression matches lines that begin with a pipe "|" followed by a non-greedy wildcard match of zero or more characters up to the next pipe "|", then the same again for field two. For field three I used NEL, which resolves to "1995", and finally I match the remainder of the line with a wildcard. Note that "." can also match a pipe, so the wildcard may span more than one field when the regex engine backtracks; use [^|]* in place of (.*?) if the year must be matched strictly in field three.
Of course, you could simply put "1995" in place of "${year}" in the above regex.
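The regex can be tried outside NiFi with a short Python sketch. This is only an approximation: Python's `re` module stands in for Java's regex engine (the two agree on this pattern), `YEAR` stands in for the resolved `${year}` attribute, and the `strict` pattern and the `tricky` line are illustrative additions that are not from the post. The sketch also shows one caveat: `.` can match a pipe, so `(.*?)` may stretch over several fields, whereas `[^|]*` cannot.

```python
import re

YEAR = "1995"  # stands in for the resolved ${year} attribute

# The regex from the post, with ${year} already substituted.
original = re.compile(r"^\|(.*?)\|(.*?)\|" + re.escape(YEAR) + r"\|(.*?)$")

# Stricter variant (an assumption, not from the post): [^|]* cannot cross a pipe.
strict = re.compile(r"^\|[^|]*\|[^|]*\|" + re.escape(YEAR) + r"\|.*$")

# Two sample lines from the question.
sample_1995 = "|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0"
sample_1996 = "|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
# Hypothetical line: 1995 appears in field four, not field three.
tricky = "|1|01-Jan-1996|1996|1995|0"


def matches(pattern, line):
    """Return True if the pattern matches the line."""
    return pattern.match(line) is not None


for label, line in [("1995 line", sample_1995),
                    ("1996 line", sample_1996),
                    ("tricky line", tricky)]:
    print(label, "original:", matches(original, line),
          "strict:", matches(strict, line))
```

On the sample data both patterns agree, because the fields after the year only hold 0 or 1. They differ only on input like the `tricky` line.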
In the RouteText processor configuration, add a property named "1995" with the regex above as its value, and choose a regular-expression matching strategy.
The result would be two FlowFiles. One FlowFile would be routed to the relationship "1995" (based on the property name used) and would contain only the lines with "1995". The second FlowFile would be routed to the "unmatched" relationship and would contain all the non-matching lines (you may choose to auto-terminate this relationship if you don't care about these lines).
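As a rough illustration of that split, the Python sketch below divides the sample content into a matching part and a non-matching part. It is not how RouteText is implemented: the `split_lines` helper is hypothetical, and whole-line matching via `fullmatch` is an assumption about the chosen matching strategy.

```python
import re


def split_lines(content, relationship, regex):
    """Send each line to `relationship` if the regex matches it, else to 'unmatched'."""
    pattern = re.compile(regex)
    routed = {relationship: [], "unmatched": []}
    for line in content.splitlines(keepends=True):
        # Match against the line without its terminator, but keep the terminator in the output.
        hit = pattern.fullmatch(line.rstrip("\r\n")) is not None
        routed[relationship if hit else "unmatched"].append(line)
    return {name: "".join(lines) for name, lines in routed.items()}


# The five sample lines from the question.
content = (
    "|226789|23-Feb-1996|1995|0|1|1|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0\n"
    "|226780|08-Mar-1996|1996|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
    "|222507|01-Jan-1995|1995|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
    "|22308|01-Jan-1995|1995|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
    "|222707|01-Jan-1995|1995|0|1|0|0|0|0|0|0|1|0|0|0|0|0|0|0|1|0|0\n"
)

result = split_lines(content, "1995", r"^\|(.*?)\|(.*?)\|1995\|(.*?)$")
print("1995:")
print(result["1995"], end="")
print("unmatched:")
print(result["unmatched"], end="")
```

With this input, four lines land under "1995" and the single 1996 line lands under "unmatched".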
If you found these responses addressed your query, please take a moment to login and click on "Accept as Solution" below each response that helped you.
Thank you,
Matt
