Created 01-12-2021 04:10 PM
Hi,
I have multiple incoming files, each containing several lines of text, and I need certain lines from each file to build a table in later stages of the flow. regex101.com shows that the regex I wrote works, but when I put the same regex into the NiFi RouteText processor, it fails to extract anything. I can't tell whether the regex is wrong or the flow is wrong. Can someone please advise?
Below are some lines from the text file. I need the rows numbered 0 through 16 extracted and written to another text file that contains only those rows and keeps the name of the file the data was read from.
wwwwww aa cc
# Name foo Since ddd/www dddd
-- --------- ----- --------------- --- --- ---------
0 abc-lr1-0 35189 20-Dec 03:43:54
1 abc-rr2-g 35209 20-Dec 03:43:54
* 2 abc-rr1-0 35185 20-Dec 03:43:54
*15 abc-lr2-0 34686 20-Dec 03:43:54
16 abc-lr1-0 34631 20-Dec 03:43:54
Below is the regex I wrote:
\d{0,2}\sabc-\w{0,2}\d{0,2}-\d{0,2}\w{0,2}\s\d{0,6}\s\d{0,2}-\w{0,3}\s\d{0,2}\:\d{0,2}\:\d{0,2}
Somehow the RouteText processor doesn't recognise the regex. Any help is appreciated.
Thanks in advance
Created 01-13-2021 06:04 AM
It may be helpful if you shared your RouteText processor configuration.
Correct me if I am wrong, but you are looking to have all lines (minus the header lines) placed in a new FlowFile by themselves.
Using your example data and the regex you provided:
wwwwww aa cc
# Name foo Since ddd/www dddd
-- --------- ----- --------------- --- --- ---------
0 abc-lr1-0 35189 20-Dec 03:43:54
1 abc-rr2-g 35209 20-Dec 03:43:54
* 2 abc-rr1-0 35185 20-Dec 03:43:54
* 15 abc-lr2-0 34686 20-Dec 03:43:54
16 abc-lr1-0 34631 20-Dec 03:43:54
The above would result in a FlowFile with only lines 0, 1, and 16. The header lines plus lines 2 and 15 would route to "unmatched" because the leading "*" does not match your regex.
Result would be a FlowFile with:
0 abc-lr1-0 35189 20-Dec 03:43:54
1 abc-rr2-g 35209 20-Dec 03:43:54
16 abc-lr1-0 34631 20-Dec 03:43:54
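The routing described above can be approximated outside NiFi. The sketch below is not RouteText itself: it assumes the processor trims each line and then requires the whole line to match, and it uses Python's re module in place of NiFi's Java regex engine. The function name route_lines is hypothetical.

```python
import re

# The regex from the original post, unchanged.
ORIGINAL_REGEX = re.compile(
    r"\d{0,2}\sabc-\w{0,2}\d{0,2}-\d{0,2}\w{0,2}\s\d{0,6}\s"
    r"\d{0,2}-\w{0,3}\s\d{0,2}\:\d{0,2}\:\d{0,2}"
)

SAMPLE = """\
wwwwww aa cc
# Name foo Since ddd/www dddd
-- --------- ----- --------------- --- --- ---------
0 abc-lr1-0 35189 20-Dec 03:43:54
1 abc-rr2-g 35209 20-Dec 03:43:54
* 2 abc-rr1-0 35185 20-Dec 03:43:54
* 15 abc-lr2-0 34686 20-Dec 03:43:54
16 abc-lr1-0 34631 20-Dec 03:43:54
"""


def route_lines(text, pattern):
    """Split text into (matched, unmatched) lines, one decision per line."""
    matched, unmatched = [], []
    for raw in text.splitlines():
        # Assumption: whitespace is trimmed first, then the whole line must match.
        target = matched if pattern.fullmatch(raw.strip()) else unmatched
        target.append(raw)
    return matched, unmatched


matched, unmatched = route_lines(SAMPLE, ORIGINAL_REGEX)
print("\n".join(matched))
```

Under these assumptions, only rows 0, 1, and 16 land in matched; the three header lines and the two rows with a leading "*" land in unmatched.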
A couple of things to check if what you are seeing is the entire original FlowFile getting routed to the "original" and "unmatched" relationships:
1. RouteText processor configuration. If I understand your use case correctly, it should be configured like this:
2. I noticed your sample data has leading and trailing whitespace, so make sure the processor is configured to ignore it.
3. Since your intent is to produce a new FlowFile with only the lines matching the regex, make sure you set the Routing Strategy shown above.
4. Make sure the correct Matching Strategy is selected. It should be what I have above.
5. Click on the "+" to add a new dynamic property for your regex. The property name becomes a new relationship on the processor, and your matching lines are routed to it.
6. Since you are evaluating the source FlowFile content line by line, make sure your regex does not have a line return at the end of it.
Correct:
Incorrect (notice the line number 2, which indicates a line return at the end of the regex):
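The effect of a stray line return (item 6) can be illustrated with a small sketch. It assumes each line reaches the regex without its line terminator, which is consistent with the advice above but is an assumption about RouteText internals, and it uses Python's re module rather than NiFi's Java regex engine.

```python
import re

line = "0 abc-lr1-0 35189 20-Dec 03:43:54"

clean = (
    r"\d{0,2}\sabc-\w{0,2}\d{0,2}-\d{0,2}\w{0,2}\s\d{0,6}\s"
    r"\d{0,2}-\w{0,3}\s\d{0,2}\:\d{0,2}\:\d{0,2}"
)
# Same regex with a line return accidentally left at the end of the property value.
with_line_return = clean + "\n"

clean_hit = re.fullmatch(clean, line) is not None
line_return_hit = re.fullmatch(with_line_return, line) is not None

print(clean_hit)        # True
print(line_return_hit)  # False: the line has no trailing newline to match
```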
When I ran a little test flow using your sample data and regex, I got the desired results:
The "lines" relationship has one new FlowFile whose content is only the 3 matching lines.
The "unmatched" relationship contains a new FlowFile whose content is all the unmatched lines.
The "original" relationship contains the original FlowFile that was processed by this processor.
If you don't care about the original or unmatched FlowFiles, you can simply auto-terminate those relationships instead of routing them out of the processor in connections as I did above.
Hope this helps,
Matt
Created 01-13-2021 07:01 AM
Hi Matt,
Many thanks for sharing your views. Please see my replies below to your questions:
- It may be helpful if you shared your RouteText processor configuration. -->
I have exactly the same configuration as in your image, except that instead of "lines" I named the property "matched". That's the only difference. (Sorry I can't share an actual screenshot; the system I did this on is a server without an internet connection.)
- Correct me if I am wrong, but you are looking to have all lines (minus the header lines) placed in a new FlowFile by themselves. --> Yes
- The above would result in a FlowFile with only lines 0,1, and 16. The header plus lines 2 and 15 would route to unmatched because of the leading "*" which does not match your regex. --> Yes, that's correct. It would omit lines 2 and 15, but I need them as well. Note that the line
* 2 abc-rr1-0 35185 20-Dec 03:43:54
has a space after the "*" and then the digit 2.
But for the line:
*15 abc-lr2-0 34686 20-Dec 03:43:54
there is no space after the "*"; the number 15 follows immediately.
To include both cases, I reworked the regex as shown below:
(\s|[ \t]|\*)\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}
Please see the link below for the regex I made (not sure if you'll like it 😞 but I tried):
https://regex101.com/r/pdo6Ca/1
2. I noticed your sample data has leading and trailing whitespace so make sure processor is configured to ignore those. --> I reworked the regex (link above) to include/ignore those (https://regex101.com/r/pdo6Ca/1). Please share your views if this regex is wrong.
3. Since your intent is to produce a new FlowFile with only the lines matching the regex, make sure you set the above Routing Strategy. --> Yes, it is set exactly as in your image, but instead of creating one FlowFile with all the matched lines, it creates multiple FlowFiles with the same name, each containing one matched line, and sends them to the "unmatched" relationship.
4. Make sure the correct matching strategy is selected. Should be what I have above. --> Yes, it is set exactly as stated in your image
5. Click on the "+" to add a new dynamic property for your regex, The property name becomes a new relationship on the processor where your matching lines will be routed. --> Yes, correct
6. Since you are evaluating the source FlowFile content line-by-line, make sure your regex does not have a line return at the end of it. --> I double-checked, and there is no extra line return.
After all this, the issue I am having is that when the FlowFile is sent to this processor, nothing is routed to the "lines" relationship (or "matched", as I named it in my processor).
Instead, two things happen:
- It creates multiple FlowFiles with the same name (e.g. abc.txt) on the "unmatched" relationship. So if the regex matches 7 lines in the input file coming from the previous processor, it creates 7 abc.txt FlowFiles routed to "unmatched", each containing 1 matched line (a different line in each, I mean).
- It creates 1 FlowFile sent to the "original" relationship.
It would be great if you could share your advice on the first point in particular: why one input file produces several FlowFiles with the same name (e.g. 7 abc.txt FlowFiles for 7 matching lines), each containing 1 line and all routed to "unmatched".
Thank you again for your time and advice.
Regards,
Created 01-13-2021 08:45 AM
RouteText does not modify the content of the lines. It only routes lines into the new FlowFiles it produces; the content of those lines remains unchanged. The RouteText processor also does nothing with capture groups, so the entire regex is evaluated against each line.
I took the entire content from your "https://regex101.com/r/pdo6Ca/1" and the entire new regex from the same link and ran this flow.
I produced only one source FlowFile, as you can see from the processor's in = 1.
You can see it routed "out" three FlowFiles.
One went to the connection with the "matched" relationship. It contains only one line, since only one line matched the entire regex. (regex101 is taking into account your capture groups.)
If you change:
And run the same test again, you will see a few more lines match (those with the additional "yes yes ..." in the lines).
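To see how the same regex can match one line or several depending on how it is applied, it can be tried outside NiFi. The sketch below is only an approximation: it assumes that "Matches Regular Expression" behaves like a whole-line match on the trimmed line and that "Contains Regular Expression" behaves like a substring search, and it uses Python's re module in place of NiFi's Java regex engine. The four sample lines come from the test data used in this flow.

```python
import re

# The reworked regex from the earlier reply, unchanged.
REWORKED_REGEX = re.compile(
    r"(\s|[ \t]|\*)\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s"
    r"\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}"
)

lines = [
    "0 hub-lr1-0 35189 20-Dec 03:43:54",
    "* 3 rr1-2627- 34748 20-Dec 03:43:54 yes yes 0",
    "*17 hub-rr3-g 35217 20-Dec 03:43:56",
    "*17 hub-rr3-g 35217 20-Dec 03:43:54 yes yes 0",
]

# Whole-line match: the regex must consume the entire trimmed line.
whole_line = [l for l in lines if REWORKED_REGEX.fullmatch(l.strip())]

# Substring search: the regex may match anywhere inside the trimmed line.
contains = [l for l in lines if REWORKED_REGEX.search(l.strip())]

print(whole_line)
print(contains)
```

With a whole-line match only the third line qualifies, because the regex demands a leading whitespace character or "*" and allows nothing after the time. With a substring search, the rows that carry the extra "yes yes 0" text also qualify, while the plain row starting with a digit still does not.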
I attached the template I used, which you can import into your NiFi.
The Community does not support .xml files, so I changed the extension to .txt. You will need to change the extension back to .xml before you can import the template into your NiFi.
Hope this helps.
Created 01-13-2021 08:46 AM
The Community blocks my attachment (sorry).
Created 01-13-2021 11:03 AM
@Fierymech
Here is the raw XML for the template:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<template encoding-version="1.3">
<description></description>
<groupId>658f3a7f-0171-1000-0000-00007706d23d</groupId>
<name>RouteText-Example</name>
<snippet>
<connections>
<id>44a4a771-a46f-3b74-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
<backPressureObjectThreshold>10000</backPressureObjectThreshold>
<bends>
<x>0.0</x>
<y>408.0</y>
</bends>
<destination>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>e81dcc20-cb6f-3466-0000-000000000000</id>
<type>PROCESSOR</type>
</destination>
<flowFileExpiration>0 sec</flowFileExpiration>
<labelIndex>1</labelIndex>
<loadBalanceCompression>DO_NOT_COMPRESS</loadBalanceCompression>
<loadBalancePartitionAttribute></loadBalancePartitionAttribute>
<loadBalanceStatus>LOAD_BALANCE_NOT_CONFIGURED</loadBalanceStatus>
<loadBalanceStrategy>DO_NOT_LOAD_BALANCE</loadBalanceStrategy>
<name></name>
<selectedRelationships>original</selectedRelationships>
<source>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>696c796c-aa86-3e71-0000-000000000000</id>
<type>PROCESSOR</type>
</source>
<zIndex>0</zIndex>
</connections>
<connections>
<id>7bcc4c55-39cf-3b4c-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
<backPressureObjectThreshold>10000</backPressureObjectThreshold>
<destination>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>696c796c-aa86-3e71-0000-000000000000</id>
<type>PROCESSOR</type>
</destination>
<flowFileExpiration>0 sec</flowFileExpiration>
<labelIndex>1</labelIndex>
<loadBalanceCompression>DO_NOT_COMPRESS</loadBalanceCompression>
<loadBalancePartitionAttribute></loadBalancePartitionAttribute>
<loadBalanceStatus>LOAD_BALANCE_NOT_CONFIGURED</loadBalanceStatus>
<loadBalanceStrategy>DO_NOT_LOAD_BALANCE</loadBalanceStrategy>
<name></name>
<selectedRelationships>success</selectedRelationships>
<source>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>7386cd31-1cfd-3e04-0000-000000000000</id>
<type>PROCESSOR</type>
</source>
<zIndex>0</zIndex>
</connections>
<connections>
<id>dde64489-3d09-3e4f-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
<backPressureObjectThreshold>10000</backPressureObjectThreshold>
<bends>
<x>480.0</x>
<y>408.0</y>
</bends>
<destination>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>e81dcc20-cb6f-3466-0000-000000000000</id>
<type>PROCESSOR</type>
</destination>
<flowFileExpiration>0 sec</flowFileExpiration>
<labelIndex>1</labelIndex>
<loadBalanceCompression>DO_NOT_COMPRESS</loadBalanceCompression>
<loadBalancePartitionAttribute></loadBalancePartitionAttribute>
<loadBalanceStatus>LOAD_BALANCE_NOT_CONFIGURED</loadBalanceStatus>
<loadBalanceStrategy>DO_NOT_LOAD_BALANCE</loadBalanceStrategy>
<name></name>
<selectedRelationships>unmatched</selectedRelationships>
<source>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>696c796c-aa86-3e71-0000-000000000000</id>
<type>PROCESSOR</type>
</source>
<zIndex>0</zIndex>
</connections>
<connections>
<id>f2b4b8ba-09e7-3125-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<backPressureDataSizeThreshold>1 GB</backPressureDataSizeThreshold>
<backPressureObjectThreshold>10000</backPressureObjectThreshold>
<bends>
<x>240.0</x>
<y>408.0</y>
</bends>
<destination>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>e81dcc20-cb6f-3466-0000-000000000000</id>
<type>PROCESSOR</type>
</destination>
<flowFileExpiration>0 sec</flowFileExpiration>
<labelIndex>1</labelIndex>
<loadBalanceCompression>DO_NOT_COMPRESS</loadBalanceCompression>
<loadBalancePartitionAttribute></loadBalancePartitionAttribute>
<loadBalanceStatus>LOAD_BALANCE_NOT_CONFIGURED</loadBalanceStatus>
<loadBalanceStrategy>DO_NOT_LOAD_BALANCE</loadBalanceStrategy>
<name></name>
<selectedRelationships>matched</selectedRelationships>
<source>
<groupId>2edcec51-fc4e-38cf-0000-000000000000</groupId>
<id>696c796c-aa86-3e71-0000-000000000000</id>
<type>PROCESSOR</type>
</source>
<zIndex>0</zIndex>
</connections>
<processors>
<id>19e6484f-37ca-3639-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<position>
<x>736.0</x>
<y>0.0</y>
</position>
<bundle>
<artifact>nifi-standard-nar</artifact>
<group>org.apache.nifi</group>
<version>1.12.1.3.5.2.0-99</version>
</bundle>
<config>
<bulletinLevel>WARN</bulletinLevel>
<comments></comments>
<concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
<descriptors>
<entry>
<key>Routing Strategy</key>
<value>
<name>Routing Strategy</name>
</value>
</entry>
<entry>
<key>Matching Strategy</key>
<value>
<name>Matching Strategy</name>
</value>
</entry>
<entry>
<key>Character Set</key>
<value>
<name>Character Set</name>
</value>
</entry>
<entry>
<key>Ignore Leading/Trailing Whitespace</key>
<value>
<name>Ignore Leading/Trailing Whitespace</name>
</value>
</entry>
<entry>
<key>Ignore Case</key>
<value>
<name>Ignore Case</name>
</value>
</entry>
<entry>
<key>Grouping Regular Expression</key>
<value>
<name>Grouping Regular Expression</name>
</value>
</entry>
<entry>
<key>lines</key>
<value>
<name>lines</name>
</value>
</entry>
<entry>
<key>matched</key>
<value>
<name>matched</name>
</value>
</entry>
</descriptors>
<executionNode>ALL</executionNode>
<lossTolerant>false</lossTolerant>
<penaltyDuration>30 sec</penaltyDuration>
<properties>
<entry>
<key>Routing Strategy</key>
<value>Route to each matching Property Name</value>
</entry>
<entry>
<key>Matching Strategy</key>
<value>Contains Regular Expression</value>
</entry>
<entry>
<key>Character Set</key>
<value>UTF-8</value>
</entry>
<entry>
<key>Ignore Leading/Trailing Whitespace</key>
<value>true</value>
</entry>
<entry>
<key>Ignore Case</key>
<value>false</value>
</entry>
<entry>
<key>Grouping Regular Expression</key>
</entry>
<entry>
<key>lines</key>
<value>\d{0,2}\sabc-\w{0,2}\d{0,2}-\d{0,2}\w{0,2}\s\d{0,6}\s\d{0,2}-\w{0,3}\s\d{0,2}\:\d{0,2}\:\d{0,2}</value>
</entry>
<entry>
<key>matched</key>
<value>(\s|[ \t]|\*)\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}</value>
</entry>
</properties>
<runDurationMillis>0</runDurationMillis>
<schedulingPeriod>0 sec</schedulingPeriod>
<schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
<yieldDuration>1 sec</yieldDuration>
</config>
<executionNodeRestricted>false</executionNodeRestricted>
<name>RouteText</name>
<relationships>
<autoTerminate>false</autoTerminate>
<name>lines</name>
</relationships>
<relationships>
<autoTerminate>false</autoTerminate>
<name>matched</name>
</relationships>
<relationships>
<autoTerminate>false</autoTerminate>
<name>original</name>
</relationships>
<relationships>
<autoTerminate>false</autoTerminate>
<name>unmatched</name>
</relationships>
<state>STOPPED</state>
<style/>
<type>org.apache.nifi.processors.standard.RouteText</type>
</processors>
<processors>
<id>696c796c-aa86-3e71-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<position>
<x>64.0</x>
<y>248.0</y>
</position>
<bundle>
<artifact>nifi-standard-nar</artifact>
<group>org.apache.nifi</group>
<version>1.12.1.3.5.2.0-99</version>
</bundle>
<config>
<bulletinLevel>WARN</bulletinLevel>
<comments></comments>
<concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
<descriptors>
<entry>
<key>Routing Strategy</key>
<value>
<name>Routing Strategy</name>
</value>
</entry>
<entry>
<key>Matching Strategy</key>
<value>
<name>Matching Strategy</name>
</value>
</entry>
<entry>
<key>Character Set</key>
<value>
<name>Character Set</name>
</value>
</entry>
<entry>
<key>Ignore Leading/Trailing Whitespace</key>
<value>
<name>Ignore Leading/Trailing Whitespace</name>
</value>
</entry>
<entry>
<key>Ignore Case</key>
<value>
<name>Ignore Case</name>
</value>
</entry>
<entry>
<key>Grouping Regular Expression</key>
<value>
<name>Grouping Regular Expression</name>
</value>
</entry>
<entry>
<key>matched</key>
<value>
<name>matched</name>
</value>
</entry>
</descriptors>
<executionNode>ALL</executionNode>
<lossTolerant>false</lossTolerant>
<penaltyDuration>30 sec</penaltyDuration>
<properties>
<entry>
<key>Routing Strategy</key>
<value>Route to each matching Property Name</value>
</entry>
<entry>
<key>Matching Strategy</key>
<value>Matches Regular Expression</value>
</entry>
<entry>
<key>Character Set</key>
<value>UTF-8</value>
</entry>
<entry>
<key>Ignore Leading/Trailing Whitespace</key>
<value>true</value>
</entry>
<entry>
<key>Ignore Case</key>
<value>false</value>
</entry>
<entry>
<key>Grouping Regular Expression</key>
</entry>
<entry>
<key>matched</key>
<value>(\s|[ \t]|\*)\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}</value>
</entry>
</properties>
<runDurationMillis>0</runDurationMillis>
<schedulingPeriod>0 sec</schedulingPeriod>
<schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
<yieldDuration>1 sec</yieldDuration>
</config>
<executionNodeRestricted>false</executionNodeRestricted>
<name>RouteText</name>
<relationships>
<autoTerminate>false</autoTerminate>
<name>matched</name>
</relationships>
<relationships>
<autoTerminate>false</autoTerminate>
<name>original</name>
</relationships>
<relationships>
<autoTerminate>false</autoTerminate>
<name>unmatched</name>
</relationships>
<state>STOPPED</state>
<style/>
<type>org.apache.nifi.processors.standard.RouteText</type>
</processors>
<processors>
<id>7386cd31-1cfd-3e04-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<position>
<x>56.0</x>
<y>40.0</y>
</position>
<bundle>
<artifact>nifi-standard-nar</artifact>
<group>org.apache.nifi</group>
<version>1.12.1.3.5.2.0-99</version>
</bundle>
<config>
<bulletinLevel>WARN</bulletinLevel>
<comments></comments>
<concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
<descriptors>
<entry>
<key>File Size</key>
<value>
<name>File Size</name>
</value>
</entry>
<entry>
<key>Batch Size</key>
<value>
<name>Batch Size</name>
</value>
</entry>
<entry>
<key>Data Format</key>
<value>
<name>Data Format</name>
</value>
</entry>
<entry>
<key>Unique FlowFiles</key>
<value>
<name>Unique FlowFiles</name>
</value>
</entry>
<entry>
<key>generate-ff-custom-text</key>
<value>
<name>generate-ff-custom-text</name>
</value>
</entry>
<entry>
<key>character-set</key>
<value>
<name>character-set</name>
</value>
</entry>
<entry>
<key>mime-type</key>
<value>
<name>mime-type</name>
</value>
</entry>
</descriptors>
<executionNode>PRIMARY</executionNode>
<lossTolerant>false</lossTolerant>
<penaltyDuration>30 sec</penaltyDuration>
<properties>
<entry>
<key>File Size</key>
<value>0B</value>
</entry>
<entry>
<key>Batch Size</key>
<value>1</value>
</entry>
<entry>
<key>Data Format</key>
<value>Text</value>
</entry>
<entry>
<key>Unique FlowFiles</key>
<value>false</value>
</entry>
<entry>
<key>generate-ff-custom-text</key>
<value> Capable Qd Tx
# Name S/N Since Mas/Sla bytes
-- --------- ----- --------------- --- --- ---------
0 hub-lr1-0 35189 20-Dec 03:43:54
1 lr2-27-27 35209 20-Dec 03:43:54
2 dt27-kcd- 35185 20-Dec 03:43:54
* 3 rr1-2627- 34748 20-Dec 03:43:54 yes yes 0
4 hub-rr2-g 34609 20-Dec 03:43:54
5 hub-lr2-0 34686 20-Dec 03:43:54
6 hub-lr1-0 34631 20-Dec 03:43:54
7 hub-rr3-g 34692 20-Dec 03:43:54
8 hub-rr3-g 34568 20-Dec 03:43:54
9 hub-rr2-g 35203 20-Dec 03:43:54
10 hub-rr2-g 35200 20-Dec 03:43:54
11 hub-lr1-0 35205 20-Dec 03:43:54
12 hub-rr1-0 34394 20-Dec 03:43:54
13 hub-rr3-g 35191 20-Dec 03:43:54
14 hub-lr2-0 35196 20-Dec 03:43:54
15 hub-lr1-0 35214 20-Dec 03:43:54
16 hub-rr1-0 34577 20-Dec 03:43:54
*17 hub-rr3-g 35217 20-Dec 03:43:56
Logs for Radio IP xx.xx.xx.xx
telnet> Trying xx.xx.xx.xx...
Logs for Radio IP xx.xx.xx.xx
telnet> Trying xx.xx.xx.xx...
Connected to xx.xx.xx.xx.
Escape character is '^]'.
Capable Qd Tx
# Name S/N Since Mas/Sla bytes
-- --------- ----- --------------- --- --- ---------
0 hub-lr1-0 35189 20-Dec 03:43:54
1 hub-rr2-g 35209 20-Dec 03:43:54
2 hub-rr1-0 35185 20-Dec 03:43:54
3 hub-rr1-0 34748 20-Dec 03:43:54
4 hub-rr2-g 34609 20-Dec 03:43:54
5 hub-lr2-0 34686 20-Dec 03:43:54
6 hub-lr1-0 34631 20-Dec 03:43:54
7 hub-rr3-g 34692 20-Dec 03:43:54
8 hub-rr3-g 34568 20-Dec 03:43:54
9 hub-rr2-g 35203 20-Dec 03:43:54
10 hub-rr2-g 35200 20-Dec 03:43:54
11 hub-lr1-0 35205 20-Dec 03:43:54
12 hub-rr1-0 34394 20-Dec 03:43:54
13 hub-rr3-g 35191 20-Dec 03:43:54
14 hub-lr2-0 35196 20-Dec 03:43:54
15 hub-lr1-0 35214 20-Dec 03:43:54
16 hub-rr1-0 34577 20-Dec 03:43:54
*17 hub-rr3-g 35217 20-Dec 03:43:54 yes yes 0 </value>
</entry>
<entry>
<key>character-set</key>
<value>UTF-8</value>
</entry>
<entry>
<key>mime-type</key>
</entry>
</properties>
<runDurationMillis>0</runDurationMillis>
<schedulingPeriod>60 sec</schedulingPeriod>
<schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
<yieldDuration>1 sec</yieldDuration>
</config>
<executionNodeRestricted>false</executionNodeRestricted>
<name>GenerateFlowFile</name>
<relationships>
<autoTerminate>false</autoTerminate>
<name>success</name>
</relationships>
<state>STOPPED</state>
<style/>
<type>org.apache.nifi.processors.standard.GenerateFlowFile</type>
</processors>
<processors>
<id>e81dcc20-cb6f-3466-0000-000000000000</id>
<parentGroupId>2edcec51-fc4e-38cf-0000-000000000000</parentGroupId>
<position>
<x>64.0</x>
<y>440.0</y>
</position>
<bundle>
<artifact>nifi-update-attribute-nar</artifact>
<group>org.apache.nifi</group>
<version>1.12.1.3.5.2.0-99</version>
</bundle>
<config>
<bulletinLevel>WARN</bulletinLevel>
<comments></comments>
<concurrentlySchedulableTaskCount>1</concurrentlySchedulableTaskCount>
<descriptors>
<entry>
<key>Delete Attributes Expression</key>
<value>
<name>Delete Attributes Expression</name>
</value>
</entry>
<entry>
<key>Store State</key>
<value>
<name>Store State</name>
</value>
</entry>
<entry>
<key>Stateful Variables Initial Value</key>
<value>
<name>Stateful Variables Initial Value</name>
</value>
</entry>
<entry>
<key>canonical-value-lookup-cache-size</key>
<value>
<name>canonical-value-lookup-cache-size</name>
</value>
</entry>
</descriptors>
<executionNode>ALL</executionNode>
<lossTolerant>false</lossTolerant>
<penaltyDuration>30 sec</penaltyDuration>
<properties>
<entry>
<key>Delete Attributes Expression</key>
</entry>
<entry>
<key>Store State</key>
<value>Do not store state</value>
</entry>
<entry>
<key>Stateful Variables Initial Value</key>
</entry>
<entry>
<key>canonical-value-lookup-cache-size</key>
<value>100</value>
</entry>
</properties>
<runDurationMillis>0</runDurationMillis>
<schedulingPeriod>0 sec</schedulingPeriod>
<schedulingStrategy>TIMER_DRIVEN</schedulingStrategy>
<yieldDuration>1 sec</yieldDuration>
</config>
<executionNodeRestricted>false</executionNodeRestricted>
<name>UpdateAttribute</name>
<relationships>
<autoTerminate>false</autoTerminate>
<name>success</name>
</relationships>
<state>STOPPED</state>
<style/>
<type>org.apache.nifi.processors.attributes.UpdateAttribute</type>
</processors>
</snippet>
<timestamp>01/13/2021 16:41:18 UTC</timestamp>
</template>
Save this entire xml snippet to a file with the .xml extension and import it as a template in your NiFi.
Hope this helps,
Matt
Created 01-13-2021 11:42 AM
I am not able to reproduce the 1 file in and 7 files out that you described.
Are you sure it is only 1 FlowFile in?
As far as your use case goes, perhaps you could use RouteText first to produce the new FlowFile with all the lines containing the data you want (this includes the leading "*" and trailing "yes yes ..." strings).
To do this you would use the following regex and a Matching Strategy of "Contains Regular Expression":
\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}
Then pass that new FlowFile to a ReplaceText processor, which can trim off the leading whitespace and "*" as well as any trailing whitespace and additional text. This ReplaceText would be configured as follows:
And use a Search Value that contains three top-level capture groups (the second of which has one more group nested inside it):
(.*?)(\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2})(.*?)$
The processor replaces text line-by-line with only the second capture group.
This worked for me to get the end result of:
Which is what I believe you are looking for in the resulting FlowFile's content.
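The line-by-line replacement can be sketched outside NiFi as follows. This is an approximation, not ReplaceText itself: it uses Python's re.sub with a backreference to the second capture group, and assumes Python's regex syntax is close enough to NiFi's Java regex engine for this pattern. The helper name keep_second_group is hypothetical.

```python
import re

# The Search Value from above, unchanged.
SEARCH_VALUE = re.compile(
    r"(.*?)(\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s"
    r"\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2})(.*?)$"
)


def keep_second_group(line):
    """Replace the whole line with the text captured by group 2."""
    return SEARCH_VALUE.sub(r"\2", line)


rows = [
    "0 hub-lr1-0 35189 20-Dec 03:43:54",
    "* 3 rr1-2627- 34748 20-Dec 03:43:54 yes yes 0",
    "*17 hub-rr3-g 35217 20-Dec 03:43:56",
]
cleaned = [keep_second_group(r) for r in rows]
print("\n".join(cleaned))
```

The leading "*" (with or without a following space) falls into the first group and the trailing "yes yes 0" into the last one, so only the row itself survives.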
Hope this helps,
Matt
Created on 01-13-2021 07:08 PM - edited 01-13-2021 07:09 PM
Thanks so much for going extra mile and sharing the template file. Appreciate it very much.
However, the problem is still there. I totally agree with you that the number of matches found in one input file should not split it into that many FlowFiles, but trust me, I feel like the server is haunted and is acting weird.
Can you test one more time in your test flow by making one small change to see if you can replicate the issue?
- Instead of using the GenerateFlowFile processor, please copy the data from the regex site into a text file (*.txt)
- Delete the GenerateFlowFile processor and add a "GetFile" processor
- Point the Input Directory to the path where you saved that *.txt file with the content taken from the regex site
- Connect the "GetFile" processor to the "RouteText" processor with the settings used previously, i.e. Routing Strategy: Route to each matching Property Name and Matching Strategy: Matches Regular Expression
- Please use any of the regexes shared before
- Run the flow
Can you please let me know whether you still get one FlowFile on the matched relationship? If you do, then I am sure my office server is haunted.
I might have to write a Python script and run it from an ExecuteStreamCommand processor to extract the text from the file and write it to a location. That's the workaround that comes to mind right now.
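A minimal sketch of such a script might look like the following. It assumes the script receives the file content on standard input and that whatever it writes to standard output becomes the new content, which is how ExecuteStreamCommand is generally used. It reuses the "contains" regex suggested earlier in the thread; the function name main and the demo data are only for illustration.

```python
import io
import re
import sys

# The "contains" regex suggested earlier in this thread, unchanged.
ROW = re.compile(
    r"\d{1,2}\s\w{1,4}\-\w{1,4}\-(\w{1,2}|\w{0})\s\d{1,5}\s"
    r"\d{1,2}\-\w{1,3}\s\d{1,2}\:\d{1,2}\:\d{1,2}"
)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Copy only the matching part of each matching line to stdout."""
    for line in stdin:
        hit = ROW.search(line)
        if hit:
            # group(0) drops any leading "*" and any trailing extra text.
            stdout.write(hit.group(0) + "\n")


# Demo with in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO(
    "# Name S/N Since\n"
    "0 hub-lr1-0 35189 20-Dec 03:43:54\n"
    "*17 hub-rr3-g 35217 20-Dec 03:43:54 yes yes 0\n"
)
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

When run from the processor, the demo part would be replaced by a plain call to main() so that the real standard streams are used.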
Thanks so much again Matt for all the help.
Created 01-15-2021 05:15 AM
@Fierymech
If you clear all the FlowFiles out of your test dataflow, stop all processors, and start only your GetFile processor, how many FlowFiles get queued on the success connection out of the GetFile processor? How many "out" does it show in the GetFile stats?