Support Questions

Hafiz · 11-30-2021

Hello experts. I have a text file being read into a NiFi flow. Part of my flow is SplitText > ExtractText. I split the text line by line using the SplitText processor. The file is generated daily, and the process output reads as below:

20211129-04:00:26 RG1287.kla EOF mark not found!

20211129-04:00:55 RG9625.kla EOF mark found!

....and so on

How can I configure the ExtractText processor to produce these values as attributes on each FlowFile?

The values would be:

date:20211129

Group 1:RG1287.kla

Group 2:EOF mark not found!

Appreciate the help. Thanks!

MattWho · 12-02-2021

@Hafiz

The ExtractText processor evaluates a Java regular expression containing one or more capture groups against the inbound FlowFile's content. It then creates FlowFile attributes, named after the processor's dynamic property, that are assigned the values matched by the capture groups in that regular expression.

Above would result in FlowFiles with attributes like:
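As a rough sketch of that mapping, assume a dynamic property named `log` (a hypothetical name) whose value is the regex shown below (also an assumption, not taken from the original answer). With the default settings, ExtractText places the first capture group under the bare property name and each group, including the whole match as group 0, under an indexed name. The Python below only simulates that naming. The pattern uses syntax common to Python's `re` and Java's `java.util.regex`, so the same expression should be usable in the processor.

```python
import re

# Hypothetical regex for the ExtractText dynamic property:
# group 1 = date, group 2 = file name, group 3 = status message.
PATTERN = re.compile(r"^(\d{8})-\S+\s+(\S+)\s+(.*)$")


def extract_attributes(property_name, line):
    """Simulate the attribute names ExtractText derives from capture groups."""
    match = PATTERN.search(line)
    if match is None:
        return {}  # in NiFi the FlowFile would go to the 'unmatched' relationship
    # The first capture group is stored under the bare property name.
    attrs = {property_name: match.group(1)}
    # Every group, starting with the whole match (group 0), gets an indexed name.
    for i in range((match.lastindex or 0) + 1):
        if match.group(i) is not None:
            attrs[f"{property_name}.{i}"] = match.group(i)
    return attrs


sample = "20211129-04:00:26 RG1287.kla EOF mark not found!"
attrs = extract_attributes("log", sample)
for name, value in attrs.items():
    print(f"{name} = {value}")
```

If you would rather have attributes named exactly as in the question, one option is three separate dynamic properties (for example `date`, `group1`, `group2`, again hypothetical names), each with a regex that has a single capture group around the part you want.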

Things to keep in mind:
SplitText takes the inbound FlowFile and splits it into many FlowFiles. If you are producing a lot of splits from a single source FlowFile, it will have an impact on NiFi's heap usage during that process. As each split FlowFile is created, the FlowFile attributes/metadata for each produced FlowFile (split) is held in heap memory. Once all splits are created, all of the produced split FlowFiles are committed to the downstream relationship. Once on the relationship, NiFi can then swap them out of heap memory as needed. NiFi does this to avoid data duplication. Let's say you have a split that is in progress and NiFi dies. Since nothing has been committed to a downstream relationship yet, when NiFi is brought back online, it will reprocess the original FlowFile. If your source file is large (more than 20,000 - 50,000 splits), you can reduce heap usage by splitting it in multiple stages. For example, split by every 5,000 lines in a first SplitText and then by every 1 line in a second SplitText.
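To see why the two-stage split helps, here is a small sketch that just counts how many split FlowFiles a single SplitText execution would have to hold before committing. The 50,000-line source and the `split_lines` helper are made up for illustration; `line_split_count` stands in for the processor's Line Split Count property.

```python
def split_lines(lines, line_split_count):
    """Group lines into chunks of at most line_split_count, like one SplitText pass."""
    return [
        lines[i:i + line_split_count]
        for i in range(0, len(lines), line_split_count)
    ]


# Hypothetical source file with 50,000 lines.
source = [f"line {n}" for n in range(50_000)]

# Single stage: one execution produces every split before anything is committed.
single_stage = split_lines(source, 1)

# Two stages: first 5,000 lines per split, then 1 line per split.
first_stage = split_lines(source, 5_000)
largest_second_stage = max(len(split_lines(chunk, 1)) for chunk in first_stage)

print("single stage, splits held at once:", len(single_stage))
print("two stage, first pass splits:", len(first_stage))
print("two stage, most splits held by one second pass:", largest_second_stage)
```

The total number of one-line FlowFiles is the same either way. What changes is the most that any one execution has to keep in heap at a time.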

While NiFi does not hold FlowFile content in heap memory (although some processors will load content into heap in order to execute on it), FlowFile attributes/metadata are held in heap memory. So the more attributes/metadata that exist on a FlowFile, the more heap that FlowFile is going to use. FlowFiles are held in the connections between processor components. NiFi has a swap threshold that is applied per connection. The default is to produce swap files that contain 10,000 FlowFiles each (these swap files hold FlowFile attributes/metadata and not content, since content is not normally held in heap). The default swap threshold set in the nifi.properties file (nifi.queue.swap.threshold) is 20,000. This means the first swap file for a connection is generated when that connection reaches 20,000 queued FlowFiles on one node (in a multi-node NiFi cluster, the threshold applies per node and not across all nodes).

Just keep the above in mind when designing dataflows where you are splitting/merging, creating a lot of FlowFile attributes, or creating FlowFile attributes with large values.

If you found that this response helped with your query, please take a moment to log in and click on "Accept as Solution" below this post.

Thank you,

Matt

Cloudera Community

NIFI Extracttext Configuration